April 6, 2020

What is a Web Crawler?

A web crawler is a program that collects content from the web. Web crawlers are also known as spiders, robots, or bots.

Let’s take an example to understand what crawlers do. A website’s home page may have links to other pages like Services, About, Contact, and Career. These pages may in turn link to further pages. With the help of a crawler, we can find each and every page of that website.

Most web pages can be crawled to extract information. Not all web pages on the internet are the same; they have different elements and structures. Hence, you need to write your own web crawler for the pages you want to extract.

Why use Scrapy?

We have used the BeautifulSoup and lxml libraries for parsing HTML and XML when scraping, but Scrapy is a full-fledged application framework written specifically for building web spiders that can crawl websites and extract data from them.

Scrapy provides a built-in mechanism for extracting data (called selectors), but you can easily use BeautifulSoup (or lxml) instead. After all, they are just parsing libraries that can be imported and used from any Python code. BeautifulSoup is easy to understand and can be used for smaller tasks, while Scrapy takes some time to learn; but once you are done, you know a lot of ways to scrape a web page.
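
As a quick illustration (using a made-up HTML snippet, not part of the original post), here is how the same heading could be extracted with a Scrapy selector and with BeautifulSoup:

from scrapy.selector import Selector
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello, crawler!</h1></body></html>"  # made-up sample markup

# Scrapy's built-in selectors use CSS (or XPath) expressions
heading_scrapy = Selector(text=html).css("h1::text").get()

# BeautifulSoup does the same with its own API
heading_bs = BeautifulSoup(html, "html.parser").h1.get_text()

print(heading_scrapy, heading_bs)  # both print "Hello, crawler!"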

How did we do it using Scrapy?

Following are the steps to perform web crawling using Scrapy.

Install & Create Virtual Environment in Python

Install Virtual Environment

If you don’t have virtualenv installed, you can install it using the following command:

sudo apt install virtualenv

Create and Activate a Virtual Environment

You can create a virtual environment with the following command:

virtualenv env_name --python=python3

Once you have created the virtualenv, activate it using the following command:

source env_name/bin/activate

Install Scrapy

If you are using conda, you can install Scrapy by running:

conda install -c conda-forge scrapy

Or you can install Scrapy from PyPI with the pip command:

pip install Scrapy

Create a Scrapy Project

We first need to create a Scrapy project. To do that, run:

scrapy startproject MySpider

This will create a MySpider directory with the following content:

  • MySpider/
    • scrapy.cfg            # deploy configuration file
    • MySpider/             # project’s Python module, you’ll import your code from here
      • __init__.py
      • items.py          # project items definition file
      • middlewares.py    # project middlewares file
      • pipelines.py      # project pipelines file
      • settings.py       # project settings file
      • spiders/          # a directory where you’ll later put your spiders
        • __init__.py

Now we will create our spider file named MySpider.py. Create this file under the spiders directory at MySpider/MySpider/spiders and add the following code to it:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SpiderForBlog(CrawlSpider):
    name = "test"
    allowed_domains = ["TESTDOMAIN.com"]
    start_urls = ["https://www.TESTDOMAIN.com/"]

    # Follow every link found on the allowed domain and call parse_item for each page
    rules = (Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Append the URL of the crawled page to a text file
        with open("file_test.txt", "a+") as file1:
            file1.write(str(response.url) + '\n')

In the above code, replace TESTDOMAIN with the domain on which you want to perform the crawling.

To test it, run the following command:

scrapy crawl --nolog test

(where “test” is the name of the spider). You will get a file named “file_test.txt” in which you can find all the URLs of the domain.
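
If you want to inspect the result from Python rather than opening the file by hand, here is a minimal sketch (assuming the crawl has finished and file_test.txt is in the current directory):

# Read back the crawled URLs and report how many unique pages were found
with open("file_test.txt") as f:
    urls = [line.strip() for line in f if line.strip()]
print(len(urls), "URLs crawled;", len(set(urls)), "unique")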

If you want to access the title, the h1 element, or any other element of the page, you can do it with the following commands:

title = response.css('title::text').get()
h1 = response.css('h1::text').get()

Write this in the parse_item(self, response) function.
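
For instance, here is a minimal sketch of parse_item in SpiderForBlog that records the page title next to each URL (the tab-separated output format is just an illustration, not part of the original spider):

# Inside the SpiderForBlog class, replacing the earlier parse_item
def parse_item(self, response):
    # Extract the page title along with the URL
    title = response.css('title::text').get()
    with open("file_test.txt", "a+") as file1:
        # Write "URL <tab> title" per line; title may be None on some pages
        file1.write(response.url + '\t' + (title or '') + '\n')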

Storing the Output in a Database - MySQL

Now create another spider named MySpiderDB.py in the MySpider/MySpider/spiders directory.

In this spider file, we will store all the URLs of the given domain in the database.

In order to store data in the database, you first need to install MySQL on your machine.

Once you have successfully installed MySQL, install the MySQL connector for Python. You can install the connector in your virtualenv. Below is the command to install it:

pip install mysql-connector-python

Now create a database in MySQL and a new table with a single field named `url_name`. Give the field a UNIQUE constraint so that INSERT IGNORE can skip duplicate URLs later.
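
If you prefer to set the table up from Python instead of the MySQL shell, here is a minimal sketch using mysql-connector; the credentials and the DB_NAME / TABLE_NAME identifiers are placeholders, matching the spider below:

import mysql.connector

# Placeholder credentials; replace with your own
conn = mysql.connector.connect(host='localhost', user='USERNAME', password='YOUR_PASSWORD')
cur = conn.cursor()

# Create the database and a table with a single url_name column.
# The UNIQUE key is what lets INSERT IGNORE skip duplicate URLs.
cur.execute("CREATE DATABASE IF NOT EXISTS DB_NAME")
cur.execute(
    "CREATE TABLE IF NOT EXISTS DB_NAME.TABLE_NAME ("
    "  url_name VARCHAR(191) NOT NULL,"
    "  UNIQUE KEY uq_url_name (url_name)"
    ")"
)
conn.commit()
conn.close()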

After installing everything, add the following code to MySpiderDB.py:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

import mysql.connector
from mysql.connector import errorcode

# MySQL connection
try:
    # mydb and cursor are module-level variables shared by the spider below;
    # use your own database credentials and database name for the connection
    mydb = mysql.connector.connect(host='LOCALHOST', user='USERNAME', password='YOUR_PASSWORD', database='DB_NAME')
    # The cursor is used to execute queries and iterate through result sets
    cursor = mydb.cursor(buffered=True)
    # cursor.execute("set names utf8;")
    print("Connected")
except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your username or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)


class MySpider(CrawlSpider):
    name = "test_db"
    allowed_domains = ["TESTDOMAIN.com"]
    start_urls = ["https://www.TESTDOMAIN.com/"]

    # Follow every link found on the allowed domain and store each URL in the database
    rules = (Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_item(self, response):
        url = str(response.url)
        print("Value to be inserted is ===> ", url)
        # A parameterized query avoids quoting problems in the URL string;
        # INSERT IGNORE skips URLs that are already in the table
        myquery = 'INSERT IGNORE INTO `DB_NAME`.`TABLE_NAME` (`url_name`) VALUES (%s)'
        cursor.execute(myquery, (url,))
        mydb.commit()

As before, replace TESTDOMAIN in the above code with the domain you want to crawl.

Now check your MySQL table; you will find all the URLs of the domain in it. Because `url_name` has a UNIQUE key, INSERT IGNORE will not store duplicate URLs.
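
As a quick check from Python, here is a minimal sketch that counts the stored URLs (again using placeholder credentials and names):

import mysql.connector

# Placeholder credentials; same DB_NAME and TABLE_NAME as in the spider above
conn = mysql.connector.connect(host='localhost', user='USERNAME', password='YOUR_PASSWORD', database='DB_NAME')
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM `TABLE_NAME`")
print(cur.fetchone()[0], "URLs stored")
conn.close()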

