Web scraping is the process of extracting data from websites using scripts or automation tools. In this article, we discuss in detail how to scrape reviews and information about doctors from various medical websites using Scrapy and Selenium.
Web Scraping – Purpose
The list of activities one can perform using web scraping is endless. We have listed some of the most common ones below.
Job Aggregation: People are actively looking for jobs, and companies are looking to hire suitable talent. The problem is that there are many job boards, each with a large number of listings. Scraping the job links and titles into a single place lets a job seeker find all the details in one location.
Scraping Reviews: Reviews help businesses know their customers better. Scraping them gives a deeper understanding of customer sentiment and helps improve services.
Brand Monitoring: In today’s highly competitive market, it’s a top priority to protect your online reputation. Whether you sell your products online and have a strict pricing policy that you need to enforce or just want to know how people perceive your products online, brand monitoring with web scraping can give you this kind of information.
Price Monitoring: Price monitoring is a very common yet useful technique that we can use to automate the process of checking prices on various websites.
Web Scraping – Tools
Scrapy is a Python crawling framework used to extract data from web pages with the help of selectors based on XPath or CSS.
Selenium is a UI automation tool that can also be used for web scraping. Scrapy is a very powerful web scraping framework, but it has limitations. For example, if we need to extract mobile numbers from healthgrades.com or similar sites where the number is displayed only after the user clicks a “Show mobile number” button, we need Selenium to execute the click event.
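A sketch of that click-to-reveal flow with Selenium is shown below. The button label, the `.phone-number` selector, and the URL are all assumptions for illustration; a real page needs its own locators, and running the crawl requires a local browser driver (the Selenium import is kept inside the function for that reason):

```python
def show_number_xpath(label: str) -> str:
    """Build an XPath matching a button by its visible text (hypothetical markup)."""
    return f'//button[contains(normalize-space(.), "{label}")]'


def fetch_mobile_number(url: str) -> str:
    """Open the page, click the 'Show mobile number' button, return the revealed text.

    Requires Selenium and a local ChromeDriver; locators are illustrative only.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, 10)
        button = wait.until(
            EC.element_to_be_clickable((By.XPATH, show_number_xpath("Show mobile number")))
        )
        button.click()  # triggers the JavaScript that reveals the number
        number = wait.until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, ".phone-number"))
        )
        return number.text
    finally:
        driver.quit()
```

In a combined pipeline, Scrapy handles the bulk crawl and Selenium is invoked only for the few pages that need interaction, since driving a real browser is far slower than plain HTTP requests.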
MongoDB – The scraped information such as the review text, date, rating, reviewer details, physician details, etc., is stored in a NoSQL database like MongoDB. MongoDB is an open-source document database and a leading NoSQL database.
MongoDB uses the following hierarchy of artifacts to store information:
The database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.
A collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purposes.
The document is a set of key-value pairs. Documents have a dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.
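To make the hierarchy concrete, the sketch below shows what one scraped-review document might look like and how it would be inserted with PyMongo. The field names, database/collection names, and connection URI are assumptions, not a real schema; the insert is wrapped in a function because it needs a running MongoDB server:

```python
from datetime import datetime, timezone

# Hypothetical shape of one scraped-review document; the keys are
# illustrative, chosen to mirror the fields listed above.
review_doc = {
    "physician": {"name": "Dr. Jane Doe", "specialty": "Cardiology"},
    "reviewer": "patient_123",
    "rating": 4.5,
    "review_text": "Very attentive and thorough.",
    "review_date": datetime(2023, 5, 1, tzinfo=timezone.utc),
    "source_url": "https://www.example.com/doctor/jane-doe",
}


def store_review(doc, uri="mongodb://localhost:27017"):
    """Insert one review into the `reviews` collection (needs a running MongoDB)."""
    from pymongo import MongoClient  # requires the pymongo package

    client = MongoClient(uri)
    result = client["scraping"]["reviews"].insert_one(doc)
    return result.inserted_id
```

Because collections do not enforce a schema, a second document could add or omit fields (say, a `clinic_address`) without any migration step.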
The following is a class diagram of how the scraped data is stored in the Mongo database.
- First, we identify the target website and collect the URLs of the pages from which we want to extract data.
- Once we make requests to these URLs, our Scrapy spiders crawl the websites and return the HTML of each page; we then use locators to find the data in that HTML.
- Once the data is extracted, we can save it to our database. To extract data from the HTML, we can use either CSS selectors or XPath.
- Scrapy comes with many built-in features for formatting the extracted data and storing it in SQL or NoSQL databases without hassle.
Scraped Data – Sample Screenshot
At OptiSol, we perform web scraping as a 3-stage model.
Initially, our BI analyst team identifies the source applications from which data needs to be extracted or scraped.
Web scraping is then performed to extract and transfer data from the website to a new datastore.
The data fetched from multiple source systems may be structured or unstructured.
Finally, the extracted data is cleaned and validated before being loaded into a common database.