During web development, we occasionally need to scrape a website, and not just one page but many. For such a requirement, usually because the host site offers no other way to get the data, we naturally tend to reach for headless browsers such as PhantomJS or CasperJS. The choice seems obvious, but is it? Do we really need a headless browser to scrape every website?
How To Confirm Whether The Website Needs A Headless Browser
1. Rule Out Single Page Application (SPA)
If the website to be scraped is an SPA, or even if it isn't but still fetches the page's contents via API calls (say, using jQuery), then it cannot be scraped without a headless browser. This is usually evident while using the app, but you can confirm it by opening the site in Google Chrome, opening the developer console, and checking the Network tab. If API calls are being made to fetch JSON content, it's an SPA.
2. Request The Webpage With A Script And Observe Its Content
Create a js file, say confirm.js, paste the code below (install request first with npm install request), change the url to the website you want to scrape, and save.
const request = require('request');

// Request the page and print whatever the server returns
request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else {
        console.log(body);
    }
});
Now, in the terminal, run node confirm.js.
Then observe the returned contents and search for one of the keywords/expressions that you want to scrape (a specific name, address, phone number, etc.). You can do this in the terminal with Ctrl + F (or Cmd + F on Mac). If you manage to find it, the site is scrapable without a headless browser; otherwise it is not.
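If you would rather automate this check, here is a minimal sketch that searches the returned body for the keyword programmatically (the keyword "John Doe" is a placeholder; substitute something you expect to appear on the page):

const request = require('request');

// Placeholder keyword expected somewhere on the rendered page
const keyword = "John Doe";

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body && body.includes(keyword)) {
        console.log("Keyword found: scrapable without a headless browser");
    }
    else {
        console.log("Keyword not found: a headless browser is probably needed");
    }
});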
How To Scrape
Now that the site has passed the scrapability test, let's actually scrape it. First, install cheerio in addition to request.
Commonly, scrapable websites structure their data in one of two ways, and each needs to be scraped differently.
1- Data In HTML
This is the traditional way, and it is straightforward to scrape: the HTML comes fully prepared from the server and simply renders in the browser.
For this, run the following code.
const cheerio = require('cheerio');
const request = require('request');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body) {
        // Load the HTML into cheerio for jQuery-style traversal
        const $ = cheerio.load(body);
        const userData = {
            name: $('.user-info .contact .name').text(),
            phone: $('.user-info .contact .phone').text(),
            address: $('.user-info .contact .address').text()
        };
        console.log(userData);
    }
    else {
        console.log("No body");
    }
});
Let’s go through the above script:
- Request the webpage
- Check for an error
- Confirm the body exists, and load it with cheerio into $
- Extract the name, phone, and address using CSS selectors, as you would with jQuery. (You need to identify the correct selectors for each piece of information by inspecting the webpage's HTML.)
And we are done!
2- Data In Script Tag
There are websites where the data does not arrive as fully rendered HTML; instead it comes as JSON inside a script tag, from where browser-side JavaScript picks it up and inserts it into the DOM. Usually, it takes this form:
<div id="root">...</div>
<script>
window.__PRELOADED_STATE__ = { /* JSON data that browser-side JavaScript inserts into #root */ }
</script>
If you have run the second scrapability test, it will still pass, because the data, though inside a script tag, still arrives in the body. However, scraping it is a little different, as shown below.
const request = require('request');
const cheerio = require('cheerio');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body) {
        const $ = cheerio.load(body);
        try {
            // Grab the script tag right after #root, strip the assignment,
            // and parse the remaining JSON
            const __PRELOADED_STATE__ = JSON.parse(
                $('#root').next().text().replace("window.__PRELOADED_STATE__ = ", "")
            );
            const profile = __PRELOADED_STATE__.profile;
            const p = {
                phone: profile.contactInformation.phone,
                name: profile.names.primary,
                address: profile.addresses.primary
            };
            console.log(p);
        }
        catch (e) {
            console.log("Error parsing profile", e);
        }
    }
    else {
        console.log("No body");
    }
});
- Get the webpage
- Look for an error
- Confirm the body exists, and load it with cheerio into $
- In the try block, extract the text of the element next to #root (which is the script tag we are interested in; the script tag in the page you scrape may be placed differently, so you may need to identify another way to reach it, as sketched after this list), and replace the assignment window.__PRELOADED_STATE__ = with nothing. We are left with pure JSON, which we parse with JSON.parse
- catch if anything goes wrong with parsing
- Get the relevant values from the JSON!
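Since $('#root').next() relies on the script tag sitting right after #root, here is a minimal sketch of a position-independent alternative: scan every script tag for the window.__PRELOADED_STATE__ assignment (it assumes the same hypothetical JSON shape as above):

const request = require('request');
const cheerio = require('cheerio');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error || !body) {
        return console.log(error || "No body");
    }
    const $ = cheerio.load(body);
    const marker = "window.__PRELOADED_STATE__ = ";
    let state = null;
    // Check every script tag, wherever it sits in the DOM
    $('script').each(function () {
        const text = $(this).html() || '';
        if (text.includes(marker)) {
            try {
                state = JSON.parse(text.replace(marker, ""));
            }
            catch (e) {
                console.log("Error parsing state", e);
            }
        }
    });
    if (state) {
        console.log(state.profile); // same hypothetical JSON shape as above
    }
});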
Advantages
Two notable advantages of this technique over headless browsers:
- It’s super fast
- It takes far less processing power and fewer resources
Using A Proxy Server
You must be cautious when making many calls to a particular website for scraping, as this will soon result in captchas or 403 Forbidden responses from the website's server.
A proxy service rotates the originating IP of each and every call you make, even if they are fired simultaneously. For multiple or parallel calls, therefore, using a proxy service is the only practical way to avoid being detected and blocked by the website.
Some big names in the proxy service industry that you can consider include Trusted Proxies and NetNut. I have used both with satisfactory results.
To use a proxy server, simply pass it as the proxy value in the request options. The above example becomes:
const options = {
    proxy: 'http://bigg-xx-xxxx.xx-xx.com', // your proxy server's URL
    method: 'GET',
    url: 'https://www.sitetoscrape.com/resource/1'
};

request(options, function (error, response, body) {
    //...
});
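And for multiple parallel calls through the proxy, a minimal sketch along the same lines (the resource IDs and the proxy URL are placeholders):

const request = require('request');

// Placeholder resource IDs fetched in parallel through the proxy
const ids = [1, 2, 3];
ids.forEach(function (id) {
    const options = {
        proxy: 'http://bigg-xx-xxxx.xx-xx.com', // placeholder proxy URL
        method: 'GET',
        url: 'https://www.sitetoscrape.com/resource/' + id
    };
    request(options, function (error, response, body) {
        if (error) {
            console.log(error);
        }
        else {
            console.log('Fetched resource ' + id);
        }
    });
});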