During web development, we occasionally need to scrape a website, and not just one page but many. For such a requirement, usually because the host site offers no other way to get the data, we naturally tend to reach for headless browsers such as PhantomJS or CasperJS. The choice seems obvious, but is it? Do we really need a headless browser to scrape every website?
How To Confirm Whether The Website Needs A Headless Browser
1. Rule Out Single Page Application (SPA)
If the website to be scraped is an SPA, or even if it isn't but still fetches the page's contents via API calls (say, using jQuery), then it cannot be scraped without a headless browser. This is usually evident while using the app, but you can confirm it by opening the site in Google Chrome, opening the developer console, and checking the Network tab. If API calls are being made to fetch JSON content, it's an SPA.
2. Request The Webpage With A Script And Observe Its Content
Create a js file, say confirm.js, paste the code below (install request first with npm install request), change the url to the website you want to scrape, and save.
const request = require('request');

// Request the page and print whatever the server returns
request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else {
        console.log(body);
    }
});
Now, in the terminal, run node confirm.js.
Then observe the returned contents and search for one of the keywords/expressions that you want to scrape (a specific name, address, phone number, etc.). You can do this in the terminal with Ctrl + F (or Cmd + F on Mac). If you manage to find it, the site is scrapable without a headless browser; otherwise it is not.
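If you would rather automate this check, here is a minimal sketch that searches the returned body for the keyword programmatically (the keyword "John Doe" is a placeholder; substitute something you expect to appear on the page):

const request = require('request');

// Placeholder keyword expected somewhere on the rendered page
const keyword = "John Doe";

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body && body.includes(keyword)) {
        console.log("Keyword found: scrapable without a headless browser");
    }
    else {
        console.log("Keyword not found: a headless browser is probably needed");
    }
});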
How To Scrape
Now that the site has passed the scrapability test, let's actually scrape it. First, install cheerio in addition to request.
Commonly, scrapable websites structure their data in one of two ways, and each needs to be scraped differently.
1- Data In HTML
This is the traditional way, and it is straightforward to scrape: the HTML comes fully prepared from the server and simply renders in the browser.
For this, run the following code.
const cheerio = require('cheerio');
const request = require('request');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body) {
        // Load the HTML into cheerio for jQuery-style traversal
        const $ = cheerio.load(body);
        const userData = {
            name: $('.user-info .contact .name').text(),
            phone: $('.user-info .contact .phone').text(),
            address: $('.user-info .contact .address').text()
        };
        console.log(userData);
    }
    else {
        console.log("No body");
    }
});
Let’s go through the above script:
- Request the webpage
- Check for an error
- Confirm the body exists, and load it with cheerio into $
- Extract the name, phone, and address using CSS selectors, as you would with jQuery. (You need to identify the correct selectors for each piece of information by inspecting the webpage's HTML.)
And we are done!
2- Data In Script Tag
There are websites where the data does not arrive as fully rendered HTML; instead it comes as JSON inside a script tag, from where browser-side JavaScript picks it up and inserts it into the DOM. Usually, it takes this form:
<div id="root">...</div>
<script>
window.__PRELOADED_STATE__ = { /* JSON data that browser-side JavaScript inserts into #root */ }
</script>
If you have run the second scrapability test, it will still pass, because the data, though inside a script tag, still arrives in the body. However, scraping it is a little different, as shown below.
const request = require('request');
const cheerio = require('cheerio');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error) {
        console.log(error);
    }
    else if (body) {
        const $ = cheerio.load(body);
        try {
            // Grab the script tag right after #root, strip the assignment,
            // and parse the remaining JSON
            const __PRELOADED_STATE__ = JSON.parse(
                $('#root').next().text().replace("window.__PRELOADED_STATE__ = ", "")
            );
            const profile = __PRELOADED_STATE__.profile;
            const p = {
                phone: profile.contactInformation.phone,
                name: profile.names.primary,
                address: profile.addresses.primary
            };
            console.log(p);
        }
        catch (e) {
            console.log("Error parsing profile", e);
        }
    }
    else {
        console.log("No body");
    }
});
- Get the webpage
- Look for an error
- Confirm the body exists, and load it with cheerio into $
- In the try block, extract the text of the element next to #root (which is the script tag we are interested in; the script tag in the page you scrape may be placed differently, so you may need to identify another way to reach it, as sketched after this list), and replace the assignment window.__PRELOADED_STATE__ = with nothing. We are left with pure JSON, which we parse with JSON.parse
- catch if anything goes wrong with parsing
- Get the relevant values from the JSON!
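Since $('#root').next() relies on the script tag sitting right after #root, here is a minimal sketch of a position-independent alternative: scan every script tag for the window.__PRELOADED_STATE__ assignment (it assumes the same hypothetical JSON shape as above):

const request = require('request');
const cheerio = require('cheerio');

request("https://www.sitetoscrape.com/resource/1", function (error, response, body) {
    if (error || !body) {
        return console.log(error || "No body");
    }
    const $ = cheerio.load(body);
    const marker = "window.__PRELOADED_STATE__ = ";
    let state = null;
    // Check every script tag, wherever it sits in the DOM
    $('script').each(function () {
        const text = $(this).html() || '';
        if (text.includes(marker)) {
            try {
                state = JSON.parse(text.replace(marker, ""));
            }
            catch (e) {
                console.log("Error parsing state", e);
            }
        }
    });
    if (state) {
        console.log(state.profile); // same hypothetical JSON shape as above
    }
});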
Advantages
Two notable advantages of this technique over headless browsers:
- It’s super fast
- It takes far less processing power and fewer resources
Using A Proxy Server
You must be cautious when making many calls to a particular website for scraping, as this will soon result in captchas or 403 Forbidden responses from the website's server.
A proxy service rotates the originating IP of each and every call you make, even if they are fired simultaneously. For multiple or parallel calls, therefore, using a proxy service is the only practical way to avoid being detected and blocked by the website.
Some big names in the proxy service industry that you can consider include Trusted Proxies and NetNut. I have used both with satisfactory results.
To use a proxy server, simply pass it as the proxy value in the request options. The above example becomes:
const options = {
    proxy: 'http://bigg-xx-xxxx.xx-xx.com', // your proxy server's URL
    method: 'GET',
    url: 'https://www.sitetoscrape.com/resource/1'
};

request(options, function (error, response, body) {
    //...
});
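And for multiple parallel calls through the proxy, a minimal sketch along the same lines (the resource IDs and the proxy URL are placeholders):

const request = require('request');

// Placeholder resource IDs fetched in parallel through the proxy
const ids = [1, 2, 3];
ids.forEach(function (id) {
    const options = {
        proxy: 'http://bigg-xx-xxxx.xx-xx.com', // placeholder proxy URL
        method: 'GET',
        url: 'https://www.sitetoscrape.com/resource/' + id
    };
    request(options, function (error, response, body) {
        if (error) {
            console.log(error);
        }
        else {
            console.log('Fetched resource ' + id);
        }
    });
});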