[Part 1] - Understanding/Evaluating Web Scraping, Crawling, and Automation with NodeJS Libraries

This article is part of Open-Source Bolster Engineering and Research, an effort to evaluate the performance and workings of the NodeJS libraries available for web scraping. In Part 1, we look at how different NodeJS libraries can be used to implement web scraping. In subsequent posts, we will analyze and compare these libraries in more detail.

Prerequisite

A good knowledge of HTML (HyperText Markup Language), the DOM (Document Object Model), and JS (JavaScript) is recommended.

If you need a refresher before reading this article, the following link is a good place to start.

https://developer.mozilla.org/

So, let’s begin here!

Web scraping is about extracting information from web pages. A website can contain various types of information, including text, images, audio, videos, scripts, and forms. Before we begin, let us clarify the difference between crawling and scraping.

Web Crawling vs Web Scraping

Crawling is how we find information; scraping is how we extract it. Web crawling means moving through links or URLs to discover pages, while web scraping means extracting information from a particular page or website.

Consider the following example: you want to find a person's contact information on a website. Crawling helps locate the relevant page, such as a contact page or an "About Us" page, and scraping then pulls the person's contact information out of that page.
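To make the distinction concrete, here is a minimal sketch that uses only built-in JavaScript features. The function names `findLinks` and `scrapeTitle`, the sample markup, and the `example.com` URLs are assumptions made for illustration. The regular expressions are deliberately naive and are not a substitute for a real HTML parser.

```javascript
// Crawling step: collect the URLs a page links to, resolved against the page URL.
// NOTE: a naive regex for illustration only; it handles simple double-quoted hrefs.
function findLinks(html, pageUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*href="([^"]*)"/gi;
  for (const match of html.matchAll(hrefPattern)) {
    links.push(new URL(match[1], pageUrl).href);
  }
  return links;
}

// Scraping step: extract one piece of information (here, the <title> text).
function scrapeTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Hypothetical page content, standing in for a downloaded response body.
const sampleHtml = `
<html>
  <head><title>Example Inc.</title></head>
  <body>
    <a href="/about">About us</a>
    <a href="/contact">Contact</a>
  </body>
</html>`;

const links = findLinks(sampleHtml, 'https://example.com/');
const contactPage = links.find((link) => link.includes('contact'));

console.log(links);                   // pages a crawler could visit next
console.log(contactPage);             // the page worth scraping for contact details
console.log(scrapeTitle(sampleHtml)); // the scraped value
```

In a real crawler, the discovered links would be downloaded in turn and each page passed to the scraping step.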

Have you heard of Web Automation?

When reading about web crawling and scraping, we often encounter the term "web automation". Web automation means driving a website programmatically, so that tasks like form submission, data extraction, testing, and validation can run without manual effort. We will discuss some web automation techniques in the upcoming articles.

We will use several NodeJS libraries to demonstrate how quickly scraping can be implemented. In this article, we will scrape the content of the title tag with each of them.

JSDOM

As per the official documentation, jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.
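As a minimal sketch of scraping the title tag with jsdom, assuming the package has been installed with `npm install jsdom`: the function names `titleWithJsdom` and `titleFromUrlWithJsdom` are hypothetical, while `JSDOM` and `JSDOM.fromURL` come from the library itself.

```javascript
// Returns the text of the <title> element of an HTML string, using jsdom.
function titleWithJsdom(html) {
  // Required inside the function so that loading this file alone
  // does not need the package to be present.
  const { JSDOM } = require('jsdom');
  const dom = new JSDOM(html);
  return dom.window.document.title;
}

// For a live page, jsdom can download and parse the URL itself.
async function titleFromUrlWithJsdom(url) {
  const { JSDOM } = require('jsdom');
  const dom = await JSDOM.fromURL(url);
  return dom.window.document.title;
}
```

A call such as `titleFromUrlWithJsdom('https://example.com/').then(console.log)` would print the page title. Note that jsdom does not run the page's scripts unless it is explicitly configured to, so a title set by client-side JavaScript may be missed by this sketch.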

Cheerio

As per its official guide, Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript, which is common for a SPA (single-page application). This makes Cheerio much, much faster than other solutions. If your use case requires any of this functionality, you should consider browser automation software like Puppeteer and Playwright or DOM emulation projects like jsdom.
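A minimal sketch of the same title extraction with Cheerio, assuming `npm install cheerio` and Node.js 18 or later for the global `fetch`. The function names are hypothetical. Because Cheerio only parses markup, this sketch downloads the page as a separate step.

```javascript
// Returns the text of the <title> element of an HTML string, using Cheerio.
function titleWithCheerio(html) {
  // Required inside the function so that loading this file alone
  // does not need the package to be present.
  const cheerio = require('cheerio');
  const $ = cheerio.load(html); // parse the markup; no scripts are executed
  return $('title').text();
}

// Download the page with the built-in fetch, then hand the markup to Cheerio.
async function titleFromUrlWithCheerio(url) {
  const response = await fetch(url);
  const html = await response.text();
  return titleWithCheerio(html);
}
```

A call such as `titleFromUrlWithCheerio('https://example.com/').then(console.log)` would print the title found in the downloaded markup.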

Playwright

As per its official guide, Playwright can be used either as part of the Playwright Test test runner or as the Playwright Library.

Playwright Test was created specifically to accommodate the needs of end-to-end testing. It does everything you would expect from a regular test runner and more. Playwright Test allows you to:

  • Run tests across all browsers.
  • Execute tests in parallel.
  • Enjoy context isolation out of the box.
  • Capture videos, screenshots, and other artifacts on failure.
  • Integrate your POMs (Page Object Models) as extensible fixtures.
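For scraping, the Playwright Library is the relevant entry point rather than the test runner. The following is a minimal sketch of reading a page title with it, assuming `npm install playwright` and that a Chromium build has been downloaded with `npx playwright install chromium`. The function name `titleWithPlaywright` is hypothetical.

```javascript
// Opens the URL in a real (headless) Chromium browser and returns the page title.
async function titleWithPlaywright(url) {
  // Required inside the function so that loading this file alone
  // does not need the package to be present.
  const { chromium } = require('playwright');
  const browser = await chromium.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // Always release the browser process, even if navigation fails.
    await browser.close();
  }
}
```

A call such as `titleWithPlaywright('https://example.com/').then(console.log)` would print the title. Since a full browser renders the page, the page's own JavaScript runs before the title is read.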

Puppeteer

As per the official guide, Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

What can Puppeteer do?

Most things that you can do manually in the browser can be done using Puppeteer!

Here are a few examples to get you started:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e., "SSR" (Server-Side Rendering)).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome Extensions.
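As a minimal sketch of the title extraction with Puppeteer, assuming `npm install puppeteer` (which by default also downloads a compatible browser build): the function name `titleWithPuppeteer` is hypothetical, and the `waitUntil` setting is one reasonable choice rather than a requirement.

```javascript
// Opens the URL in headless Chrome and returns the page title.
async function titleWithPuppeteer(url) {
  // Required inside the function so that loading this file alone
  // does not need the package to be present.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    // Wait only until the DOM has been parsed; the title is available by then
    // unless the page sets it later from a script.
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Always release the browser process, even if navigation fails.
    await browser.close();
  }
}
```

A call such as `titleWithPuppeteer('https://example.com/').then(console.log)` would print the title.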

Other Scraping Libraries

During our quick run, we evaluated libraries that provide APIs to extract the title from the document at a requested URL. There are other libraries that you can try for scraping, such as Osmosis and X-Ray, which are better equipped with testing components. There are also popular and advanced automation tools like Cypress and Selenium. The terms web crawling, web scraping, and web automation are often used interchangeably, but functionally they differ considerably.

There are also various paid scraping options. These provide dashboards and tools to scrape websites. A simple search will turn up multiple options.

Selection and Comparison of Scraping Libraries

We will compare these libraries in upcoming articles. When choosing a scraping library, it is important to consider the following points:

1) APIs

2) Features

3) Performance

4) Stability

5) Active Community


About Us

Bolster is the only automated digital risk protection platform in the world that detects, analyses, and takes down fraudulent sites and content across the web, social media, app stores, marketplaces, and the dark web.
