This post is part 1 of a web scraping series.
The Bolster Research Team evaluated the performance and behavior of various Node.js libraries for web scraping. In this post, we look at how different Node.js libraries can be used to implement web scraping.
In subsequent posts, we will analyze these Node.js libraries in more detail and provide tips on how to use web scraping technology to strengthen your cybersecurity program.
What is Web Scraping?
Before we dive into web scraping, it’s important to have background knowledge of HTML (HyperText Markup Language), the DOM (Document Object Model), and JavaScript (JS). We recommend familiarizing yourself with developer resources to get started!
But let’s jump into it!
Web scraping is about extracting information from web pages. A website can consist of various types of information, including text, images, audio, videos, scripts, and forms.
It’s important to distinguish web scraping from other data-extraction techniques.
Crawling vs web scraping
When we want to search for information, we crawl. When we want to extract information, we scrape. Web crawling means moving through links or URLs to discover pages, while web scraping means extracting information from a particular page or website.
Consider the following example: you want to find a person’s contact information from a website. Crawling can help find a specific page, like a contact page or about us page, and scraping can help get the contact information of the person.
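The contact-page example can be sketched with nothing but the Node.js standard library. This is a minimal illustration, not a production approach: the sample HTML, the `example.com` URLs, and the helper names `findLinks` and `findEmails` are all hypothetical, and the regular expressions are only adequate for a toy page like this one.

```javascript
// Hypothetical sample page; in practice this HTML would come from an HTTP response.
const sampleHtml = `
<html>
  <head><title>Example Co</title></head>
  <body>
    <a href="/about">About us</a>
    <a href="/contact">Contact</a>
    <p>Email: jane@example.com</p>
  </body>
</html>`;

// Crawling step: collect the links on a page so we know where to go next.
function findLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*href="([^"]*)"/gi;
  for (const match of html.matchAll(hrefPattern)) {
    // Resolve relative paths such as "/contact" against the page URL.
    links.push(new URL(match[1], baseUrl).href);
  }
  return links;
}

// Scraping step: pull one specific kind of information out of a page.
function findEmails(html) {
  return html.match(/[\w.+-]+@[\w-]+\.[\w.-]+/g) ?? [];
}

const links = findLinks(sampleHtml, "https://example.com/");
const contactPage = links.find((link) => link.includes("contact"));

console.log(contactPage);            // the page a crawler would visit next
console.log(findEmails(sampleHtml)); // the data a scraper would extract
```

A real crawler would fetch `contactPage` and repeat the process, and a real scraper would use a proper HTML parser such as the libraries discussed in the rest of this post.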
How does automation factor into the web scraping conversation?
When reading about web crawling and scraping, we often encounter the term “web automation”. Once scraping is in place, we can automate tasks like form submission, data extraction, testing, and validation. We will discuss some web automation techniques in future posts.
We will use various Node.js libraries to demonstrate a quick implementation of scraping. In this article, we will scrape the content of the title tag using each of them.
Web scraping and JSDOM
As per the official documentation, jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. In general, the goal of the project is to emulate enough of a subset of a web browser to be useful for testing and scraping real-world web applications.
Cheerio
As per its official guide, Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does.
Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript, which is common for a SPA (single-page application).
This makes Cheerio much, much faster than other solutions. If your use case requires any of this functionality, you should consider browser automation software like Puppeteer and Playwright, or DOM emulation projects like jsdom.
Playwright
As per its official guide, Playwright can be used either as part of the Playwright Test test runner or as the Playwright Library.
Playwright Test was created specifically to accommodate the needs of end-to-end testing. It does everything you would expect from a regular test runner and more. Playwright Test allows you to:
- Run tests across all browsers.
- Execute tests in parallel.
- Enjoy context isolation out of the box.
- Capture videos, screenshots, and other artifacts on failure.
- Integrate your POMs (Page Object Models) as extensible fixtures.
Puppeteer
As per the official guide, Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
Most things that you can do manually in the browser can be done using Puppeteer.
Here are a few examples to get you started:
- Generate screenshots and PDFs of pages
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e., “SSR” (Server-Side Rendering))
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features
- Capture a timeline trace of your site to help diagnose performance issues
- Test Chrome Extensions
Other scraping libraries
During our quick run, we evaluated libraries that provide APIs to extract the title from the document at a requested URL. Other libraries you can try for scraping include Osmosis and X-ray, which are better equipped with testing components.
There are also popular and more advanced automation tools like Cypress and Selenium. The terms web crawling, web scraping, and web automation are often used interchangeably, but functionally they differ considerably.
There are various paid scraping options as well. These provide dashboards and tools to scrape websites. A quick search will turn up several options.
Selection and comparison of scraping libraries
We will compare these libraries in upcoming articles. When choosing a scraping library, it is important to consider the following points:
- APIs
- Features
- Performance
- Stability
- Active Community
About Us
Bolster is the only automated digital risk protection platform in the world that detects, analyses, and takes down fraudulent sites and content across the web, social media, app stores, marketplaces, and the dark web.