This blog is part 1 of a web scraping series.
The Bolster Research Team evaluated how various NodeJS web scraping libraries work and how well they perform. Throughout this blog, we will see how to use different NodeJS libraries to implement web scraping.
In the subsequent blog posts, we will analyze various NodeJS libraries, and provide tips on how to use web scraping technology to strengthen your cybersecurity program.
But let’s jump into it!
What is Web Scraping?
Web scraping is about extracting information from web pages. A website can consist of various types of information, including text, images, audio, videos, scripts, and forms.
It’s important to distinguish web scraping from other data-extraction techniques.
Crawling vs web scraping
Crawling is how we find information; scraping is how we extract it. Web crawling means moving from page to page by following links or URLs, while web scraping means extracting information from a particular page or website.
Consider the following example: you want to find a person’s contact information on a website. Crawling can help find a specific page, like a contact page or an “About Us” page, and scraping can then extract the person’s contact information from that page.
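To make the distinction concrete, here is a minimal sketch that uses only built-in NodeJS features. The sample HTML, the example.com URLs, and the helper names extractLinks and extractTitle are all hypothetical and exist only for illustration. The regular expressions are a deliberate simplification: they are good enough for this tiny page, but real-world HTML usually calls for a proper parser.

```javascript
// Hypothetical sample page; in practice the HTML would come from an HTTP response.
const sampleHtml = `
<html>
  <head><title>Contact Us | Example Corp</title></head>
  <body>
    <a href="/about">About us</a>
    <a href="/contact">Contact</a>
    <a href="https://other.example.org/news">News</a>
  </body>
</html>`;

// Crawling step: discover the URLs that a page links to.
// Simplification: only matches double-quoted href attributes on <a> tags.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*href="([^"]+)"/gi;
  for (const match of html.matchAll(hrefPattern)) {
    // Resolve relative paths such as "/contact" against the page URL.
    links.push(new URL(match[1], baseUrl).href);
  }
  return links;
}

// Scraping step: pull one piece of information out of a page.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Crawl: find the page we care about among the discovered links.
const links = extractLinks(sampleHtml, "https://example.com/");
const contactPage = links.find((url) => url.includes("contact"));

// Scrape: extract the title text from the page content.
const title = extractTitle(sampleHtml);

console.log(links);
console.log(contactPage);
console.log(title);
```

In this sketch, extractLinks plays the role of the crawler (it tells us where to go next), while extractTitle plays the role of the scraper (it pulls a value out of the page we already have).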
How does automation factor into the web scraping conversation?
When reading about web crawling and scraping, we often encounter the term “web automation”. Once scraping is in place, we can automate tasks like form submission, data extraction, testing, and validation. We will discuss some web automation techniques in future blog posts.
We will use various libraries in NodeJS to demonstrate a quick implementation of scraping. In this article, we will scrape the content of the title tag using these libraries.
Web scraping and JSDOM