There are vast amounts of data available on the internet just waiting to be scraped. Being able to scrape website/app data programmatically is quite powerful, and allows you to do some very interesting things. Today, my goal is to introduce you to web scraping with Node.js and Puppeteer.
Puppeteer is a Node.js library which provides a high-level API to control Google Chrome/Chromium. It runs headless by default, which means you don't even need to open a browser to interact with webpages. For example, visiting a website is as easy as this:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
From there, you can use the vast API provided by Puppeteer to do pretty much anything you can imagine. For example, if I wanted to access the text of an h1 header on example.com, I could do the following:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const header = await page.evaluate(() => {
    return document.querySelector('h1').innerText;
  });

  await browser.close();
})();
The variable header will now be a string that you can use within your Node.js code.
I could show you endless code snippets that explain how Puppeteer works, but I'd rather show you a real-world example of scraping data with it. To do so, we will scrape the latest esports team standings for Dota 2 from Dotabuff.
Let's get started by bootstrapping our code.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dotabuff.com/procircuit/team-standings');
})();
This will allow us to visit the standings page. From there, we want to access the data inside the HTML table. We can easily access this by doing the following:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dotabuff.com/procircuit/team-standings');

  // DOM APIs like document.querySelectorAll only exist in the browser
  // context, so the query has to run inside page.evaluate. DOM nodes
  // can't be passed back to Node.js, so we return something serializable
  // -- here, the text of each row.
  const tableRows = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('tbody > tr')).map(row => row.innerText);
  });
})();
In one line of code we now have access to every single row in the table. For our example, we only want to extract the data from columns 2, 3, 4, 5, and 7.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dotabuff.com/procircuit/team-standings');

  const standings = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('tbody > tr')).map(team => {
      return [
        team.querySelector('td:nth-child(2)').getAttribute('data-value'),
        team.querySelector('td:nth-child(3)').getAttribute('data-value'),
        team.querySelector('td:nth-child(4)').getAttribute('data-value'),
        team.querySelector('td:nth-child(5)').getAttribute('data-value'),
        team.querySelector('td:nth-child(7)').getAttribute('data-value'),
      ];
    });
  });
})();
This is pretty ugly. If you are like me and you see a bunch of code being repeated, there is a good chance it can be improved. We can fix this by creating a helper function called getColumn that extracts the data for us when we pass it a column number as an argument.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dotabuff.com/procircuit/team-standings');

  const standings = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('tbody > tr')).map(team => {
      const getColumn = (num) => {
        return team
          .querySelector(`td:nth-child(${num})`)
          .getAttribute('data-value');
      };

      return [
        getColumn(2),
        getColumn(3),
        getColumn(4),
        getColumn(5),
        getColumn(7),
      ];
    });
  });
})();
Our variable standings should now look like this in our Node.js code:
[
["Team Secret", "14400.0", "2966416.0", "2", "23350"],
["Virtus.pro", "13500.0", "1497139.0", "1", "9035"],
["Vici Gaming", "11250.0", "2112493.0", "2", "3298"],
["Evil Geniuses", "6825.0", "1517493.0", "0", "55961"],
["Team Liquid", "5820.0", "4726402.0", "0", "0"],
["PSG.LGD", "5040.0", "3342125.0", "0", "8125"],
...more teams
]
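Indexing into nested arrays like this gets error-prone once the code grows. One option is to zip each row into an object with named fields. A minimal sketch — note that the field names below are my own placeholders (only the first column is clearly the team name), so rename them to match what the columns actually represent on the standings page:

```javascript
// Placeholder field names: 'team' for the name column, and 'col3'..'col7'
// for the remaining columns we scraped (numbered after their td positions).
const FIELDS = ['team', 'col3', 'col4', 'col5', 'col7'];

function toObjects(rows) {
  // Pair each value with its field name, then build an object per row.
  return rows.map(row =>
    Object.fromEntries(row.map((value, i) => [FIELDS[i], value]))
  );
}

const sample = [['Team Secret', '14400.0', '2966416.0', '2', '23350']];
console.log(toObjects(sample)[0].team); // → Team Secret
```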
From here we can do whatever we want with this data. We could write it to a file on our computer, convert it to a CSV, send it via email, etc. For this example, though, we will write the standings to a JSON file and call it a day.
Here is the final code.
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dotabuff.com/procircuit/team-standings');

  const standings = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('tbody > tr')).map(team => {
      const getColumn = (num) => {
        return team
          .querySelector(`td:nth-child(${num})`)
          .getAttribute('data-value');
      };

      return [
        getColumn(2),
        getColumn(3),
        getColumn(4),
        getColumn(5),
        getColumn(7),
      ];
    });
  });

  fs.writeFileSync('./standings.json', JSON.stringify(standings));
  await browser.close();
})();
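If you'd rather go the CSV route mentioned earlier instead of JSON, here is a minimal sketch of converting the scraped rows by hand (the toCsv helper is my own, not part of Puppeteer; for anything serious, a dedicated CSV library would handle edge cases better):

```javascript
// Convert an array of rows into CSV text. Each field is wrapped in
// quotes and embedded quotes are doubled, per the usual CSV convention,
// so commas inside a team name won't break the output.
function toCsv(rows) {
  return rows
    .map(row =>
      row.map(field => `"${String(field).replace(/"/g, '""')}"`).join(',')
    )
    .join('\n');
}

const sample = [
  ['Team Secret', '14400.0', '2966416.0', '2', '23350'],
  ['Virtus.pro', '13500.0', '1497139.0', '1', '9035'],
];
console.log(toCsv(sample));
// "Team Secret","14400.0","2966416.0","2","23350"
// "Virtus.pro","13500.0","1497139.0","1","9035"
```

You could then swap the fs.writeFileSync call in the final code to write toCsv(standings) to a .csv file instead.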
If you have any problems understanding any of this, feel free to shoot me a message, and I'll make sure it's crystal clear.