Super Powers with Puppeteer

Today you are going to acquire a new super power: the power to control your browser with your thoughts! Okay, not exactly with your thoughts, but with scripts that I will show you how to write with Puppeteer, one of the most popular browser automation tools. We will cover many aspects of Puppeteer, such as crawling websites that require logging in, storing tokens and cookies, and following links within the site’s domain to build a comprehensive knowledge graph of the site’s data.

Introduction

Puppeteer is a Node.js library developed by Google that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It allows you to programmatically launch a browser instance, navigate to web pages, interact with elements on a page, execute JavaScript code, capture screenshots, generate PDFs, and much more.

In this tutorial, we will focus on using Puppeteer for web scraping and automation. Specifically, we will cover:

Launching a headless Chrome browser with Puppeteer
Logging into websites and saving login session cookies/tokens
Waiting for elements and network requests to load
Interacting with page elements like forms, buttons and links
Executing clicks, hovers, taps, and other UI interactions
Navigating to additional pages and domains
Extracting information from the pages
Building a comprehensive knowledge graph from the scraped data

This will provide you with the skills to build robust web automation and scraping scripts that can log into websites, traverse multiple pages, extract data, and assemble structured knowledge graphs.

Prerequisites

Before starting, you should have:

Node.js and npm installed on your system
Basic JavaScript knowledge
Familiarity with Chrome DevTools for debugging

Let’s get started!

Launching Puppeteer

First, we need to install the puppeteer module. Create a new folder for your project:


mkdir puppeteer-scraper
cd puppeteer-scraper
npm init -y
npm i puppeteer

This will create a package.json file and install puppeteer.

Next, create a scraper.js file and add the following code:

js
const puppeteer = require('puppeteer');

(async () => {

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  await browser.close();

})();

This launches a new Chromium instance in headless mode, opens a new tab, navigates to example.com, and then closes the browser.

Run it with:


node scraper.js

This will launch and close the browser automatically. The headless: false option can be passed to puppeteer.launch() to see the browser UI.
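
One thing the script above does not handle is failure: if page.goto() throws (on a timeout, for example), browser.close() is never reached and the Chromium process can linger. Here is a minimal sketch of one way to guard against that. The withPage helper is a made-up name, not part of Puppeteer, and it takes the puppeteer module as a parameter:

```javascript
// Hypothetical helper (not a Puppeteer API): open a page, run `task`,
// and always close the browser, even if `task` throws.
async function withPage(puppeteer, task, launchOptions = {}) {
  const browser = await puppeteer.launch(launchOptions);
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    await browser.close();
  }
}

// Usage (assumes the puppeteer package is installed):
// const puppeteer = require('puppeteer');
// withPage(puppeteer, async (page) => {
//   await page.goto('https://example.com');
//   return page.title();
// }, { headless: false }).then(console.log);
```

Passing { headless: false } as the third argument shows the browser window while the task runs.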

Logging Into Websites

Many websites require you to log in before accessing pages and content. Puppeteer provides ways to automate this login process.

The page.type() method allows you to type text into an input field. Along with page.click(), you can programmatically enter usernames and passwords.

For example:

js
// Navigate to login page
await page.goto('https://example.com/login'); 

// Type username
await page.type('#username', 'myuser');

// Type password  
await page.type('#password', 'mypassword');

// Click login button
await page.click('#login');

This will enter the username and password, and click the login button.

However, this process is not very robust. A better approach is to wait for the username and password fields to load before interacting with them:

js
// Navigate to login page
await page.goto('https://example.com/login');

// Wait for username field to load
await page.waitForSelector('#username');

// Type username  
await page.type('#username', 'myuser'); 

// Wait for password field to load
await page.waitForSelector('#password'); 

// Type password
await page.type('#password', 'mypassword'); 

// Wait for login button to load 
await page.waitForSelector('#login');

// Click login  
await page.click('#login');

This waits for each element to exist in the DOM before interacting with it, so page.type() and page.click() do not fail on a selector that has not rendered yet.
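
Hard-coding credentials in a script makes them easy to leak. The sketch below wraps the login steps in a hypothetical login helper that receives the credentials as arguments, for instance from environment variables (SCRAPER_USER and SCRAPER_PASS are made-up names). It reuses the selectors from the example above and assumes that a successful login triggers a page navigation:

```javascript
// Hypothetical helper; `page` is a Puppeteer Page. The selectors are
// assumptions carried over from the example above.
async function login(page, { loginUrl, username, password }) {
  if (!username || !password) {
    throw new Error('Missing credentials');
  }
  await page.goto(loginUrl);
  await page.waitForSelector('#username');
  await page.type('#username', username);
  await page.type('#password', password);
  // Start waiting for the navigation before clicking, so it cannot be missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('#login'),
  ]);
  return page.url();
}

// Usage:
// await login(page, {
//   loginUrl: 'https://example.com/login',
//   username: process.env.SCRAPER_USER,
//   password: process.env.SCRAPER_PASS,
// });
```

Returning page.url() lets the caller check where the site sent it after the login attempt.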

Saving Login Session

After logging in, you want to save the login session so subsequent requests are authenticated.

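Within a single browser instance, Puppeteer pages share cookies and storage by default, so any additional pages you open and visit with page.goto() use the same session. The session is lost when the browser closes, unless you save it.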

To save the cookies to reuse later, you can use:

js
// Save cookies to variable
const cookies = await page.cookies(); 

// Write cookies to file 
const fs = require('fs');
fs.writeFileSync('cookies.json', JSON.stringify(cookies));

This stores the cookies in a file. To use them again:

js
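// Load cookies
const fs = require('fs');
const puppeteer = require('puppeteer');

const cookiesString = fs.readFileSync('cookies.json', 'utf8');
const cookies = JSON.parse(cookiesString);

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Set cookies before navigating so the first request is already authenticated
await page.setCookie(...cookies);

// Navigate to any page on the site the cookies belong to
const url = 'https://example.com/account'; // placeholder URL
await page.goto(url);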

The page will now have the saved session cookies, logging you in automatically!
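
Saved cookies do not stay valid forever, so it can help to filter out the expired ones before restoring them. This is a small sketch, assuming the cookie objects have the shape returned by page.cookies(), where expires is a Unix timestamp in seconds and -1 marks a session cookie. The dropExpired name is hypothetical, and keeping session cookies is a judgment call:

```javascript
// Hypothetical helper: keep session cookies (expires === -1) and cookies
// whose expiry time, in seconds since the Unix epoch, is still in the future.
function dropExpired(cookies, nowSeconds = Date.now() / 1000) {
  return cookies.filter(
    (c) => c.expires === undefined || c.expires === -1 || c.expires > nowSeconds
  );
}

// Usage:
// const saved = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));
// const fresh = dropExpired(saved);
// if (fresh.length > 0) await page.setCookie(...fresh);
```

If nothing survives the filter, the script should fall back to logging in again.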

Waiting for Page Loads

After clicking links, submitting forms, and navigating to new pages, you need to wait for the page to fully load before interacting with elements.

Puppeteer offers several wait methods:

page.waitForNavigation() – Wait for the page to load after navigation:

js
// Start waiting before the click so a fast navigation is not missed
await Promise.all([
  page.waitForNavigation(),
  page.click('a.signup'),
]);

page.waitForRequest() – Wait for a matching network request to be sent:

js
const [request] = await Promise.all([
  page.waitForRequest(request => request.url() === 'https://example.com/api'),
  page.click('#submit'),
]);

This resolves as soon as the API request is issued after clicking submit. To wait until the server has answered, use page.waitForResponse() with the same kind of predicate.

page.waitForSelector() – Wait for an element to appear:

js
await page.waitForSelector('div.notification');

Wait for more complex conditions with page.waitForFunction():

js
await page.waitForFunction('document.title === "My Page"');

This waits until the page title changes to “My Page”.
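
All of these wait methods throw if the condition is not met before the timeout (30 seconds by default). When an element is optional, such as a cookie banner, you may prefer a yes/no answer over an exception. Here is a sketch of that idea; the appears helper is a made-up name, and the check on the error name is an assumption about how Puppeteer labels its timeout errors:

```javascript
// Hypothetical helper: resolve to true if `selector` appears within
// `timeout` milliseconds, false if the wait times out.
async function appears(page, selector, timeout = 5000) {
  try {
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch (err) {
    // Assumption: Puppeteer's timeout errors have the name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return false;
    throw err; // anything else is unexpected, so surface it
  }
}

// Usage:
// if (await appears(page, '#cookie-banner button', 2000)) {
//   await page.click('#cookie-banner button');
// }
```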

Interacting with Page Elements

Puppeteer provides many methods for interacting with elements on the page:

Clicking

js
// Click element by selector
await page.click('.submit');

// Click button after waiting for it 
await page.waitForSelector('button');
await page.click('button');

Typing Text

js
await page.type('#search', 'Hello World');

Pressing Keys

js
// Press Enter
await page.keyboard.press('Enter');

// Press arrow keys  
await page.keyboard.press('ArrowLeft');

Selecting Options

js
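// page.select() matches the option's value attribute, not its visible text
await page.select('select#colors', 'blue');

// To select by visible text, look up that option's value first
const value = await page.$$eval('select#colors option', (options, text) =>
  options.find(o => o.textContent.trim() === text)?.value, 'Blue');
if (value !== undefined) await page.select('select#colors', value);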

Hovering

js
await page.hover('button'); // Hover over button

Scrolling

js
// Scroll to top
await page.evaluate(() => window.scrollTo(0, 0));

// Scroll to bottom
await page.evaluate(() => 
  window.scrollTo(0, document.body.scrollHeight)
);

Taking Screenshots

js
await page.screenshot({path: 'page.png'});

This covers the basic ways to interact with page elements. Refer to the Puppeteer docs for many additional element handling methods.

Executing Complex Interactions

To perform multi-step interactions like filling out forms, you will need to:

Wait for each field/element to load
Type values into the fields
Click any required buttons or links
Wait for form submission and page navigation

For example:

js
// Wait for email field
await page.waitForSelector('#email'); 

// Type email 
await page.type('#email', 'my@email.com');

// Wait for name field
await page.waitForSelector('#name');

// Type name
await page.type('#name', 'My Name'); 

// Wait for Sign Up button
await page.waitForSelector('#signup-button'); 

// Click the button and wait for the resulting navigation together
await Promise.all([
  page.waitForNavigation(),
  page.click('#signup-button'),
]);

// Check new URL
const url = page.url();
if (url.includes('welcome')) {
  console.log('Sign up successful!');
}

This walks through each step of the sign up form, handling waits and navigation.

The same approach works for complex application testing flows like:

Logging into a site
Accessing account pages
Verifying account settings
Submitting forms
Checking for success/error messages

The key is to pair every action with the right wait, so the script never acts on a page that is still loading.
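
For the last item in that list, checking for success or error messages, one possible approach is to wait for both outcomes at once and see which element shows up first. This is only a sketch: waitForOutcome is a hypothetical helper, and the selectors in the usage note are placeholders for whatever your page actually renders.

```javascript
// Hypothetical helper: resolve to 'success' or 'error' depending on which
// element appears first. Rejects if neither appears within `timeout` ms.
async function waitForOutcome(page, successSelector, errorSelector, timeout = 10000) {
  return Promise.race([
    page.waitForSelector(successSelector, { timeout }).then(() => 'success'),
    page.waitForSelector(errorSelector, { timeout }).then(() => 'error'),
  ]);
}

// Usage:
// await page.click('#signup-button');
// const outcome = await waitForOutcome(page, '.alert-success', '.alert-error');
// console.log('Form submission result:', outcome);
```

This suits forms that update the page in place instead of navigating to a new URL.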

Navigating to Other Pages

To crawl an entire website, you need to follow links to navigate from page to page.

Clicking Links

Clicking a link will navigate automatically:

js
await page.waitForSelector('a.products'); 

// Click the link and wait for the resulting navigation together
await Promise.all([
  page.waitForNavigation(),
  page.click('a.products'),
]);

Getting All Links

To queue up additional URLs to visit, you can extract all link URLs on a page:

js
// Get all links
const links = await page.$$eval('a', as => 
  as.map(a => a.href)
);

// Queue links for crawling (queue is an array of URLs still to visit)
const queue = [];
for (const link of links) {
  queue.push(link);
}
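
Raw href values usually include duplicates, #fragment variants of the same page, mailto: links, and links to other sites, so it is worth cleaning the list before queueing it. Here is a minimal sketch using the standard URL class. The sameSiteLinks name is hypothetical, and treating "same origin" as "same site" is a simplifying assumption:

```javascript
// Hypothetical helper: keep only links on the same origin as `baseUrl`,
// drop #fragments, and remove duplicates.
function sameSiteLinks(links, baseUrl) {
  const origin = new URL(baseUrl).origin;
  const seen = new Set();
  for (const link of links) {
    let url;
    try {
      url = new URL(link, baseUrl);
    } catch {
      continue; // skip strings that are not valid URLs
    }
    if (url.origin !== origin) continue; // also drops mailto:, javascript:, etc.
    url.hash = '';
    seen.add(url.href);
  }
  return [...seen];
}

// Usage:
// const links = await page.$$eval('a', as => as.map(a => a.href));
// queue.push(...sameSiteLinks(links, page.url()));
```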

Intercepting Requests

To log every request the page makes and to redirect document requests to a different URL:

js
// Log requests
page.on('request', req => {
  console.log(req.url());
});

// Redirect navigation
await page.setRequestInterception(true);
page.on('request', req => {
  if(req.resourceType() === 'document') {
    req.continue({
      url: 'http://example.com' 
    });
  } else {
    req.continue();
  }
});

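This allows flexible control over navigation. Note that overriding url in req.continue() is not a real HTTP redirect: the request is silently forwarded to the new URL, while the page still sees the original one.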
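
Another common use of request interception when crawling is to skip resources you do not need, which can make pages load faster. The sketch below builds a request handler that aborts some resource types and continues everything else. makeBlocker is a hypothetical name and the default list of types is just an example:

```javascript
// Hypothetical factory: returns a 'request' handler that aborts requests
// whose resource type is in `blocked` and lets everything else through.
function makeBlocker(blocked = ['image', 'font', 'media']) {
  const types = new Set(blocked);
  return (req) => {
    if (types.has(req.resourceType())) {
      req.abort();
    } else {
      req.continue();
    }
  };
}

// Usage:
// await page.setRequestInterception(true);
// page.on('request', makeBlocker());
```

Each intercepted request should be continued or aborted by exactly one handler, so do not combine this with another handler that also calls req.continue().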

Extracting Data from Pages

Once you have navigated to a page, you can extract useful data from it.

Page Content

The page.content() method returns the full HTML content of the page:

js
const html = await page.content(); // Page HTML

Individual Elements

Use CSS selectors to extract specific elements:

js
// Get heading text
const heading = await page.$eval('h1', el => el.textContent); 

// Get image src 
const imgSrc = await page.$eval('img', el => el.src);

Multiple Elements

Use $$eval to extract data from multiple matching elements:

js
// Get all product prices
const prices = await page.$$eval('div.product', divs => 
  divs.map(div => div.dataset.price)
);

JSON Objects

Extract JSON data embedded in <script> tags (this assumes the tag’s content is valid JSON):

js
// Get JSON data
const jsonData = await page.$eval('script#data', el => 
  JSON.parse(el.textContent)
);

Screenshots

Take screenshots of full pages or specific elements:

js
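// Screenshot of the full scrollable page (the default captures only the viewport)
await page.screenshot({path: 'page.png', fullPage: true});

// Screenshot of a fixed region of the page
await page.screenshot({path: 'region.png', clip: {
  x: 0, y: 0, width: 100, height: 100
}});

// Screenshot of a single element
const img = await page.$('img');
if (img) await img.screenshot({path: 'image.png'});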

This provides a toolkit for extracting any data from pages!

Building a Knowledge Graph

Now that we can extract data from individual pages, we can put it together into a structured knowledge graph.

As you crawl each page, collect the desired data into JSON objects:

js
const pageData = {
  url: page.url(),
  title: await page.title(),
  links: await page.$$eval('a', as => as.map(a => a.href)),
  images: await page.$$eval('img', imgs => 
    imgs.map(img => img.src)  
  ),
}

Assemble page data objects into a graph:

js
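const graph = {};
const origin = 'https://example.com';

async function crawl(url) {
  // Skip pages already visited and links that leave the site;
  // without this check the recursion never ends
  if (url in graph || !url.startsWith(origin)) return;

  // browser is the instance returned by puppeteer.launch()
  const page = await browser.newPage();
  await page.goto(url);

  const pageData = {
    // Extract data from page
    title: await page.title(),
    links: await page.$$eval('a', as => as.map(a => a.href)),
  };

  graph[url] = pageData;

  // Close the tab before recursing so open tabs do not pile up
  await page.close();

  for (const link of pageData.links) {
    await crawl(link);
  }
}

await crawl(origin + '/');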

Recursively following links and recording each page’s data builds up the knowledge graph.

Run across multiple pages to assemble a comprehensive representation of an entire website’s structure and content!
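
Recursion is not the only option. On a large site you may prefer an explicit queue, a set of URLs already seen, and a cap on the number of pages. Below is a sketch of that breadth-first variant. crawlSite and its visit callback are hypothetical names; the callback is where the Puppeteer work (open the page, extract the data, close the page) would go, and it is assumed to return an object with a links array.

```javascript
// Hypothetical crawler core. `visit(url)` must return a promise for an
// object with a `links` array; with Puppeteer it would open the page and
// extract the data.
async function crawlSite(startUrl, visit, maxPages = 50) {
  const graph = {};
  const queue = [startUrl];
  const seen = new Set(queue);
  let count = 0;
  while (queue.length > 0 && count < maxPages) {
    const url = queue.shift();
    const data = await visit(url);
    graph[url] = data;
    count += 1;
    for (const link of data.links) {
      if (!seen.has(link)) {
        seen.add(link); // mark when queued so a URL is never queued twice
        queue.push(link);
      }
    }
  }
  return graph;
}

// Usage:
// const graph = await crawlSite('https://example.com/', async (url) => {
//   const page = await browser.newPage();
//   await page.goto(url);
//   const links = await page.$$eval('a', as => as.map(a => a.href));
//   await page.close();
//   return { links };
// }, 100);
```

Because the Puppeteer calls live in the callback, the traversal logic can be tried out against a fake site map without launching a browser.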

The graph data can then be analyzed, visualized, and exported as a JSON dataset.
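
Many visualization tools expect a list of nodes and a list of edges instead of an object keyed by URL. Here is a sketch of that conversion, assuming each entry in the graph has a links array as in the crawl example. The toNodesAndEdges name and the { source, target } edge shape are my own choices, not a standard:

```javascript
// Hypothetical converter: turn { url: { links: [...] } } into a
// { nodes, edges } structure, keeping only edges between crawled pages.
function toNodesAndEdges(graph) {
  const nodes = Object.keys(graph);
  const known = new Set(nodes);
  const edges = [];
  for (const source of nodes) {
    // A Set removes repeated links from the same page
    for (const target of new Set(graph[source].links || [])) {
      if (known.has(target)) edges.push({ source, target });
    }
  }
  return { nodes, edges };
}

// Usage:
// const fs = require('fs');
// fs.writeFileSync('graph.json', JSON.stringify(toNodesAndEdges(graph), null, 2));
```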

Closing Thoughts

This summarizes some of the key techniques for building robust web automation and scraping scripts with Puppeteer:

Automating logins and saving session cookies
Waiting for page loads and element availability
Interacting with forms, buttons, links and more
Navigating between pages on a site
Extracting structured data from pages
Assembling knowledge graphs from multiple pages

There are many additional capabilities, like modifying network requests, generating PDFs, running your own JavaScript inside the page, automating mouse movement, and much more.

Puppeteer provides a flexible framework for modeling real browser interactions. With some creativity you can automate and extract data from almost any website.

The core ideas are:

Using proper waits for page loads and elements
Interacting with elements like a real user
Gathering structured data from each page
Recursively following links to build a full site graph

Hopefully this quick overview of browser automation with Puppeteer gives you the knowledge and curiosity to start building robust browser automation projects! Let me know if you have any other questions.

The Curious Programmer
