Today you are going to acquire a new superpower… the power to control your browser with your thoughts! Okay, not exactly with your thoughts, but with scripts powered by Puppeteer, the most popular and powerful browser automation tool. We will cover many aspects of Puppeteer, such as crawling websites that require logging in, storing tokens and cookies, and following links and pages within a site's domain to build a comprehensive knowledge graph of its data.
In this tutorial, we will focus on using Puppeteer for web scraping and automation. Specifically, we will cover:
- Launching a headless Chrome browser with Puppeteer
- Logging into websites and saving login session cookies/tokens
- Waiting for elements and network requests to load
- Interacting with page elements like forms, buttons and links
- Executing clicks, hovers, taps, and other UI interactions
- Navigating to additional pages and domains
- Extracting information from the pages
- Building a comprehensive knowledge graph from the scraped data
This will provide you with the skills to build robust web automation and scraping scripts that can log into websites, traverse multiple pages, extract data, and assemble structured knowledge graphs.
Before starting, you should have:
- Node.js and NPM installed on your system
- Familiarity with Chrome DevTools for debugging
Let’s get started!
First, we need to install the `puppeteer` module. Create a new folder for your project:
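```
mkdir my-scraper   # folder name is just an example
cd my-scraper
npm init -y
npm install puppeteer
```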
This will create a `package.json` file and install `puppeteer` as a dependency.
Next, create a `scraper.js` file and add the following code:
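```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance (headless is the default)
  const browser = await puppeteer.launch();

  // Open a new tab and navigate to the page
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Close the browser when done
  await browser.close();
})();
```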
This launches a new Chromium instance in headless mode, opens a new tab, navigates to example.com, and then closes the browser.
Run it with:
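```
node scraper.js
```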
This will launch and close the browser automatically. The `headless: false` option can be passed to `puppeteer.launch()` to see the browser UI:
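```js
// Inside your async function: show the browser window while the script runs
const browser = await puppeteer.launch({ headless: false });
```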
Logging Into Websites
Many websites require you to login before accessing pages and content. Puppeteer provides ways to automate this login process.
The `page.type()` method allows you to type text into an input field. Along with `page.click()`, you can programmatically enter usernames and passwords.
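For example, on a hypothetical login form (the URL and the `#username`, `#password`, and `#login-button` selectors are placeholders for your site's actual fields):

```js
await page.goto('https://example.com/login');

await page.type('#username', 'myUsername');
await page.type('#password', 'myPassword');
await page.click('#login-button');
```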
This will enter the username and password, and click the login button.
However, this process is not very robust. A better approach is to wait for the username and password fields to load before interacting with them:
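```js
// Wait for each field to exist before touching it
await page.waitForSelector('#username');
await page.type('#username', 'myUsername');

await page.waitForSelector('#password');
await page.type('#password', 'myPassword');

await page.waitForSelector('#login-button');
await page.click('#login-button');
```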
This waits for each element to load before interacting with it, preventing errors.
Saving Login Session
After logging in, you want to save the login session so subsequent requests are authenticated.
Puppeteer persists cookies and tokens by default, so any additional pages opened with `page.goto()` will share the same session.
To save the cookies to reuse later, you can use:
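```js
const fs = require('fs');

// Capture the current session cookies and write them to disk
const cookies = await page.cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));
```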
This stores the cookies in a file. To use them again:
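```js
const fs = require('fs');

// Restore the saved cookies before navigating
const cookies = JSON.parse(fs.readFileSync('cookies.json'));
await page.setCookie(...cookies);
await page.goto('https://example.com/account'); // placeholder URL
```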
The page will now have the saved session cookies, logging you in automatically!
Waiting for Page Loads
After clicking links, submitting forms, and navigating to new pages, you need to wait for the page to fully load before interacting with elements.
Puppeteer offers several wait methods:
`page.waitForNavigation()` – Wait for the page to load after navigation:
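```js
// Start waiting for the navigation before triggering it, to avoid a race
await Promise.all([
  page.waitForNavigation(),
  page.click('a.next-page'), // placeholder selector
]);
```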
`page.waitForResponse()` – Wait for a network request to complete (the related `page.waitForRequest()` resolves as soon as the request is sent, before the response arrives):
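```js
await Promise.all([
  // The URL fragment is a placeholder; match whatever endpoint your form hits
  page.waitForResponse(response => response.url().includes('/api/submit')),
  page.click('#submit'),
]);
```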
This waits for the API request to finish after clicking submit.
`page.waitForSelector()` – Wait for an element to appear:
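```js
// Resolves once the element appears in the DOM
await page.waitForSelector('#results');
```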
Wait for more complex conditions with `page.waitForFunction()`:
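```js
// Polls inside the page until the function returns a truthy value
await page.waitForFunction(() => document.title === 'My Page');
```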
This waits until the page title changes to “My Page”.
Interacting with Page Elements
Puppeteer provides many methods for interacting with elements on the page:
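```js
// All selectors below are placeholders
await page.click('#button');              // click an element
await page.hover('.menu-item');           // move the mouse over an element
await page.type('#search', 'puppeteer');  // type into an input
await page.select('#country', 'US');      // choose an option in a <select>
await page.focus('#email');               // focus a field
await page.keyboard.press('Enter');       // press a key
await page.tap('#menu-toggle');           // tap, for touch-enabled pages
```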
This covers the basic ways to interact with page elements. Refer to the Puppeteer docs for many additional element handling methods.
Executing Complex Interactions
To perform multi-step interactions like filling out forms, you will need to:
- Wait for each field/element to load
- Type values into the fields
- Click any required buttons or links
- Wait for form submission and page navigation
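Putting those steps together, here is a sketch of a sign-up flow (the URL and field selectors are hypothetical):

```js
await page.goto('https://example.com/signup');

// Fill each field once it is available
await page.waitForSelector('#name');
await page.type('#name', 'Jane Doe');

await page.waitForSelector('#email');
await page.type('#email', 'jane@example.com');

await page.waitForSelector('#password');
await page.type('#password', 'correct horse battery staple');

// Submit and wait for the resulting navigation
await Promise.all([
  page.waitForNavigation(),
  page.click('#signup-button'),
]);
```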
This walks through each step of the sign up form, handling waits and navigation.
The same approach works for complex application testing flows like:
- Logging into a site
- Accessing account pages
- Verifying account settings
- Submitting forms
- Checking for success/error messages
Carefully synchronizing waits with navigation is the key to making these flows reliable.
Navigating to Other Pages
To crawl an entire website, you need to follow links to navigate from page to page.
Clicking a link will navigate automatically:
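```js
// Clicking triggers navigation; wait for the new page to finish loading
await Promise.all([
  page.waitForNavigation(),
  page.click('a[href="/about"]'), // placeholder link
]);
console.log(page.url()); // now the new page's URL
```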
Getting All Links
To queue up additional URLs to visit, you can extract all link URLs on a page:
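```js
// Collect the href of every anchor on the page
const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
```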
To log navigation requests and redirect to different URLs:
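```js
await page.setRequestInterception(true);

page.on('request', request => {
  if (request.isNavigationRequest()) {
    console.log('Navigating to:', request.url());
  }

  // Example redirect; both URLs are placeholders
  if (request.url() === 'https://example.com/old-page') {
    request.continue({ url: 'https://example.com/new-page' });
  } else {
    request.continue();
  }
});
```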
This allows flexible control over navigation.
Extracting Data from Pages
Once you have navigated to a page, you can extract useful data from it.
The `page.content()` method returns the full HTML content of the page:
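```js
const html = await page.content();
```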
Use CSS selectors to extract specific elements:
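```js
// Text of the first matching element (selector is an example)
const heading = await page.$eval('h1', el => el.textContent.trim());
```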
Use `page.$$eval()` to extract data from multiple matching elements:
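```js
// Pull fields from every matching element (the markup is hypothetical)
const products = await page.$$eval('.product', cards =>
  cards.map(card => ({
    title: card.querySelector('h2')?.textContent,
    url: card.querySelector('a')?.href,
  }))
);
```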
Take screenshots of full pages or specific elements:
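```js
// Full-page screenshot
await page.screenshot({ path: 'page.png', fullPage: true });

// Screenshot of a single element (selector is an example)
const chart = await page.$('#chart');
await chart.screenshot({ path: 'chart.png' });
```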
This provides a toolkit for extracting any data from pages!
Building a Knowledge Graph
Now that we can extract data from individual pages, we can put it together into a structured knowledge graph.
As you crawl each page, collect the desired data into JSON objects:
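```js
// One node of the graph: data scraped from a single page
const pageData = {
  url: page.url(),
  title: await page.title(),
  headings: await page.$$eval('h1, h2', els => els.map(el => el.textContent.trim())),
  links: await page.$$eval('a', anchors => anchors.map(a => a.href)),
};
```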
Assemble the page data objects into a graph. Here is a minimal sketch of a recursive crawler; the depth cap and same-origin filter are assumptions to keep the crawl bounded:
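```js
const graph = { nodes: {}, edges: [] };
const visited = new Set();

async function crawl(page, url, depth = 0) {
  if (visited.has(url) || depth > 2) return; // depth cap keeps the crawl bounded
  visited.add(url);

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Record this page as a node; scrape whatever fields you need
  graph.nodes[url] = { title: await page.title() };

  const links = await page.$$eval('a', anchors => anchors.map(a => a.href));

  for (const link of links) {
    if (!link.startsWith('http')) continue;
    // Stay within the site's domain
    if (new URL(link).origin !== new URL(url).origin) continue;

    graph.edges.push({ from: url, to: link });
    await crawl(page, link, depth + 1);
  }
}
```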
Recursing through the links and accumulating page data builds up the knowledge graph.
Run across multiple pages to assemble a comprehensive representation of an entire website’s structure and content!
The graph data can then be analyzed, visualized, and exported as a JSON dataset.
This summarizes some of the key techniques for building robust web automation and scraping scripts with Puppeteer:
- Automating logins and saving session cookies
- Waiting for page loads and element availability
- Interacting with forms, buttons, links and more
- Navigating between pages on a site
- Extracting structured data from pages
- Assembling knowledge graphs from multiple pages
There are many additional capabilities like intercepting network requests, generating PDFs, evaluating JavaScript inside the page, automating mouse movement, and much more.
Puppeteer provides a flexible framework for modeling real browser interactions. With some creativity you can automate and extract data from almost any website.
The core ideas are:
- Using proper waits for page loads and elements
- Interacting with elements like a real user
- Gathering structured data from each page
- Recursively following links to build a full site graph
Hopefully this quick overview of the world of browser automation and Puppeteer gives you the knowledge and curiosity to start building robust browser automation projects! Let me know if you have any other questions.