Today you are going to acquire a new superpower… the power to control your browser with your thoughts! Okay, not exactly with your thoughts, but with scripts powered by Puppeteer, the most popular and powerful browser automation tool. We will cover many aspects of Puppeteer, such as crawling websites that require logging in, storing tokens and cookies, and following links and pages within a site's domain to build a comprehensive knowledge graph of its data.
In this tutorial, we will focus on using Puppeteer for web scraping and automation. Specifically, we will cover:
- Launching a headless Chrome browser with Puppeteer
- Logging into websites and saving login session cookies/tokens
- Waiting for elements and network requests to load
- Interacting with page elements like forms, buttons and links
- Executing clicks, hovers, taps, and other UI interactions
- Navigating to additional pages and domains
- Extracting information from the pages
- Building a comprehensive knowledge graph from the scraped data
This will provide you with the skills to build robust web automation and scraping scripts that can log into websites, traverse multiple pages, extract data, and assemble structured knowledge graphs.
Before starting, you should have:
- Node.js and NPM installed on your system
- Familiarity with Chrome DevTools for debugging
Let’s get started!
First, we need to install the `puppeteer` module. Create a new folder for your project:
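```
mkdir my-scraper   # folder name is just an example
cd my-scraper
npm init -y
npm install puppeteer
```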
This will create a `package.json` file and install `puppeteer` as a dependency.
Next, create a `scraper.js` file and add the following code:
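```js
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance (headless is the default)
  const browser = await puppeteer.launch();

  // Open a new tab and navigate to the page
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Close the browser when done
  await browser.close();
})();
```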
This launches a new Chromium instance in headless mode, opens a new tab, navigates to example.com, and then closes the browser.
Run it with:
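```
node scraper.js
```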
This will launch and close the browser automatically. The `headless: false` option can be passed to `puppeteer.launch()` to see the browser UI:
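```js
// Inside your async function: show the browser window while the script runs
const browser = await puppeteer.launch({ headless: false });
```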
Logging Into Websites
Many websites require you to login before accessing pages and content. Puppeteer provides ways to automate this login process.
The `page.type()` method allows you to type text into an input field. Along with `page.click()`, you can programmatically enter usernames and passwords.
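For example, on a hypothetical login form (the URL and the `#username`, `#password`, and `#login-button` selectors are placeholders for your site's actual fields):

```js
await page.goto('https://example.com/login');

await page.type('#username', 'myUsername');
await page.type('#password', 'myPassword');
await page.click('#login-button');
```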
This will enter the username and password, and click the login button.
However, this process is not very robust. A better approach is to wait for the username and password fields to load before interacting with them:
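```js
// Wait for each field to exist before touching it
await page.waitForSelector('#username');
await page.type('#username', 'myUsername');

await page.waitForSelector('#password');
await page.type('#password', 'myPassword');

await page.waitForSelector('#login-button');
await page.click('#login-button');
```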
This waits for each element to load before interacting with it, preventing errors.
Saving Login Session
After logging in, you want to save the login session so subsequent requests are authenticated.
Puppeteer persists cookies and tokens by default, so any additional pages opened with `page.goto()` will share the same session.
To save the cookies to reuse later, you can use:
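```js
const fs = require('fs');

// Capture the current session cookies and write them to disk
const cookies = await page.cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));
```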
This stores the cookies in a file. To use them again:
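```js
const fs = require('fs');

// Restore the saved cookies before navigating
const cookies = JSON.parse(fs.readFileSync('cookies.json'));
await page.setCookie(...cookies);
await page.goto('https://example.com/account'); // placeholder URL
```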
The page will now have the saved session cookies, logging you in automatically!
Waiting for Page Loads
After clicking links, submitting forms, and navigating to new pages, you need to wait for the page to fully load before interacting with elements.
Puppeteer offers several wait methods:
`page.waitForNavigation()` – Wait for the page to load after navigation:
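```js
// Start waiting for the navigation before triggering it, to avoid a race
await Promise.all([
  page.waitForNavigation(),
  page.click('a.next-page'), // placeholder selector
]);
```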
`page.waitForResponse()` – Wait for a network request to complete (the related `page.waitForRequest()` resolves as soon as the request is sent, before the response arrives):
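```js
await Promise.all([
  // The URL fragment is a placeholder; match whatever endpoint your form hits
  page.waitForResponse(response => response.url().includes('/api/submit')),
  page.click('#submit'),
]);
```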
This waits for the API request to finish after clicking submit.
`page.waitForSelector()` – Wait for an element to appear:
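```js
// Resolves once the element appears in the DOM
await page.waitForSelector('#results');
```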
Wait for more complex conditions with `page.waitForFunction()`:
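```js
// Polls inside the page until the function returns a truthy value
await page.waitForFunction(() => document.title === 'My Page');
```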
This waits until the page title changes to “My Page”.
Interacting with Page Elements
Puppeteer provides many methods for interacting with elements on the page:
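```js
// All selectors below are placeholders
await page.click('#button');              // click an element
await page.hover('.menu-item');           // move the mouse over an element
await page.type('#search', 'puppeteer');  // type into an input
await page.select('#country', 'US');      // choose an option in a <select>
await page.focus('#email');               // focus a field
await page.keyboard.press('Enter');       // press a key
await page.tap('#menu-toggle');           // tap, for touch-enabled pages
```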
This covers the basic ways to interact with page elements. Refer to the Puppeteer docs for many additional element handling methods.
Executing Complex Interactions
To perform multi-step interactions like filling out forms, you will need to:
- Wait for each field/element to load
- Type values into the fields
- Click any required buttons or links
- Wait for form submission and page navigation
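Putting those steps together, here is a sketch of a sign-up flow (the URL and field selectors are hypothetical):

```js
await page.goto('https://example.com/signup');

// Fill each field once it is available
await page.waitForSelector('#name');
await page.type('#name', 'Jane Doe');

await page.waitForSelector('#email');
await page.type('#email', 'jane@example.com');

await page.waitForSelector('#password');
await page.type('#password', 'correct horse battery staple');

// Submit and wait for the resulting navigation
await Promise.all([
  page.waitForNavigation(),
  page.click('#signup-button'),
]);
```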
This walks through each step of the sign up form, handling waits and navigation.
The same approach works for complex application testing flows like:
- Logging into a site
- Accessing account pages
- Verifying account settings
- Submitting forms
- Checking for success/error messages
Carefully synchronizing waits with navigation is the key to making these flows reliable.
Navigating to Other Pages
To crawl an entire website, you need to follow links to navigate from page to page.
Clicking a link will navigate automatically:
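```js
// Clicking triggers navigation; wait for the new page to finish loading
await Promise.all([
  page.waitForNavigation(),
  page.click('a[href="/about"]'), // placeholder link
]);
console.log(page.url()); // now the new page's URL
```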
Getting All Links
To queue up additional URLs to visit, you can extract all link URLs on a page:
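```js
// Collect the href of every anchor on the page
const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
```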
To log navigation requests and redirect to different URLs:
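```js
await page.setRequestInterception(true);

page.on('request', request => {
  if (request.isNavigationRequest()) {
    console.log('Navigating to:', request.url());
  }

  // Example redirect; both URLs are placeholders
  if (request.url() === 'https://example.com/old-page') {
    request.continue({ url: 'https://example.com/new-page' });
  } else {
    request.continue();
  }
});
```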
This allows flexible control over navigation.
Extracting Data from Pages
Once you have navigated to a page, you can extract useful data from it.
The `page.content()` method returns the full HTML content of the page:
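```js
const html = await page.content();
```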
Use CSS selectors to extract specific elements:
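```js
// Text of the first matching element (selector is an example)
const heading = await page.$eval('h1', el => el.textContent.trim());
```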
Use `page.$$eval()` to extract data from multiple matching elements:
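```js
// Pull fields from every matching element (the markup is hypothetical)
const products = await page.$$eval('.product', cards =>
  cards.map(card => ({
    title: card.querySelector('h2')?.textContent,
    url: card.querySelector('a')?.href,
  }))
);
```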
Take screenshots of full pages or specific elements:
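```js
// Full-page screenshot
await page.screenshot({ path: 'page.png', fullPage: true });

// Screenshot of a single element (selector is an example)
const chart = await page.$('#chart');
await chart.screenshot({ path: 'chart.png' });
```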
This provides a toolkit for extracting any data from pages!
Building a Knowledge Graph
Now that we can extract data from individual pages, we can put it together into a structured knowledge graph.
As you crawl each page, collect the desired data into JSON objects:
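```js
// One node of the graph: data scraped from a single page
const pageData = {
  url: page.url(),
  title: await page.title(),
  headings: await page.$$eval('h1, h2', els => els.map(el => el.textContent.trim())),
  links: await page.$$eval('a', anchors => anchors.map(a => a.href)),
};
```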
Assemble the page data objects into a graph. Here is a minimal sketch of a recursive crawler; the depth cap and same-origin filter are assumptions to keep the crawl bounded:
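```js
const graph = { nodes: {}, edges: [] };
const visited = new Set();

async function crawl(page, url, depth = 0) {
  if (visited.has(url) || depth > 2) return; // depth cap keeps the crawl bounded
  visited.add(url);

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Record this page as a node; scrape whatever fields you need
  graph.nodes[url] = { title: await page.title() };

  const links = await page.$$eval('a', anchors => anchors.map(a => a.href));

  for (const link of links) {
    if (!link.startsWith('http')) continue;
    // Stay within the site's domain
    if (new URL(link).origin !== new URL(url).origin) continue;

    graph.edges.push({ from: url, to: link });
    await crawl(page, link, depth + 1);
  }
}
```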
Recursing through the links and accumulating page data builds up the knowledge graph.
Run across multiple pages to assemble a comprehensive representation of an entire website’s structure and content!
The graph data can then be analyzed, visualized, and exported as a JSON dataset.
This summarizes some of the key techniques for building robust web automation and scraping scripts with Puppeteer:
- Automating logins and saving session cookies
- Waiting for page loads and element availability
- Interacting with forms, buttons, links and more
- Navigating between pages on a site
- Extracting structured data from pages
- Assembling knowledge graphs from multiple pages
There are many additional capabilities like intercepting network requests, generating PDFs, evaluating JavaScript inside the page, automating mouse movement, and much more.
Puppeteer provides a flexible framework for modeling real browser interactions. With some creativity you can automate and extract data from almost any website.
The core ideas are:
- Using proper waits for page loads and elements
- Interacting with elements like a real user
- Gathering structured data from each page
- Recursively following links to build a full site graph
Hopefully this quick overview of the world of browser automation and Puppeteer gives you the knowledge and curiosity to start building robust browser automation projects! Let me know if you have any other questions.