How to take screenshots of specific elements using Puppeteer

Posted November 4, 2023 by Admin · 4 min read

A client of ours recently asked us for help with exporting hundreds of assembly diagrams from their WordPress website. In this post we'll explain how we used Puppeteer to capture each diagram as a screenshot, quickly and easily.

The problem

Although we had access to the website and its database, we couldn't just export images of assemblies and save them to disk because they included annotations which were being rendered separately and overlaid using CSS. The client was specifically interested in archiving the images exactly as they appeared to end users, in a pixel-perfect manner.

Puppeteer is a Node.js library for browser automation that lets you do programmatically anything you could do manually in a normal browser window. We knew right from the start that Puppeteer would be the perfect tool for the job because, by default, it drives a real Chrome browser, so HTML, JS, and CSS are rendered just as they would be for a user visiting the site in Chrome. With Puppeteer, we would also be able to take screenshots of specific parts of the page, even ones hidden behind user interaction such as a mouse click that expands an accordion or opens a modal.
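To make the "hidden behind user interaction" case concrete, here is a minimal sketch of clicking a trigger and then capturing the element it reveals. It assumes a Puppeteer page object like the one created in the setup further down; the function name screenshotAfterClick and both selectors are hypothetical and not part of the client project.

```javascript
// Click a trigger, wait for the revealed element, then screenshot just that element.
// `page` is a Puppeteer Page; both selectors are hypothetical placeholders.
async function screenshotAfterClick(page, triggerSelector, targetSelector, path) {
    // Simulate the user's click that reveals the content
    await page.click(triggerSelector);

    // Wait until the revealed element is in the DOM and visible
    const target = await page.waitForSelector(targetSelector, { visible: true });

    // Capture only the element's bounding box, not the whole page
    await target.screenshot({ path });
    return path;
}
```

A call might look like await screenshotAfterClick(page, '.accordion-toggle', '.accordion-panel', 'out/panel.png'), with the selectors swapped for whatever the target page actually uses.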

Setup

First, we needed to install Puppeteer:

npm i puppeteer

Next, we hopped into the database to export a list of the target pages to crawl.

SELECT 
    p.ID,
    p.post_title,
    CONCAT(
        'https://www.client-website.com/',
        p.post_type,
        '/',
        p.post_name
    ) AS url
FROM wp_posts p
WHERE p.post_status = 'publish';

From here, we can start writing the basic outline of our program. We have a list of URLs to crawl, and we'll visit each of them in a loop. We first import Puppeteer using ES module syntax (which means saving the script with an .mjs extension or setting "type": "module" in package.json), and then launch the browser before we get started:

import puppeteer from 'puppeteer';

const urls = [
    // list of pages from the database
];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    for (const url of urls) {
        console.log('Navigating to', url);
        await page.goto(url);

        // TODO take screenshot
    }

    await browser.close();
})();
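With hundreds of pages, a single timeout or broken URL would stop the loop above partway through. The following is one possible way to harden it, not what we originally ran: the hypothetical crawl helper waits for the network to go quiet before handing the page to a callback, and collects failures instead of throwing.

```javascript
// Visit each URL in turn and hand the loaded page to a callback.
// Failures are collected instead of aborting the whole run.
async function crawl(page, urls, onPage) {
    const failures = [];

    for (const url of urls) {
        try {
            // 'networkidle0' resolves once there have been no network connections for 500 ms
            await page.goto(url, { waitUntil: 'networkidle0' });
            await onPage(page, url);
        } catch (err) {
            // Record the problem and carry on with the next URL
            failures.push({ url, message: err.message });
        }
    }

    return failures;
}
```

Inside the async wrapper this could replace the for loop, for example const failures = await crawl(page, urls, async (page, url) => { /* take screenshots */ }); and anything left in failures can be retried or inspected afterwards.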

Targeting specific page elements

Each of the pages we're visiting has multiple assembly diagrams on it, so we'll need to select all of them and take a screenshot for each. We can use the page.$$() method to query the DOM and return a list of ElementHandles. After that, we can take a PNG screenshot of each ElementHandle and save it to disk:

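// Collect every assembly diagram wrapper on the current page
const assemblies = await page.$$('.assembly_layout_wrap');

for (const assemblyElement of assemblies) {
    // Capture only this element; create the out/ directory before running the script
    await assemblyElement.screenshot({
        // The millisecond timestamp keeps filenames unique across pages
        path: `out/assembly_${Date.now()}.png`
    });
}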
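Timestamp-based filenames are unique, but they don't tell you which page a diagram came from. As an alternative sketch, the filename can be derived from the page URL and the diagram's position on the page. The helpers screenshotPath and captureAssemblies are hypothetical names, the output directory is assumed to exist already, and two pages sharing the same final URL segment would still overwrite each other.

```javascript
// Build a readable file path from the page URL and the diagram's position on the page.
function screenshotPath(outDir, url, index) {
    // Use the last non-empty path segment as the slug, e.g. ".../assembly/pump-a" -> "pump-a"
    const segments = new URL(url).pathname.split('/').filter(Boolean);
    const slug = (segments.pop() || 'index').replace(/[^a-z0-9_-]/gi, '_');
    return `${outDir}/${slug}_${index + 1}.png`;
}

// Screenshot every assembly diagram on the already-loaded page and return the saved paths.
async function captureAssemblies(page, url, outDir) {
    const assemblies = await page.$$('.assembly_layout_wrap');
    const saved = [];

    for (const [index, assemblyElement] of assemblies.entries()) {
        const path = screenshotPath(outDir, url, index);
        await assemblyElement.screenshot({ path });
        saved.push(path);
    }

    return saved;
}
```

Called as await captureAssemblies(page, url, 'out') right after page.goto(url), this would save the diagrams of a page ending in /pump-a as out/pump-a_1.png, out/pump-a_2.png, and so on.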

Handling authentication with Puppeteer

The website we were targeting, while public-facing, requires logging in to access assembly information. If we ran our code right now, we'd get caught in a redirect asking us to sign in.

Technically we could also instruct Puppeteer to detect the redirect, fill out the login form, and submit it, but that sounds like too much work for a script we'll probably only ever use once. What if we could log in manually, and transfer that "logged-in"-ness to our code? We can, with session hijacking!

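In a separate browser, we logged into the client website, copied the value of the PHPSESSID cookie, and pasted it into our code. The cookie has to be set before the first page.goto() call so that the session is sent along with every request: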

await page.setCookie({
    domain: 'www.client-website.com',
    name: 'PHPSESSID',
    value: 'mtfjn9dqawjhoej2sp52z4clsojd7g6s'
});
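Two things can go wrong with a pasted cookie: the value ends up committed alongside the script, and the session can expire mid-run, in which case the script would quietly screenshot nothing. The sketch below addresses both under stated assumptions: buildSessionCookie and gotoAuthenticated are hypothetical helpers, and '/login' is a placeholder for wherever the site actually redirects anonymous visitors.

```javascript
// Build the session cookie from a value supplied at runtime instead of hardcoding it.
function buildSessionCookie(value, domain = 'www.client-website.com') {
    if (!value) {
        throw new Error('Missing session cookie value');
    }
    return { name: 'PHPSESSID', value, domain };
}

// Navigate and fail loudly if we end up on the login page instead of the target page.
// `loginPath` is a hypothetical placeholder; adjust it to the site's real login URL.
async function gotoAuthenticated(page, url, loginPath = '/login') {
    await page.goto(url);

    // page.url() reflects the final URL after any redirects
    if (new URL(page.url()).pathname.startsWith(loginPath)) {
        throw new Error(`Session expired or invalid: redirected to login from ${url}`);
    }
}
```

In the main script this would look like await page.setCookie(buildSessionCookie(process.env.PHPSESSID)); before the loop (the environment variable name is our choice here), followed by await gotoAuthenticated(page, url) in place of the plain page.goto(url).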

Conclusion

Voilà! And with that, we're done. As you can see, Puppeteer makes all of this extremely easy! After firing up the script, all we need to do now is let it run for several minutes and do its work.