Example Website Screenshot Crawler

dz_omar/example-website-screenshot-crawler

Automated website screenshot crawler using Pyppeteer and Apify. This open-source actor captures screenshots from specified URLs, uploads them to the Apify Key-Value Store, and provides easy access to the results, making it ideal for monitoring website changes and archiving web content.

Website Screenshot Crawler

A template for automated website screenshot capturing. This actor takes screenshots of websites from specified URLs, uploads them to Apify Key-Value Store, and provides screenshot URLs in a dataset. It is ideal for monitoring website changes, archiving web content, or capturing visuals for reports. The actor uses Pyppeteer for browser automation and screenshot generation.

Source Code

The source code for this actor is available in my GitHub account.

Included Features

  • Apify SDK - A toolkit for building Apify Actors and scrapers in Python.
  • Pyppeteer - A Python port of Puppeteer, an open-source tool for automating web browsers using a high-level API.
  • Key-Value Store - Store screenshots and metadata for easy retrieval.
  • Dataset - Structured storage for results like screenshot URLs and metadata.
  • Cookie and Viewport Support - Allows setting cookies and specifying the viewport dimensions before capturing screenshots.

Input

The input for this actor should be JSON containing the necessary configuration. The only required field is link_urls, which must be an array of website URLs. All other fields are optional. Here’s a detailed description of the input fields:

| Field | Type | Description | Allowed Values |
|---|---|---|---|
| link_urls | Array | An array of website URLs to capture screenshots of. | Any valid URL |
| Sleep | Number | Duration to wait after the page has loaded before taking a screenshot (in seconds). | Minimum: 0, Maximum: 3600 |
| waitUntil | String | Event to wait for before taking the screenshot. | One of: "load", "domcontentloaded", "networkidle2", "networkidle0" |
| cookies | Array | Any cookies to set for the browser session. | Array of cookie objects |
| fullPage | Boolean | Whether to capture the full page or just the viewport. | true or false |
| window_Width | Number | Width of the browser viewport. | Minimum: 100, Maximum: 3840 |
| window_Height | Number | Height of the browser viewport. | Minimum: 100, Maximum: 2160 |
| scrollToBottom | Boolean | Whether the browser should scroll to the bottom of the page before taking a screenshot. | true or false |
| distance | Number | Distance (in pixels) to scroll down for each scroll action. | Minimum: 0 |
| delay | Number | Delay (in milliseconds) between scroll actions. | Minimum: 0, Maximum: 3600000 |
| delayAfterScrolling | Number | Delay (in milliseconds) after scrolling to the bottom of the page before taking a screenshot. | Minimum: 0, Maximum: 3600000 |
| waitUntilNetworkIdleAfterScroll | Boolean | Whether to wait for the network to become idle after scrolling to the bottom of the page. | true or false |
| waitUntilNetworkIdleAfterScrollTimeout | Number | Maximum wait time (in milliseconds) for the network to become idle after scrolling. | Minimum: 1000, Maximum: 3600000 |

For more information about the waitUntil parameter, please refer to the Puppeteer page.goto function documentation.
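
To make the constraints in the input table concrete, here is a minimal validation sketch. It is not the actor's own code: the function name validate_input and the values in ASSUMED_DEFAULTS are hypothetical, since this README does not state default values. Only the field names, allowed values, and ranges come from the table.

```python
from typing import Any

# Allowed values and ranges copied from the input table.
WAIT_UNTIL = {"load", "domcontentloaded", "networkidle2", "networkidle0"}
NUMERIC_LIMITS = {
    "Sleep": (0, 3600),
    "window_Width": (100, 3840),
    "window_Height": (100, 2160),
    "distance": (0, None),
    "delay": (0, 3_600_000),
    "delayAfterScrolling": (0, 3_600_000),
    "waitUntilNetworkIdleAfterScrollTimeout": (1000, 3_600_000),
}
# ASSUMPTION: these defaults are illustrative only; the README does not list any.
ASSUMED_DEFAULTS = {"waitUntil": "load", "fullPage": False, "scrollToBottom": False}


def validate_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Return the input merged with the assumed defaults, or raise ValueError."""
    urls = raw.get("link_urls")
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("link_urls is required and must be an array of URL strings")

    config = {**ASSUMED_DEFAULTS, **raw}
    if config["waitUntil"] not in WAIT_UNTIL:
        raise ValueError(f"waitUntil must be one of {sorted(WAIT_UNTIL)}")

    for field, (low, high) in NUMERIC_LIMITS.items():
        if field not in config:
            continue  # optional field that was not supplied
        value = config[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{field} must be a number")
        if value < low or (high is not None and value > high):
            raise ValueError(f"{field} is outside the allowed range")
    return config
```

For example, validate_input({"link_urls": ["https://example.com"], "Sleep": 5}) returns the same input with the assumed defaults filled in, while a window_Width of 50 raises ValueError because it is below the minimum of 100.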

Output

Once the actor finishes, each website's screenshot is saved as a file in the Key-Value Store associated with the run. The screenshot URLs are also stored in a dataset for easy access.
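
The sketch below shows one way the output step could look with the Apify SDK for Python. It is an illustration under assumptions, not the actor's actual code: the helper names random_key, record_url, and upload_screenshot are hypothetical, as are the dataset field names url and screenshot_url. The record URL pattern follows the public Apify API for key-value store records, and store_id is expected to be supplied by the caller.

```python
import secrets

API_BASE = "https://api.apify.com/v2"


def random_key(extension: str = "png") -> str:
    """Build a random record key such as 'screenshot_1a2b3c4d5e6f7a8b.png'."""
    return f"screenshot_{secrets.token_hex(8)}.{extension}"


def record_url(store_id: str, key: str) -> str:
    """URL of a record in an Apify key-value store."""
    return f"{API_BASE}/key-value-stores/{store_id}/records/{key}"


async def upload_screenshot(png_bytes: bytes, page_url: str, store_id: str) -> str:
    """Store one screenshot and push its URL to the dataset.

    Must be awaited inside an `async with Actor:` block.
    """
    # Imported lazily so the pure helpers above work without the SDK installed.
    from apify import Actor

    key = random_key()
    await Actor.set_value(key, png_bytes, content_type="image/png")
    screenshot_url = record_url(store_id, key)
    # ASSUMPTION: the dataset item shape is illustrative, not the actor's schema.
    await Actor.push_data({"url": page_url, "screenshot_url": screenshot_url})
    return screenshot_url
```

Keeping the key and URL helpers free of SDK calls makes them easy to check in isolation before running anything on the platform.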

How It Works

  1. Input Configuration: The actor reads the input data as specified above.
  2. Browser Automation: The actor launches a headless browser using Pyppeteer, loads the target URLs, and captures screenshots.
  3. Setting Cookies and Viewport: Before navigating to each link, specified cookies are set using page.setCookie(), and the viewport is configured with specified width and height.
  4. Page Navigation: The actor navigates to each URL using page.goto(), waiting for the specified waitUntil event.
  5. Scrolling (Optional): If the scrollToBottom option is enabled, the actor executes a scrolling script that scrolls down the page by the defined distance in pixels.
  6. Screenshot Capture: After the page has loaded and any scrolling has finished, the actor waits for the Sleep duration, captures the screenshot, and saves it under a random filename.
  7. Uploading Screenshots: The captured screenshots are read as binary data and uploaded to the Apify Key-Value Store using Actor.set_value(), with URLs stored in the dataset.
  8. Logging and Error Handling: The actor logs the success or failure of each URL processed, ensuring that it can continue processing even if one fails.
  9. Cleanup: After processing all URLs, the actor closes the browser.
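
The browser-side steps above (2 to 6, 8, and 9) can be sketched with Pyppeteer roughly as follows. This is a simplified illustration, not the actor's source: capture_all, viewport_from, and SCROLL_SCRIPT are hypothetical names, the fallback values (1280x720 viewport, 100 ms scroll delay) are assumptions, and the network-idle wait after scrolling is left out. The scroll loop stops once the scrolled distance reaches the page height, so it may run for a long time on pages that keep loading content.

```python
import asyncio
import logging
import uuid
from pathlib import Path

log = logging.getLogger("screenshot_sketch")

# JavaScript run in the page: scroll by `distance` pixels every `delay` ms.
SCROLL_SCRIPT = """
async (distance, delay) => {
    await new Promise((resolve) => {
        let scrolled = 0;
        const timer = setInterval(() => {
            window.scrollBy(0, distance);
            scrolled += distance;
            if (scrolled >= document.body.scrollHeight) {
                clearInterval(timer);
                resolve();
            }
        }, delay);
    });
}
"""


def viewport_from(config: dict) -> dict:
    """Map the input fields to Pyppeteer's viewport options (defaults assumed)."""
    return {
        "width": int(config.get("window_Width", 1280)),
        "height": int(config.get("window_Height", 720)),
    }


async def capture_all(config: dict, out_dir: str = ".") -> dict:
    """Screenshot every URL in config['link_urls']; map each URL to a path or None."""
    from pyppeteer import launch  # lazy import: needs `pip install pyppeteer`

    results = {}
    browser = await launch(headless=True)
    try:
        for url in config["link_urls"]:
            page = await browser.newPage()
            try:
                await page.setViewport(viewport_from(config))
                if config.get("cookies"):
                    # Cookie objects need a `url` or `domain` when set before navigation.
                    await page.setCookie(*config["cookies"])
                await page.goto(url, {"waitUntil": config.get("waitUntil", "load")})
                # Skip scrolling when distance is 0, which would never finish.
                if config.get("scrollToBottom") and config.get("distance", 0) > 0:
                    await page.evaluate(
                        SCROLL_SCRIPT, config["distance"], config.get("delay", 100)
                    )
                    await asyncio.sleep(config.get("delayAfterScrolling", 0) / 1000)
                await asyncio.sleep(config.get("Sleep", 0))  # Sleep is in seconds
                path = str(Path(out_dir) / f"{uuid.uuid4().hex}.png")
                await page.screenshot(
                    {"path": path, "fullPage": bool(config.get("fullPage", False))}
                )
                results[url] = path
                log.info("Captured %s -> %s", url, path)
            except Exception as exc:
                # One failing URL should not stop the remaining ones.
                results[url] = None
                log.error("Failed to capture %s: %s", url, exc)
            finally:
                await page.close()
    finally:
        await browser.close()
    return results
```

With Pyppeteer installed, asyncio.run(capture_all({"link_urls": ["https://example.com"]})) would write one PNG to the current directory and return its path keyed by the URL.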

This open-source actor effectively automates the process of capturing and storing screenshots of multiple web pages, making it a valuable tool for monitoring website changes, archiving content, or generating visual reports.

Resources

Getting Started

To get started with this actor:

  1. Configure the Input: Define your input URLs and configure optional settings like scrolling and sleep duration.
  2. Run the Actor: Execute the actor on the Apify platform or locally using the Apify CLI.

Pull the Actor for Local Development

To develop this actor locally, follow these steps:

  1. Install apify-cli:

    Using Homebrew:

    brew install apify-cli

    Using NPM:

    npm install -g apify-cli
  2. Pull the Actor using its unique <ActorId>:

    apify pull <ActorId>
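
After pulling the actor, a local run needs an input file. The sketch below writes one, assuming the Apify CLI's default local storage layout, in which the input is read from storage/key_value_stores/default/INPUT.json inside the project directory. The helper name write_local_input is hypothetical, and the example values are only placeholders.

```python
import json
from pathlib import Path


def write_local_input(project_dir: str, actor_input: dict) -> Path:
    """Write the actor input where a local run is assumed to look for it."""
    # ASSUMPTION: default local storage layout used by the Apify CLI.
    path = Path(project_dir) / "storage" / "key_value_stores" / "default" / "INPUT.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Placeholder input; only link_urls is required.
    written = write_local_input(
        ".",
        {"link_urls": ["https://example.com"], "fullPage": True, "Sleep": 2},
    )
    print(f"Input written to {written}")
```

Once the file exists, running apify run from the project directory should start the actor locally with that input.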

Example Use Cases

  • Website Monitoring: Capture screenshots periodically to monitor changes to web pages.
  • Visual Archiving: Store visual representations of websites over time for research or archival purposes.
  • Reporting: Automatically capture visuals for reports or presentations.

Contact Information

For any inquiries, you can reach me at:
Email: fridaytechnolog@gmail.com
GitHub: https://github.com/DZ-ABDLHAKIM
Twitter: https://x.com/DZ_45Omar

Developer
Maintained by Community

Actor Metrics

  • 3 monthly users

  • 2 stars

  • 85% runs succeeded

  • Created in Oct 2024

  • Modified 3 months ago