Website Screenshot Crawler
A template for automated website screenshot capturing. This actor takes screenshots of websites from specified URLs, uploads them to Apify Key-Value Store, and provides screenshot URLs in a dataset. It is ideal for monitoring website changes, archiving web content, or capturing visuals for reports. The actor uses Pyppeteer for browser automation and screenshot generation.
Source Code
You can find the source code for this actor in my GitHub account: https://github.com/DZ-ABDLHAKIM
Included Features
- Apify SDK - A toolkit for building Apify Actors and scrapers in Python.
- Pyppeteer - A Python port of Puppeteer, an open-source tool for automating web browsers using a high-level API.
- Key-Value Store - Store screenshots and metadata for easy retrieval.
- Dataset - Structured storage for results like screenshot URLs and metadata.
- Cookie and Viewport Support - Allows setting cookies and specifying the viewport dimensions before capturing screenshots.
Input
The input for this actor should be a JSON object containing the necessary configuration. The only required field is `link_urls`, which must be an array of website URLs; all other fields are optional. Here's a detailed description of the input fields:
| Field | Type | Description | Allowed Values |
|---|---|---|---|
| `link_urls` | Array | An array of website URLs to capture screenshots of. | Any valid URL |
| `Sleep` | Number | Duration (in seconds) to wait after the page has loaded before taking a screenshot. | Minimum: 0, Maximum: 3600 |
| `waitUntil` | String | Event to wait for before taking the screenshot. | One of: `"load"`, `"domcontentloaded"`, `"networkidle2"`, `"networkidle0"` |
| `cookies` | Array | Cookies to set for the browser session. | Array of cookie objects |
| `fullPage` | Boolean | Whether to capture the full page or just the viewport. | `true` or `false` |
| `window_Width` | Number | Width of the browser viewport (in pixels). | Minimum: 100, Maximum: 3840 |
| `window_Height` | Number | Height of the browser viewport (in pixels). | Minimum: 100, Maximum: 2160 |
| `scrollToBottom` | Boolean | Whether the browser should scroll to the bottom of the page before taking a screenshot. | `true` or `false` |
| `distance` | Number | Distance (in pixels) to scroll down for each scroll action. | Minimum: 0 |
| `delay` | Number | Delay (in milliseconds) between scroll actions. | Minimum: 0, Maximum: 3600000 |
| `delayAfterScrolling` | Number | Delay (in milliseconds) after scrolling to the bottom of the page before taking a screenshot. | Minimum: 0, Maximum: 3600000 |
| `waitUntilNetworkIdleAfterScroll` | Boolean | Whether to wait for the network to become idle after scrolling to the bottom of the page. | `true` or `false` |
| `waitUntilNetworkIdleAfterScrollTimeout` | Number | Maximum wait time (in milliseconds) for the network to become idle after scrolling. | Minimum: 1000, Maximum: 3600000 |
For more information about the `waitUntil` parameter, please refer to the Puppeteer `page.goto` function documentation.
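For example, an input capturing full-page screenshots of two sites might look like the following. The field names come from the table above; the values (and the cookie object) are purely illustrative:

```json
{
    "link_urls": ["https://example.com", "https://apify.com"],
    "Sleep": 2,
    "waitUntil": "networkidle2",
    "cookies": [
        { "name": "session", "value": "abc123", "domain": "example.com" }
    ],
    "fullPage": true,
    "window_Width": 1920,
    "window_Height": 1080,
    "scrollToBottom": true,
    "distance": 500,
    "delay": 200
}
```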
Output
Once the actor finishes, it stores a screenshot of each website as a file in the Key-Value Store associated with the run. The screenshot URLs are also stored in a dataset for easy access.
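A dataset record might look like the following. The field names here are illustrative, not the actor's exact schema; the record URL follows the Apify platform's standard Key-Value Store record pattern:

```json
{
    "url": "https://example.com",
    "screenshot_url": "https://api.apify.com/v2/key-value-stores/<storeId>/records/<key>"
}
```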
How It Works
- Input Configuration: The actor reads the input data as specified above.
- Browser Automation: The actor launches a headless browser using Pyppeteer, loads the target URLs, and captures screenshots.
- Setting Cookies and Viewport: Before navigating to each link, specified cookies are set using `page.setCookie()`, and the viewport is configured with the specified width and height.
- Page Navigation: The actor navigates to each URL using `page.goto()`, waiting for the specified `waitUntil` event.
- Scrolling (Optional): If the `scrollToBottom` option is enabled, the actor executes a scrolling script that scrolls down the page by the defined `distance` in pixels.
- Screenshot Capture: After the page has fully loaded, the actor waits for the `Sleep` duration before capturing the screenshot and saves it with a random filename.
- Uploading Screenshots: The captured screenshots are read as binary data and uploaded to the Apify Key-Value Store using `Actor.set_value()`, with their URLs stored in the dataset.
- Logging and Error Handling: The actor logs the success or failure of each URL and continues processing the remaining URLs even if one fails.
- Cleanup: After processing all URLs, the actor closes the browser.
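The following is a minimal sketch of that flow using the Apify Python SDK and Pyppeteer. It is condensed and illustrative rather than the actor's exact source: the default values, the random-filename scheme, and the dataset field names are assumptions, and the screenshot URL is built from the platform's standard record-URL pattern.

```python
import asyncio
import os
import random
import string

from apify import Actor
from pyppeteer import launch


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}

        browser = await launch(headless=True)
        page = await browser.newPage()

        # Configure the viewport and cookies before navigating.
        await page.setViewport({
            'width': actor_input.get('window_Width', 1280),
            'height': actor_input.get('window_Height', 720),
        })
        for cookie in actor_input.get('cookies', []):
            await page.setCookie(cookie)

        for url in actor_input.get('link_urls', []):
            try:
                # Navigate and wait for the configured lifecycle event.
                await page.goto(url, {'waitUntil': actor_input.get('waitUntil', 'load')})

                # Optionally scroll to the bottom in `distance`-pixel steps.
                if actor_input.get('scrollToBottom'):
                    await page.evaluate(
                        '''async (distance, delay) => {
                            while (window.scrollY + window.innerHeight < document.body.scrollHeight) {
                                window.scrollBy(0, distance);
                                await new Promise(r => setTimeout(r, delay));
                            }
                        }''',
                        actor_input.get('distance', 500),
                        actor_input.get('delay', 100),
                    )

                # Wait the configured `Sleep` duration, then capture.
                await asyncio.sleep(actor_input.get('Sleep', 0))
                screenshot = await page.screenshot({'fullPage': actor_input.get('fullPage', False)})

                # Save under a random filename and record the public record URL.
                key = ''.join(random.choices(string.ascii_lowercase + string.digits, k=12)) + '.png'
                await Actor.set_value(key, screenshot, content_type='image/png')
                store_id = os.environ['APIFY_DEFAULT_KEY_VALUE_STORE_ID']
                await Actor.push_data({
                    'url': url,
                    'screenshot_url': f'https://api.apify.com/v2/key-value-stores/{store_id}/records/{key}',
                })
                Actor.log.info(f'Captured {url}')
            except Exception:
                # Log the failure and continue with the remaining URLs.
                Actor.log.exception(f'Failed to capture {url}')

        # Cleanup: close the browser once all URLs are processed.
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())
```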
This open-source actor effectively automates the process of capturing and storing screenshots of multiple web pages, making it a valuable tool for monitoring website changes, archiving content, or generating visual reports.
Getting Started
To get started with this actor:
- Build the Actor: Define your input URLs and configure optional settings like scrolling and sleep duration.
- Run the Actor: Execute the actor on the Apify platform or locally using the Apify CLI.
Pull the Actor for Local Development
To develop this actor locally, follow these steps:
- Install `apify-cli`:

  Using Homebrew:

  ```bash
  brew install apify-cli
  ```

  Using NPM:

  ```bash
  npm install -g apify-cli
  ```

- Pull the Actor using its unique `<ActorId>`:

  ```bash
  apify pull <ActorId>
  ```
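Once pulled, you can run the actor locally with the CLI. This assumes your input is saved in the CLI's default local storage location (`storage/key_value_stores/default/INPUT.json`):

```bash
apify run
```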
Example Use Cases
- Website Monitoring: Capture screenshots periodically to monitor changes to web pages.
- Visual Archiving: Store visual representations of websites over time for research or archival purposes.
- Reporting: Automatically capture visuals for reports or presentations.
Documentation Reference
- Apify SDK for Python
- Apify Platform Documentation
- Pyppeteer API Documentation
- Join the Apify Developer Community
Contact Information
For any inquiries, you can reach me at:
Email: fridaytechnolog@gmail.com
GitHub: https://github.com/DZ-ABDLHAKIM
Twitter: https://x.com/DZ_45Omar