JP Castnet Amazon Scraper
3 days trial then $4.99/month - No credit card required now
This Actor extracts data from amazon.co.jp.
TypeScript PuppeteerCrawler Actor template
This template is a production-ready boilerplate for developing with PuppeteerCrawler. PuppeteerCrawler provides a simple framework for parallel crawling of web pages using headless Chrome with Puppeteer. Because it uses headless Chrome to download web pages and extract data, it is useful for crawling websites that require JavaScript execution.
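To make the idea of parallel crawling concrete, here is a minimal, dependency-free sketch of running page tasks with a bounded number in flight. It is not how PuppeteerCrawler schedules work internally; runWithConcurrency, PageTask, and fakeFetchTitle are hypothetical names chosen for illustration, and the fake task only simulates loading a page.

```typescript
// Illustrative sketch only: NOT PuppeteerCrawler's scheduler.
// runWithConcurrency, PageTask and fakeFetchTitle are hypothetical names.

type PageTask<T> = () => Promise<T>;

// Run tasks with at most `limit` in flight; results keep the input order.
async function runWithConcurrency<T>(tasks: PageTask<T>[], limit: number): Promise<T[]> {
    if (!Number.isInteger(limit) || limit < 1) {
        throw new RangeError('limit must be a positive integer');
    }
    const results: T[] = new Array<T>(tasks.length);
    let next = 0; // index of the next task that no worker has claimed yet

    // Each worker claims one task at a time until none are left.
    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    }

    const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
    await Promise.all(workers);
    return results;
}

// Stand-in for "open a page and read its title"; a real crawler would drive Puppeteer here.
const fakeFetchTitle = (url: string): PageTask<string> => async () => {
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    return `title of ${url}`;
};
```

Calling runWithConcurrency(urls.map(fakeFetchTitle), 2) resolves to the simulated titles in input order while never running more than two tasks at once.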
Included features
- Puppeteer Crawler - simple framework for parallel crawling of web pages using headless Chrome with Puppeteer
- Configurable Proxy - tool for working around IP blocking
- Input schema - define and easily validate a schema for your Actor's input
- Dataset - store structured data where each object stored has the same attributes
- Apify SDK - toolkit for building Actors
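As an illustration of what validating the Actor's input can look like, the sketch below checks an untrusted input object in code. It is not the platform's Input schema mechanism. The input shape (a startUrls array of objects with a url field), the parseActorInput helper, and the rule that every URL must be on amazon.co.jp are assumptions made for this example, and the product URL in the usage note is a placeholder.

```typescript
// Illustrative sketch only; the input shape and helper name are assumptions.
interface ActorInput {
    startUrls: { url: string }[];
}

// Validate untrusted JSON and accept only well-formed amazon.co.jp URLs.
function parseActorInput(raw: unknown): ActorInput {
    if (typeof raw !== 'object' || raw === null) {
        throw new Error('Input must be a JSON object');
    }
    const { startUrls } = raw as { startUrls?: unknown };
    if (!Array.isArray(startUrls) || startUrls.length === 0) {
        throw new Error('Input must contain a non-empty "startUrls" array');
    }
    const valid = startUrls.map((entry: unknown, index: number) => {
        const url = (entry as { url?: unknown } | null | undefined)?.url;
        if (typeof url !== 'string') {
            throw new Error(`startUrls[${index}] is missing a string "url"`);
        }
        const { hostname } = new URL(url); // throws on malformed URLs
        if (hostname !== 'amazon.co.jp' && !hostname.endsWith('.amazon.co.jp')) {
            throw new Error(`startUrls[${index}] is not an amazon.co.jp URL: ${url}`);
        }
        return { url };
    });
    return { startUrls: valid };
}
```

For example, parseActorInput({ startUrls: [{ url: 'https://www.amazon.co.jp/dp/EXAMPLE' }] }) returns the same single entry, while an input with a URL on another host throws.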
How it works
- Actor.getInput() gets the input from INPUT.json, where the start URLs are defined.
- Create a configuration for proxy servers to be used during the crawling with Actor.createProxyConfiguration() to work around IP blocking. Use Apify Proxy or your own proxy URLs, which are provided and rotated according to the configuration. You can read more about proxy configuration in the Apify documentation.
- Create an instance of Crawlee's Puppeteer Crawler with new PuppeteerCrawler(). You can pass options to the crawler constructor, such as:
  - proxyConfiguration - provide the proxy configuration to the crawler
  - requestHandler - handle each request with the custom router defined in the routes.js file
- Handle requests with the custom router from the routes.js file. Read more about custom routing in the Crawlee documentation.
  - Create a new router instance with createPuppeteerRouter()
  - Define a default handler, which is called for all URLs that are not handled by other handlers, by adding router.addDefaultHandler(() => { ... })
  - Define additional handlers - here you can add your own handling of the page

        router.addHandler('detail', async ({ request, page, log }) => {
            const title = await page.title();
            // You can add your own page handling here

            await Dataset.pushData({
                url: request.loadedUrl,
                title,
            });
        });

- crawler.run(startUrls); starts the crawler and waits for it to finish
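The routing steps above can be sketched without any dependencies. The block below is not Crawlee's router and does not call createPuppeteerRouter(); it only mimics the behavior described in the text, where a labelled handler takes a matching request and the default handler takes everything else. LabelRouter, CrawlContext, dispatch, and demoRouting are hypothetical names, and the URLs are placeholders.

```typescript
// Illustrative sketch only: NOT Crawlee's router implementation.
// LabelRouter, CrawlContext, dispatch and demoRouting are hypothetical names.

interface CrawlContext {
    url: string;
    label?: string; // optional label used to pick a handler
}

type RouteHandler = (ctx: CrawlContext) => Promise<void>;

class LabelRouter {
    private readonly handlers = new Map<string, RouteHandler>();
    private defaultHandler?: RouteHandler;

    // Register a handler for requests carrying the given label.
    addHandler(label: string, handler: RouteHandler): void {
        this.handlers.set(label, handler);
    }

    // Register the fallback used when no labelled handler matches.
    addDefaultHandler(handler: RouteHandler): void {
        this.defaultHandler = handler;
    }

    // Pick the labelled handler if one matches, otherwise the default one.
    async dispatch(ctx: CrawlContext): Promise<void> {
        const labelled = ctx.label !== undefined ? this.handlers.get(ctx.label) : undefined;
        const handler = labelled ?? this.defaultHandler;
        if (!handler) {
            throw new Error(`No handler registered for ${ctx.url}`);
        }
        await handler(ctx);
    }
}

// Route three placeholder requests and report which handler processed each one.
async function demoRouting(): Promise<string[]> {
    const handled: string[] = [];
    const router = new LabelRouter();
    router.addDefaultHandler(async ({ url }) => { handled.push(`default:${url}`); });
    router.addHandler('detail', async ({ url }) => { handled.push(`detail:${url}`); });

    await router.dispatch({ url: 'https://www.amazon.co.jp/dp/EXAMPLE', label: 'detail' });
    await router.dispatch({ url: 'https://www.amazon.co.jp/' });
    await router.dispatch({ url: 'https://www.amazon.co.jp/s?k=example', label: 'unknown' });
    return handled;
}
```

In this sketch, the request labelled detail goes to the detail handler, while the unlabelled request and the one with an unregistered label both fall through to the default handler.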
Resources
If you're looking for examples or want to learn more visit:
- Crawlee + Apify Platform guide
- Documentation and examples
- Node.js tutorials in Academy
- How to scale Puppeteer and Playwright
- Video guide on getting data using Apify API
- Integration with Make, GitHub, Zapier, Google Drive, and other apps
- A short guide on how to build web scrapers using code templates
Getting started
For complete information, see this article. To run the Actor, use the following command:
apify run
Deploy to Apify
Connect Git repository to Apify
If you've created a Git repository for the project, you can easily connect it to Apify:
- Go to the Actor creation page
- Click the Link Git Repository button
Push project on your local machine to Apify
You can also deploy the project from your local machine to Apify without a Git repository.
- Log in to Apify. You will need to provide your Apify API Token to complete this action.
  apify login
- Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.
  apify push
Actor Metrics
4 monthly users
2 stars
25% runs succeeded
Created in Oct 2023
Modified 3 months ago