GPT Scraper

Pay $9.00 for 1,000 pages

drobnikj/gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Scraper fails with Google Play Store URLs

Open

blkbox opened this issue
10 days ago

Error logs

2024-09-10T11:49:34.199Z ACTOR: Pulling Docker image of build nXM2iV00mWsXHDDu6 from repository.
2024-09-10T11:49:34.297Z ACTOR: Creating Docker container.
2024-09-10T11:49:35.463Z ACTOR: Starting Docker container.
2024-09-10T11:49:36.674Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-09-10T11:49:36.676Z Executing main command
2024-09-10T11:49:39.630Z INFO  System info {"apifyVersion":"3.1.16","apifyClientVersion":"2.9.3","crawleeVersion":"3.8.1","osType":"Linux","nodeVersion":"v18.20.4"}
2024-09-10T11:49:39.773Z INFO  Max pages per crawl: 1
2024-09-10T11:49:40.462Z INFO  Configuration completed. Starting the crawl.
2024-09-10T11:49:40.549Z INFO  PlaywrightCrawler: Starting the crawler.
2024-09-10T11:49:45.697Z INFO  Opening https://play.google.com/store/apps/details?id=com.hardrockdigital.client&hl=en_US&gl=US...
2024-09-10T11:49:46.292Z WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.evaluate: TypeError: Failed to execute 'parseFromString' on 'DOMParser': This document requires 'TrustedHTML' assignment.
2024-09-10T11:49:46.295Z     at eval (eval at evaluate (:226:30), <anonymous>:2:37)
2024-09-10T11:49:46.298Z     at UtilityScript.evaluate (<anonymous>:228:17)
2024-09-10T11:49:46.300Z     at UtilityScript.<anonymous> (<anonymous>:1:44)
2024-09-10T11:49:46.303Z     at shrinkHtml (/home/myuser/dist/processors.js:10:33) {"id":"u95wgQVoq9JiBtG","url":"https://play.google.com/store/apps/details?id=com.hardrockdigital.client&hl=en_US&gl=US","retryCount":1}
2024-09-10T11:49:49.777Z INFO  PlaywrightCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
2024-09-10T11:49:49.779Z INFO  PlaywrightCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
2024-09-10T11:49:50.703Z INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,1],"requestAvgFailedDurationMillis":201,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":5,"requestTotalDurationMillis":201,"requestsTotal":1,"crawlerRuntimeMillis":10240}
2024-09-10T11:49:50.706Z INFO  PlaywrightCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Skipping this page (file:///home/myuser/dist/crawler.js:27:33)"]}
2024-09-10T11:49:50.708Z INFO  PlaywrightCrawler: Finished! Total 1 requests: 0 succeeded, 1 failed. {"terminal":true}
2024-09-10T11:49:50.717Z INFO  Crawler finished.
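
The TypeError in the log above is the message Chromium raises when a page enforces Trusted Types and a plain string reaches an injection sink such as DOMParser.parseFromString. Below is a minimal sketch of one possible workaround, assuming the in-page code is allowed to create its own Trusted Types policy. The helper name wrapHtmlForSink and the policy name "scraper-passthrough" are hypothetical and are not taken from the Actor's source. If the page's CSP restricts which policy names may be created, this would not help, and parsing the HTML outside the page context would be the alternative.

```typescript
// Minimal structural type so the sketch compiles without DOM typings.
// It mirrors only the part of window.trustedTypes that is used here.
interface TrustedTypesLike {
  createPolicy(
    name: string,
    rules: { createHTML: (input: string) => string },
  ): { createHTML(input: string): unknown };
}

/**
 * Wraps an HTML string in a TrustedHTML value when a Trusted Types
 * factory is available; otherwise returns the plain string unchanged.
 * "scraper-passthrough" is a hypothetical policy name.
 */
function wrapHtmlForSink(html: string, trustedTypes?: TrustedTypesLike): unknown {
  if (!trustedTypes) {
    // Browser or page without Trusted Types: plain strings are accepted.
    return html;
  }
  try {
    const policy = trustedTypes.createPolicy("scraper-passthrough", {
      // Pass-through: the HTML comes from the page itself and is only parsed.
      createHTML: (input) => input,
    });
    return policy.createHTML(html);
  } catch {
    // Policy creation can be blocked by the page's CSP. Returning the plain
    // string means the sink would still throw, as in the log above.
    return html;
  }
}

// Hypothetical in-page usage (inside a page.evaluate callback):
//   const safe = wrapHtmlForSink(html, (window as any).trustedTypes);
//   const doc = new DOMParser().parseFromString(safe as string, "text/html");
```

Functions passed to page.evaluate are serialized and run in the browser, so a helper like this would need to be defined inside the evaluated function rather than imported from Node.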

blkbox

10 days ago

Input used

{
  "startUrls": [
    {
      "url": "https://play.google.com/store/apps/details?id=com.hardrockdigital.client&hl=en_US&gl=US"
    }
  ],
  "maxPagesPerCrawl": 1,
  "maxCrawlingDepth": 1,
  "linkSelector": "a[href]",
  "instructions": " \"I have a website link that provides information about a specific brand. \"\n          \"Please visit the link and extract detailed information about the brand.\n\"\n          \"Include the following details in your response:\n\n\"\n          \"Brand Name: The official name of the brand.\n\"\n          \"Overview: A brief description of what the brand is about, including its mission, vision, and values.\n\"\n          \"Products/Services: A list of main products or services offered by the brand.\n\"\n          \"Category of products: Provide only the name of categories of product\"\n          \"Unique selling points:\n\"\n          \"Target Audience: The primary demographic or market segment that the brand caters to.\n\"\n          \"Contact Information: How to get in touch with the brand, including customer service, social media handles, and physical addresses if available.\n\"\n          \"Website Link: The original website link provided for reference.\n\"\n          \"Social Media Handles:\n\n\"\n          Provide the output in JSON format with keys representing above points and values representing the details.\n           ",
  "extractor": {
    "type": "gpt",
    "model": "gpt-4",
    "maxTokens": 4000
  },
  "temperature": "0.1",
  "top_p": "0.1",
  "response_format": "json_object",
  "includeUrlGlobs": [],
  "excludeUrlGlobs": [],
  "initialCookies": [],
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "topP": "1",
  "frequencyPenalty": "0",
  "presencePenalty": "0",
  "removeElementsCssSelector": "script, style, noscript, path, svg, xlink",
  "pageFormatInRequest": "Markdown",
  "dynamicContentWaitSecs": 0,
  "removeLinkUrls": false,
  "saveSnapshots": true
}

lukas.prusa

Hi, thanks for opening this issue!

I can see that you've found the original issue from a different user for the same website here - https://console.apify.com/actors/paOtbjvyUiNsr1Qms/issues/puLBJmEVQSvgzuXw7

Seeing more users having the same problem, we will prioritize this issue more :)

I will keep you updated here, thanks!

blkbox

a day ago

Hey Lukas,

Thanks for the response. Can we do anything to move this to the top of your priority list?

Context: We are a bootstrapped startup that has just launched this product, and we expect to scrape around 4-5K pages per day.

This is a blocker for our business; we will be forced to look for alternatives if this doesn't work.

PS: We love your tool; it has unlocked key capabilities for us. We would be happy to set up a chat with our engineering team if that can help resolve the issue quicker.

blkbox

a day ago

I tried scraping the same URL with apify/playwright-scraper, and it seems to work there.

lukas.prusa

Thanks for your insight, we will try to finish it this week :)

It's most likely just some very stupid bug in our backend that's somehow magically clashing with this very specific website, so it's just incredibly annoying to catch.

blkbox

14 hours ago

Thank you, looking forward to the fix. :)

Happy debugging, a rubber duck might help ;)

Developer
Maintained by Apify