seloger mass products scraper (by search URL) ⚡
3 days trial then $25.00/month - No credit card required now
🔥 Very simple! Enter the link to the search page and get the results! ⚡ Quickly extract detailed property information (title, description, photos, energy ratings, price, contacts, transport and more) at low cost, with export to JSON, CSV, HTML, Excel...
Hi, it seems that since the last version (0.0.172) the actor crashes at startup with my URL:
with this log:
2024-12-13T18:27:13.713Z ACTOR: Pulling Docker image of build jZtjIWkrVu7xa6wDJ from repository.
2024-12-13T18:27:14.412Z ACTOR: Creating Docker container.
2024-12-13T18:27:14.511Z ACTOR: Starting Docker container.
2024-12-13T18:27:18.781Z INFO System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.1","osType":"Linux","nodeVersion":"v20.18.1"}
2024-12-13T18:27:18.930Z bypassing bot protection... Please be patient :)
2024-12-13T18:28:16.408Z WARN Request: We've encountered a POST Request with a payload. This is fine. Just letting you know that if your requests point to the same URL and differ only in method and payload, you should see the "useExtendedUniqueKey" option of Request constructor.
2024-12-13T18:28:30.094Z pass through....
and it finishes without any result or query. Can you fix that?
thanks
Hi!
checking this...
Well, it seems the Apify FR proxies were flagged and blocked by DataDome. I'll set the actor up to use my own proxies (which I pay for) within the hour, and later change it so that users supply their own proxies as an input (something I had avoided so far, to keep things as simple as possible for users).
I have a question: I noticed that you're always trying to scrape all of the 7K listings. Is that needed? (I'm asking to understand your use case)
I'll update it to use proxies from more EU countries, not only France, which should help. Will let you know.
Should be up again! Could you please confirm? And please reach out to me and let's discuss your specific use-case and see whether improvements/adjustments could be made :) My Discord username is @azzouzana
I'm trying to fetch new ads matching specific filters once a day.
Thanks for your help, I will test it.
I can plan to work on a mode that would definitely help you out: based on the outcome of previous executions, the actor would only scrape new listings and return delisted items.
Hi, I’ve released the Delta Mode feature, which, based on a checkbox input, instructs the actor to return only new or delisted ads since its last run. To use it, please use the version 0.1. Test it out with a small listing count first and let me know how it works. Thank you!
I tried the new version but it fails with this message: 2024-12-18T04:36:22.055Z Not paying user, only handling first 50 results. To get all results, please subscribe (can you check it?)
Some points, in case they help you in the future:
- an easy way to improve performance and avoid "caching" could be an input parameter to select a date, and only scrape the details of ads published after this date (this can be more efficient than fetching everything)
- another quick tip: limit the number of ads to fetch
Thanks
Thanks a lot for the feedback.
I've just pushed an attempted fix for the isPaying check, please let me know how it goes. (If you're a paid Apify user and you're still facing that, please let me know; it is most likely something with the Apify platform, but I believe it should be good.) Now regardless, and to test the new monitoring mode, could you please try it with a search that has fewer than 50 listings and let me know your feedback.
- an easy way to improve performance and avoid "caching" could be an input parameter to select a date, and only scrape the details of ads published after this date (this can be more efficient than fetching everything) => Thanks, that definitely makes sense! Noted!
- another quick tip: limit the number of ads to fetch => I previously worked on this but it didn't work well with monitoring mode enabled, so I will have to think about it again. They probably have to be mutually exclusive.
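The two suggestions above (a date cutoff and a cap on the number of ads) could be sketched roughly as follows. This is only an illustration, not the actor's code: the option names `publishedAfter` and `maxItems` and the `publishedAt` field are hypothetical, and it assumes a publication date is available before the detail page is fetched.

```javascript
// Hypothetical listing shape: { url, publishedAt }, with publishedAt as an ISO 8601 string.
// The option names (publishedAfter, maxItems) are illustrative, not real actor inputs.
function selectListingsToScrape(listings, { publishedAfter = null, maxItems = Infinity } = {}) {
  const cutoff = publishedAfter ? Date.parse(publishedAfter) : null;
  const selected = [];
  for (const listing of listings) {
    // Stop as soon as the cap is reached.
    if (selected.length >= maxItems) break;
    // Skip ads published on or before the cutoff. Ads whose date cannot be
    // parsed (NaN) fail this comparison and are therefore kept.
    if (cutoff !== null && Date.parse(listing.publishedAt) <= cutoff) continue;
    selected.push(listing);
  }
  return selected;
}
```

Only the listings returned by this function would then go through the (more expensive) detail scrape.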
Hi,
I'm making progress in my testing. It seems better, but I encounter this error:
2024-12-20T01:29:01.023Z /usr/src/app/node_modules/@crawlee/core/storages/dataset.js:41
2024-12-20T01:29:01.026Z throw new Error(`Data item${s}is too large (size: ${bytes} bytes, limit: ${limitBytes} bytes)`);
2024-12-20T01:29:01.028Z ^
2024-12-20T01:29:01.030Z
2024-12-20T01:29:01.032Z Error: Data item is too large (size: 71529285 bytes, limit: 9436240 bytes)
2024-12-20T01:29:01.034Z at checkAndSerialize (/usr/src/app/node_modules/@crawlee/core/storages/dataset.js:41:15)
2024-12-20T01:29:01.036Z at Dataset.pushData (/usr/src/app/node_modules/@crawlee/core/storages/dataset.js:206:29)
2024-12-20T01:29:01.038Z at Actor.pushData (/usr/src/app/node_modules/apify/actor.js:527:24)
2024-12-20T01:29:01.040Z at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-12-20T01:29:01.043Z at async file:///usr/src/app/src/main.js:86:5
Can you catch this error and continue the process?
Best
It's something with the dataset size limit. Please share the run and I will check this first thing tomorrow. (Also, did you confirm monitoring is OK with not-so-large search results?)
You can find the run here : https://console.apify.com/organization/TzEYl4RGm5rKPyOU5/actors/dqFjeUv7Nrv7lRatk/runs/ZMRSNKZwQFCvDOmOs
From the documentation (if this can help):
The size of the data is limited by the receiving API and therefore pushData() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.

The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates.
Regarding monitoring mode: with a small base it seems to work as expected.
Thanks
I think this is related to the fact that you now try to send a single row with everything in "newsAds" (and for a big result set you reach the 9MB limit).
I think you should use the same output as before (one row per ad) and maybe add a "state": "new" or "state": "delisted" field to each row. This would be more useful for debugging and checking results in the console.
Thanks for the feedback!
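A minimal sketch of that suggestion, assuming the actor holds two arrays of ad objects (the names `newAds` and `delistedAds` and the `url` field are hypothetical). It emits one row per ad with a "state" field and sets aside any row whose JSON is larger than the limit reported in the error log, so that one oversized item does not abort the whole run:

```javascript
// Limit taken from the error message in the log above (bytes per dataset item).
const LIMIT_BYTES = 9436240;

// newAds / delistedAds are hypothetical arrays of plain ad objects with a `url` field.
function toDatasetRows(newAds, delistedAds, limitBytes = LIMIT_BYTES) {
  const tagged = [
    ...newAds.map((ad) => ({ ...ad, state: 'new' })),
    ...delistedAds.map((ad) => ({ ...ad, state: 'delisted' })),
  ];
  const rows = [];
  const skipped = [];
  for (const row of tagged) {
    // Measure the serialized size in bytes, since the limit applies to the JSON form.
    const bytes = Buffer.byteLength(JSON.stringify(row), 'utf8');
    if (bytes > limitBytes) {
      skipped.push({ url: row.url, bytes }); // report instead of crashing
    } else {
      rows.push(row);
    }
  }
  return { rows, skipped };
}
```

The resulting `rows` array could then be passed to `Actor.pushData(rows)`; per the documentation quoted above, an array may be of any size as long as each item stays under the limit.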
For the size limitation, that's definitely it. Will work on it this weekend.
Hello, any news?
Hey 👋 I've worked on adjusting the delta mode, and there's now a field "apify_monitoring_status" which indicates whether the ad is new or delisted. Could you test it out with a small listing and let me know? Thanks!
Hey, it seems this works (yay), but I have an issue regarding monitoring mode.
Between executions, monitoring mode seems to detect all URLs as "new" (and so crawls the whole list). Can you share how you identify an ad as "new"? Can you confirm whether this is based on the permalink without parameters?
Thanks
If this can help you, it seems monitoring detection occurs after fetching, because at the end I have a dataset output containing only "new" and "delisted" items. Can you update the code to "deep scrape" an ad only if it is "new"?
Best
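One way the comparison described above could work, as a sketch only (how the actor actually identifies ads is not stated in this thread): derive a key from each permalink with its query string and fragment removed, diff the keys against those stored from the previous run, and deep scrape only the URLs whose key is new. The function names and the example.com URLs are hypothetical.

```javascript
// Build a stable key from a permalink: origin + path, without query string or fragment.
function adKey(url) {
  const u = new URL(url);
  return `${u.origin}${u.pathname}`.replace(/\/$/, '');
}

// previousKeys: keys saved at the end of the last run.
// currentUrls: ad URLs found on the search pages of this run (before any detail scrape).
function diffRuns(previousKeys, currentUrls) {
  const previous = new Set(previousKeys);
  const current = new Map(); // key -> first URL seen for that key
  for (const url of currentUrls) {
    const key = adKey(url);
    if (!current.has(key)) current.set(key, url);
  }
  const newUrls = [...current].filter(([key]) => !previous.has(key)).map(([, url]) => url);
  const delistedKeys = [...previous].filter((key) => !current.has(key));
  return { newUrls, delistedKeys, currentKeys: [...current.keys()] };
}

// Example: the same ad reached with different tracking parameters is not reported as new.
const result = diffRuns(
  ['https://example.com/ads/1.htm', 'https://example.com/ads/2.htm'],
  ['https://example.com/ads/2.htm?m=search', 'https://example.com/ads/3.htm?x=1#top'],
);
console.log(result.newUrls, result.delistedKeys);
```

Only `newUrls` would be queued for the detail scrape, and `currentKeys` would be stored for the next run.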
Hi!
Thanks for the feedback and for catching that bug, and thanks also for bearing with me :) Definitely worth it. Working on it today. I'll let you know.
Hi,
Happy new year! I'm coming back to you about the issue. Can you fix it?
Thanks
It seems OK now. Thanks for your help.
Hi & happy new year!
This was OK since last week but I forgot to follow up here. Thanks for your feedback! I'm closing this issue now.
Actor Metrics
7 monthly users
2 stars
>99% runs succeeded
1.6 hours response time
Created in Jul 2024
Modified 24 days ago