Leboncoin extractor
3 days trial then $19.00/month - No credit card required now
Extract information from leboncoin.fr: no limitations, and you get fast results in CSV, Excel... or API format. The best scraping tool for leboncoin.
Hi !
I just tried to retrieve an ad on LBC but Apify responded with "No results". Is this actor still working or is it discontinued?
Best regards,
Run id: https://console.apify.com/actors/xke8akCiaoyOQmnFg/runs/QueUCCpPl0zOFKuF5#output
Test url: https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
Log error:
2023-05-25T12:56:33.633Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: net::ERR_TOO_MANY_RETRIES at https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
2023-05-25T12:56:33.635Z     at navigate (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:235:23)
2023-05-25T12:56:33.637Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)
2023-05-25T12:56:33.640Z     at async Frame.goto (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:205:21)
2023-05-25T12:56:33.642Z     at async CDPPage.goto (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:1053:16)
2023-05-25T12:56:33.644Z     at async PuppeteerCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:285:40)
2023-05-25T12:56:33.646Z     at async PuppeteerCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:227:13)
2023-05-25T12:56:33.648Z     at async PuppeteerCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/puppeteer/internals/puppeteer-crawler.js:110:9)
2023-05-25T12:56:33.650Z     at async wrap (/home/myuser/node_modules/@apify/timeout/index.js:52:21) {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","method":"GET","uniqueKey":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm"}
It works on my side, so let me try to debug your case. Could you send me the "INPUT" of your last run so that I can reproduce it?
Hello Guillim,
I just started 1 run with the same settings as you (residential proxy), the default function in your documentation, and the search page url : https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2 I can't see any result. Here is the input :
{
  "pageFunction": "async function pageFunction(context) {\n let data = {}\n let userData = context.request.userData\n data.url = context.request.url\n data.label = userData.label\n // data.title = await context.page.title();\n // context.log.info(data.title);\n\n if(userData && userData.label === 'product'){ \n context.log.info('label product.'); \n data.img = await context.page.locator('[data-qa-id=adview_spotlight_container] img >> nth=0').getAttribute('src')\n data.title = await context.page.locator('[data-qa-id=adview_title] >> nth=0').innerText()\n data.price = await context.page.locator('[data-qa-id=adview_price] >> nth=0').innerText()\n data.date = await context.page.locator('[data-qa-id=adview_date] >> nth=0').innerText()\n data.description = await context.page.locator('[data-qa-id=adview_description_container] >> nth=0').innerText()\n // data.link = userData.link\n }else{\n context.log.info('not label product, so search or pagination.');\n let products = []\n // we are looking for product to be queued, let's write it down\n userData.label = 'product';\n const elements = context.page.locator('[data-qa-id=aditem_container]');\n const links = await elements.evaluateAll(elems => elems.map(elem => \"https://www.leboncoin.fr\"+elem.getAttribute('href')));\n // await context.enqueueRequest('https://www.leboncoin.fr/recherche?category=21&text=got&price=17-50', {test : 'test'}, false);\n links.forEach(async link => {\n await context.enqueueRequest(link, userData , false);\n })\n // data.products = products\n }\n context.log.info(`function ended`);\n return data;\n}\n",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "FR"
  },
  "startUrls": [
    {
      "url": "https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"
    }
  ]
}
Here are the logs:
2023-05-26T08:55:40.992Z ACTOR: Pulling Docker image from repository.
2023-05-26T08:55:41.697Z ACTOR: Creating Docker container.
2023-05-26T08:55:41.973Z ACTOR: Starting Docker container.
2023-05-26T08:55:43.277Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-26T08:55:43.281Z Executing main command
2023-05-26T08:55:45.155Z INFO System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-26T08:55:45.873Z INFO PuppeteerCrawler: Starting the crawl
2023-05-26T08:56:45.872Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60354,"retryHistogram":[]}
2023-05-26T08:56:45.880Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.077},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.905},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:56:46.354Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-26T08:57:45.875Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120354,"retryHistogram":[]}
2023-05-26T08:57:45.885Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:57:49.951Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-26T08:58:45.873Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180354,"retryHistogram":[]}
2023-05-26T08:58:45.889Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:58:54.367Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-26T08:59:45.874Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240355,"retryHistogram":[]}
2023-05-26T08:59:45.891Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:59:58.081Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-26T08:59:58.138Z INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-26T08:59:58.417Z INFO PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60243,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60243,"requestsTotal":1,"crawlerRuntimeMillis":252898}
2023-05-26T08:59:58.418Z INFO PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-26T08:59:58.420Z Crawler finished.
2023-05-26T08:59:58.421Z INFO Actor finished successfully (exit code 0)
Please find the screenshot attached.
Best regards,
I also tried with a single ad url : "https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm" Same results. Logs :
2023-05-26T08:53:19.241Z ACTOR: Pulling Docker image from repository.
2023-05-26T08:53:19.399Z ACTOR: Creating Docker container.
2023-05-26T08:53:19.723Z ACTOR: Starting Docker container.
2023-05-26T08:53:20.276Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-26T08:53:20.277Z Executing main command
2023-05-26T08:53:21.783Z INFO System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-26T08:53:22.943Z INFO PuppeteerCrawler: Starting the crawl
2023-05-26T08:53:58.542Z INFO PuppeteerCrawler: handling: https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
2023-05-26T08:54:02.540Z INFO PuppeteerCrawler: not label product, so search or pagination.
2023-05-26T08:54:02.541Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. TypeError: context.page.locator is not a function
2023-05-26T08:54:02.542Z     at pageFunction (evalmachine.<anonymous>:22:39)
2023-05-26T08:54:02.543Z     at file:///home/myuser/main.js:28:24
2023-05-26T08:54:02.544Z     at runMicrotasks (<anonymous>)
2023-05-26T08:54:02.544Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)
2023-05-26T08:54:02.545Z     at async wrap (/home/myuser/node_modules/@apify/timeout/index.js:52:21) {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":1}
2023-05-26T08:54:22.943Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60360,"retryHistogram":[]}
2023-05-26T08:54:22.947Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.02},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.036},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:55:06.045Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":2}
2023-05-26T08:55:22.943Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120361,"retryHistogram":[]}
2023-05-26T08:55:22.948Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:56:10.133Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":3}
2023-05-26T08:56:22.943Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180360,"retryHistogram":[]}
2023-05-26T08:56:22.953Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:57:13.378Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","method":"GET","uniqueKey":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm"}
2023-05-26T08:57:13.447Z INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-26T08:57:13.861Z INFO PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60129,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60129,"requestsTotal":1,"crawlerRuntimeMillis":231278}
2023-05-26T08:57:13.862Z INFO PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-26T08:57:13.863Z Crawler finished.
2023-05-26T08:57:13.864Z INFO Actor finished successfully (exit code 0)
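The `TypeError: context.page.locator is not a function` entry in this log suggests that the custom function was written against Playwright's `locator` API (the `>> nth=0` selector suffix is also Playwright syntax), while the log shows the actor running a `PuppeteerCrawler`. A minimal sketch of the same field extraction using Puppeteer's `page.$eval`, assuming the `data-qa-id` selectors from the custom function above still match the page; `firstText`, `firstAttr` and `extractAd` are hypothetical helper names, not part of the actor:

```javascript
// Hypothetical helpers (not part of the actor). page.$eval runs the callback
// on the first element matching the selector, inside the browser page.
async function firstText(page, selector) {
    // Any failure, including "no element matches", becomes null.
    return page.$eval(selector, (el) => el.innerText).catch(() => null);
}

async function firstAttr(page, selector, name) {
    // Extra arguments to $eval are passed through to the callback.
    return page.$eval(selector, (el, attr) => el.getAttribute(attr), name).catch(() => null);
}

// Same fields as the "product" branch of the custom function, without locator().
async function extractAd(page) {
    return {
        img: await firstAttr(page, '[data-qa-id=adview_spotlight_container] img', 'src'),
        title: await firstText(page, '[data-qa-id=adview_title]'),
        price: await firstText(page, '[data-qa-id=adview_price]'),
        date: await firstText(page, '[data-qa-id=adview_date]'),
        description: await firstText(page, '[data-qa-id=adview_description_container]'),
    };
}
```

Inside a page function this would be called as `await extractAd(context.page)`. It only addresses the `locator` error; it does not help with the navigation timeouts shown in the same log.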
Can you give me your input so I can copy/paste it and test it with the same data?
I think I found out where the problem is. It comes from the customisation of the "Function" running on Leboncoin. Your configuration is almost right.
What to do: On the Apify platform, click on your "Leboncoin extractor" Actor and, on the "Source" tab, click on the "Input" sub-tab. At the bottom of this page, you should be able to click "Restore default input" as in the attached screenshot. After you've clicked, you should see that the "Function" has changed. You will need to enter again the URL you want to scrape ( https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2 )
If that does not work, please copy this and paste it into the "Function" editor:
async function pageFunction(context) {
    let data = {}
    let userData = context.request.userData
    data.url = context.request.url
    data.label = userData.label

    let items = await context.page.evaluate(() => {
        const item = $('[data-qa-id=aditem_container]')
        const itemInfo = item.map(function(i,elem) {
            let obj = {}
            obj.title = $(this).find('[data-qa-id=aditem_title]').text()
            obj.price = $(this).find('[data-test-id=price]').text()
            obj.location = $(this).find('span').filter(function() { return this.title.match(/[0-9]{5}/);}).text()
            obj.date = $(this).find('span').filter(function() { return this.title.match(/:/);}).text()
            obj.img = $(this).find('[data-test-id=adcard-consumer-goods-list] img').attr('src')
            obj.rank = i+1
            return obj
        }).get()
        return itemInfo
    })
    let itemsWithDataProp = items.map(obj => {
        for(const key of Object.keys(data) ){
            obj[key] = data[key]
        }
        return obj
    })
    return itemsWithDataProp;
}
Hello Guillim,
This morning I tried the "Restore default input". I didn't get any results either. Here is the input:
{
  "pageFunction": "async function pageFunction(context) {\n let data = {}\n let userData = context.request.userData\n data.url = context.request.url\n data.label = userData.label\n\n let items = await context.page.evaluate(() => {\n const item = $('[data-qa-id=aditem_container]')\n const itemInfo = item.map(function(i,elem) {\n let obj = {}\n obj.title = $(this).find('[data-qa-id=aditem_title]').text()\n obj.price = $(this).find('[data-test-id=price]').text()\n obj.location = $(this).find('span').filter(function() { return this.title.match(/[0-9]{5}/);}).text()\n obj.date = $(this).find('span').filter(function() { return this.title.match(/:/);}).text()\n obj.img = $(this).find('[data-test-id=adcard-consumer-goods-list] img').attr('src')\n obj.rank = i+1\n return obj\n }).get()\n return itemInfo\n })\n let itemsWithDataProp = items.map(obj => { \n for(const key of Object.keys(data) ){\n obj[key] = data[key]\n }\n return obj\n })\n return itemsWithDataProp;\n}\n",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "FR"
  },
  "startUrls": [
    {
      "url": "https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"
    }
  ]
}
And the log :
2023-05-30T08:53:29.483Z ACTOR: Pulling Docker image from repository.
2023-05-30T08:53:29.638Z ACTOR: Creating Docker container.
2023-05-30T08:53:29.780Z ACTOR: Starting Docker container.
2023-05-30T08:53:30.508Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-30T08:53:30.511Z Executing main command
2023-05-30T08:53:32.233Z INFO System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-30T08:53:32.923Z INFO PuppeteerCrawler: Starting the crawl
2023-05-30T08:54:32.923Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60380,"retryHistogram":[]}
2023-05-30T08:54:32.927Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:54:33.349Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-30T08:55:32.923Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120380,"retryHistogram":[]}
2023-05-30T08:55:32.930Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:55:36.803Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-30T08:56:32.923Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180380,"retryHistogram":[]}
2023-05-30T08:56:32.935Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.075},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:56:40.603Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-30T08:57:32.923Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240380,"retryHistogram":[]}
2023-05-30T08:57:32.939Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:57:44.058Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-30T08:57:44.119Z INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-30T08:57:44.374Z INFO PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60273,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60273,"requestsTotal":1,"crawlerRuntimeMillis":251831}
2023-05-30T08:57:44.377Z INFO PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-30T08:57:44.379Z Crawler finished.
2023-05-30T08:57:44.382Z INFO Actor finished successfully (exit code 0)
Then I tried with a copy-paste of your input, and I still get the same kind of log:
2023-05-30T17:00:11.545Z ACTOR: Pulling Docker image from repository.
2023-05-30T17:00:12.313Z ACTOR: Creating Docker container.
2023-05-30T17:00:12.487Z ACTOR: Starting Docker container.
2023-05-30T17:00:16.156Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-30T17:00:16.160Z Executing main command
2023-05-30T17:00:19.200Z INFO System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-30T17:00:20.807Z INFO PuppeteerCrawler: Starting the crawl
2023-05-30T17:01:11.853Z INFO PuppeteerCrawler: handling: https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2
2023-05-30T17:01:20.808Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60679,"retryHistogram":[]}
2023-05-30T17:01:21.856Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Waiting for selector `h1` failed: Waiting failed: 10000ms exceeded
2023-05-30T17:01:21.859Z {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-30T17:01:30.809Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:02:20.807Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120679,"retryHistogram":[]}
2023-05-30T17:02:25.439Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-30T17:02:30.810Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:03:20.808Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180679,"retryHistogram":[]}
2023-05-30T17:03:28.934Z WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-30T17:03:30.812Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:04:20.808Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240680,"retryHistogram":[]}
2023-05-30T17:04:30.814Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:04:32.575Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-30T17:04:32.628Z INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-30T17:04:32.912Z INFO PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60091,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60091,"requestsTotal":1,"crawlerRuntimeMillis":252783}
2023-05-30T17:04:32.914Z INFO PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-30T17:04:32.916Z Crawler finished.
2023-05-30T17:04:32.949Z INFO Actor finished successfully (exit code 0)
OK. I think I will have to ping Apify on this one. I don't have enough info to help you right away. I will let you know when they get back to me.
Thanks Guillim!
Don't hesitate if you want me to do more tests. Perhaps Apify can share our runs with you so you can inspect the data?
Best regards,
Still working on it (I am not letting you down 😅)
Apify and I found out that leboncoin was blocking your actor, and we don't really know why my runs are not blocked. So I improved the actor's anti-scraping bypass capabilities to give you a better chance of succeeding. The new version of the actor was just released. Could you give it a shot? (Be sure you are on the latest version of the actor before testing.)
Let me know if it works for you !
See attached what leboncoin triggers on your side
Hello Guillim,
It works, thank you! I managed to get results with a 100% success rate during my 2 runs.
But I have some questions for you:
- As I understand it, to get the details of an ad, I first need to scrape a list page (around 38 results per page) and then scrape the individual URL of each of the 38 ads. Am I correct?
- What is your recommendation regarding the usage of your actor? In particular, do you recommend a crawl frequency? Do you recommend crawling multiple pages during the same call, or making multiple calls?
- In terms of time and costs: scraping a single page took 55s ($0.045) and scraping 3 other pages took 6m25s ($0.394). Where does the difference come from? Are the duration and price of a run always so variable?
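The two-step flow described in the first question (a list page, then one request per ad) can be sketched as a small labelled queue, similar in spirit to the `userData.label` routing in the custom function earlier in this thread. This is only an illustration under stated assumptions: `crawlTwoSteps` and the `handlers.search` / `handlers.product` callbacks are hypothetical stand-ins for the real page extraction, not the actor's code.

```javascript
// Two-step crawl: a "search" request yields ad URLs, each "product" request
// yields the details of one ad. The handlers are hypothetical callbacks.
async function crawlTwoSteps(startUrls, handlers) {
    const queue = startUrls.map((url) => ({ url, label: 'search' }));
    const seen = new Set(startUrls);
    const results = [];
    let requests = 0;

    while (queue.length > 0) {
        const request = queue.shift();
        requests += 1;
        if (request.label === 'search') {
            // Step 1: collect ad links and enqueue each one once.
            const links = await handlers.search(request.url);
            for (const url of links) {
                if (!seen.has(url)) {
                    seen.add(url);
                    queue.push({ url, label: 'product' });
                }
            }
        } else {
            // Step 2: extract the details of a single ad.
            results.push(await handlers.product(request.url));
        }
    }
    return { results, requests };
}
```

With about 38 ads on a list page, one search in this model means roughly 39 page loads (1 list page plus 38 ad pages), which is why the per-request cost matters so much.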
Best regards,
Good to hear that!
- Yes, exactly.
- To avoid a ban, the best is to avoid parallel crawling. But it's really up to you: you can test and see.
- It depends on how many tries the crawler needs before it bypasses leboncoin's anti-scraping protection. The better leboncoin's protection is, the longer it may take. Enjoy!
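The last two answers can be illustrated with a small runner that processes URLs strictly one at a time and counts how many attempts each one needed. It is a sketch, not the actor's implementation: `runSequentially`, `task`, `delayMs` and `maxRetries` are hypothetical names, and the default of 3 retries is an assumption that mirrors the `retryCount` values (1 to 3) seen in the logs above.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task(url)` for each URL one at a time (no parallel crawling), pausing
// between requests and retrying a failed URL up to `maxRetries` times.
async function runSequentially(urls, task, { delayMs = 1000, maxRetries = 3 } = {}) {
    const report = [];
    for (const url of urls) {
        let attempts = 0;
        let ok = false;
        let result = null;
        while (!ok && attempts <= maxRetries) {
            attempts += 1;
            try {
                result = await task(url);
                ok = true;
            } catch (err) {
                // Blocked or timed out: wait before the next attempt.
                await sleep(delayMs);
            }
        }
        // Every attempt costs run time, so `attempts` drives the run's price.
        report.push({ url, ok, attempts, result });
        await sleep(delayMs);
    }
    return report;
}
```

A page that only loads on its third attempt takes roughly three times as long as one that loads immediately, which is one way the same kind of run can vary so much in duration and price.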
Hello Guillim,
Thanks for those details.
I will do some more tests, but I'm afraid the costs will be too high... My current calculations come to around $200/day just to retrieve the new ads (estimated at around 2000/day). Then we need to add the cost of all the searches (around $30/day for a search every 5 minutes), and the cost of refreshing old ads to see if they are still online (more than $400/day if we check them every 2 weeks).
Do you see any way to decrease these costs by crawling multiple ads in a row in the same run? Or any other means you can think of?
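For reference, the estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: the per-page cost of $0.10 is simply the figure implied by $200 for 2000 ads (measured runs in this thread ranged from $0.045 to $0.394), and the backlog of 56,000 stored ads is a hypothetical number chosen only to illustrate the refresh line. Amounts are kept in cents to avoid floating-point rounding.

```javascript
// All inputs are assumptions taken from, or implied by, the estimate above.
const CENTS_PER_PAGE = 10;       // assumed: $200 / 2000 ads = $0.10 per page
const NEW_ADS_PER_DAY = 2000;    // from the estimate
const SEARCH_INTERVAL_MIN = 5;   // one search every 5 minutes
const STORED_ADS = 56000;        // hypothetical backlog size, not stated in the thread
const REFRESH_PERIOD_DAYS = 14;  // re-check every 2 weeks

function dailyCostCents() {
  const searchesPerDay = (24 * 60) / SEARCH_INTERVAL_MIN;   // 288 searches
  const refreshesPerDay = STORED_ADS / REFRESH_PERIOD_DAYS; // 4000 ads
  return {
    newAds: NEW_ADS_PER_DAY * CENTS_PER_PAGE,
    searches: searchesPerDay * CENTS_PER_PAGE,
    refresh: refreshesPerDay * CENTS_PER_PAGE,
  };
}

const cost = dailyCostCents();
const totalCents = cost.newAds + cost.searches + cost.refresh;
console.log(cost, `total: $${(totalCents / 100).toFixed(2)}/day`);
```

With these inputs the three lines come to $200, $28.80 and $400 per day. Changing `CENTS_PER_PAGE` shows how sensitive the total is to the per-run cost, which varies a lot from run to run.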
Best regards,
I agree with you: if your calculation is correct, that's way too much.
There are different tricks I had to set up to bypass Leboncoin's anti-scraping protection. One of them is running a real browser instead of just a simulation, and it costs more. But it's one of the ways to make sure it works.
Depending on how you balance "make sure it works" against "reduce cost", I could remove this feature.
Honestly, it's really hard to find a cheap solution when fighting anti-scraping. If a website tells you otherwise, it's probably a scam 😅
Hello Guillim,
Thanks for your concern. I understand that LBC's protections are quite advanced. If running Selenium (or an equivalent) is the only way to bypass them, so be it. I understand that LBC also has a private API for searches: I can see references to api.leboncoin.fr in the DOM. Would it be possible to make calls to the API from inside the robot's web browser? If so, perhaps we could get more data from the same session and reduce costs?
I agree that scraping LBC is not cheap... The only way to get a cheaper solution would be to share the data across multiple clients, so that it is crawled only once and retrieved several times. But I am not aware of any such service.
Regards,
Hi,
There are not many options for bypassing anti-scraping protection; they require highly skilled scripts and heavy crawlers. The API is also protected, from what I could read here: https://github.com/tdurieux/leboncoin-api/blob/master/README.md
One last option would be to store the ad data and fetch only the new ads as they appear in search results. That requires some development on your side, I guess.
But yes, you would significantly reduce costs by sharing the scraped data between your customers. That's what most websites do, even though they say the opposite. There is no service doing exactly what you want, so you could develop your own solution, or maybe try some combination of Zapier and Xano.
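The "store the ads and fetch only the new ones" idea mentioned above could look roughly like this. It is a minimal sketch, assuming ad URLs follow the pattern seen earlier in this thread (a numeric ID followed by `.htm`) and using an in-memory `Set` as a stand-in for a real database.

```javascript
// Extract the numeric ad ID from an ad URL such as
// https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
function adIdFromUrl(url) {
  const match = /\/(\d+)\.htm/.exec(url);
  return match ? match[1] : null;
}

// `seenIds` is a Set standing in for your own storage (an assumption).
// Returns only the URLs whose ad ID has not been seen before, and records them.
function selectNewAds(searchResultUrls, seenIds) {
  const fresh = [];
  for (const url of searchResultUrls) {
    const id = adIdFromUrl(url);
    if (id === null || seenIds.has(id)) continue; // skip unknown formats and known ads
    seenIds.add(id);
    fresh.push(url);
  }
  return fresh;
}
```

Only the URLs returned by `selectNewAds` would then need a paid detail-page crawl; every ad already in storage is skipped.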
Hi,
Yes, the API must be heavily protected because it exposes already-formatted data, very quickly and in batches... the dream! But I don't know how the protection works... they can't rely on fingerprints or user agents. Perhaps we will try to limit the ads we crawl to reduce the costs. But this is not ideal.
On our side, we have already developed a solution to store ad data and avoid multiple crawls. We have other website sources and we use IP proxies to avoid detection. It works well, except for Leboncoin.fr and Seloger.com because of DataDome. By the way, would it be possible to crawl Seloger.com with your actor?
We used Apify for Leboncoin only because we have neither the knowledge nor the internal resources to bypass DataDome. But it's more expensive than expected... it's getting close to the cost of an external contractor. You don't do freelance work by any chance? ;)
Best regards,
Hi
No, sorry, I don't do freelance work anymore. Apify actors are what is left of my side projects. I cannot dedicate more time than some actor maintenance or development.
If you assure me you will be a paying client for this actor, I can spend a few days developing "seloger" and releasing it on Apify, at the same rates as Leboncoin.
To be honest, you will never find any contractor or any technology that can really compete with Apify actor pricing when it comes to real crawlers. That's only my opinion of course, and definitely biased.
If you find a better deal, let me know, I would use it as well !
Hi,
Noted, no freelancing anymore, thanks Guillim. I'd like to test the LBC actor at real scale and see the final costs before committing to Seloger. If we are at 300€/month/actor, we need to make sure it's profitable. But it's good to know you can add it!
I think Apify has good prices. It's just that we also have proxies and servers internally, so we pay for 2 services instead of one. Paying around $50/month for each actor/marketplace on Apify would be too expensive. For a big marketplace like Leboncoin it's possible, though. I saw other similar services like RapidAPI, but never tested them...
One last question before I start implementing a solution with your actor. Is it possible to retrieve the whole HTML through the API and do the extraction on our side? I think it could lower the processing time and therefore the costs. Something like this:
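async function pageFunction(context) {
    // Assuming context.page is the Puppeteer page object, page.content()
    // returns the full HTML of the page as a string.
    const response = {
        url: context.request.url,
        html: await context.page.content(),
        userData: context.request.userData,
    };
    return response;
}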
Best regards,
Sure, it would be possible, but it would have no effect on the costs. JS execution at this stage is incredibly fast. What is really expensive is emulating Chrome when starting an actor.
Yes, you are right, the processing time in JS won't be very different. But regarding the costs, my experience is that if we can retrieve and store the HTML on our side, we can fix any scraping issue and re-run the extraction without having to do a new run.
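The "store the HTML and re-run the extraction" workflow described above might be sketched like this. Everything here is illustrative: the `Map` stands in for real storage, and `extractTitle` is a deliberately trivial extractor that only reads the `<title>` tag. A real one would target the ad fields needed, probably with a proper HTML parser.

```javascript
// In-memory stand-in for your own storage (database, object store...); an assumption.
const htmlStore = new Map();

function saveSnapshot(url, html) {
  htmlStore.set(url, { html, fetchedAt: new Date().toISOString() });
}

// Illustrative extractor only: reads the <title> tag with a regular expression.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Re-run any extractor over every stored page, without a new crawl.
function reextractAll(extractor) {
  const results = [];
  for (const [url, { html }] of htmlStore) {
    results.push({ url, data: extractor(html) });
  }
  return results;
}
```

If an extractor turns out to be wrong, fixing the function and calling `reextractAll` again costs nothing on the Apify side, since the stored pages are reused.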
I ran some tests again to extract the whole HTML, but I didn't get any results (see attachment). Did you make any changes? Looking at the logs, it seems to be the anti-scraping blocking again?
2023-06-16T06:46:17.414Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: net::ERR_TOO_MANY_RETRIES at https://www.leboncoin.fr/recherche?category=9&locations=r_12&owner_type=private&real_estate_type=1%2C2
2023-06-16T06:46:17.415Z     at navigate (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:111:23)
I didn’t change anything. Might be some random blocking that the crawler couldn’t bypass in 15 tries. Happens sometimes.
It worked after 4 tries (see attachment). But it took a total of 37 minutes and $1.30 to get 38 ads (without the detail pages). I'll continue integrating the actor to test the solution for 1 month and see the real cost. But if you can find ways to reduce the time and cost, it would be greatly appreciated!
Ok, I'll check it out.
- 18 monthly users
- 6 stars
- 98.3% runs succeeded
- Created in Oct 2021
- Modified 5 months ago