Doctolib

3 days trial then $19.00/month - No credit card required now

anchor/doctolib
Scraping Doctolib is now super easy and cheap! Extract phone numbers, names, contact details, opening hours, images and addresses of medics, doctors, hospitals... Best part: you can even customize what info to extract from Doctolib!

Scraping not complete

Closed

serjio opened this issue
a month ago

Hi! I used your actor from Apify to scrape doctors from Doctolib.fr but ran into a problem: with a browser, this search (https://www.doctolib.fr/medecin-generaliste/france?language=16) returns 81 results, but the scraper returns only 20 (the log says 21, but the first string is empty). See the log attached. Could you please suggest what could be the source of the problem? Or is it due to the site's protection from scrapers?

serjio

a month ago

After the fix, the scraper found 30 pages but saved 0 results (timeout error); see the attached log.

guillim (anchor)

a month ago

Thanks for your issue here :)

There is one thing you might try: reset the "pageFunction" to its default value, and let me know if this fixes it. I think the cause is that you were updated to version 0.5 of the Actor but it kept your last INPUT. Since I made changes to the pageFunction, it needs to be updated as well. Or, if you prefer, here is the JSON version you can use as the INPUT:

{
  "hideSearchPages": true,
  "maxPagesPerCrawl": 90,
  "pageFunction": "async function pageFunction(context) {\n\n  let data = {}\n  let userData = context.request.userData\n  data.url = context.request.url\n  data.label = userData.label\n\n  if(userData && userData.label === 'doctor'){\n    data.nom = await context.page.locator('#main-content h1').innerText({timeout:6000})\n    data.tarif = await context.innerTextwrapper(context,'#payment_means')\n    data.horaire_contact = await context.innerTextwrapper(context,'#openings_and_contact')\n    data.description = await context.innerTextwrapper(context,'.dl-profile-bio')\n    data.specialite = await context.innerTextwrapper(context,'.dl-profile-header-speciality')\n    data.expertise = await context.innerTextwrapper(context,'#skills')\n    try{\n      data.phones = await context.getPhones(data.horaire_contact)\n    }catch(e){\n      context.log.info('Phones not found',e);\n    }\n    try{\n      data.image = await context.page.locator('.dl-profile img').first().getAttribute('src',{timeout:2000})\n      if(data.image.startsWith('/')){ data.image = 'https:' + data.image}\n    }catch(e){\n      context.log.info('Image not found',e);\n    }\n\n  }else{\n    context.log.info('we are not on a doctor page: so a search or pagination page.');\n    userData.label = 'doctor';\n    const elements = context.page.locator('.search-result-card a[href]');\n    const links = await elements.evaluateAll(elems => elems.map(elem => elem.getAttribute('href')));\n    let extension = 'fr'\n    if(context.request.url.includes('doctolib.de')){ extension = 'de' }\n    if(context.request.url.includes('doctolib.it')){ extension = 'it' }\n    // for...of (not forEach) so each enqueue is awaited before the function returns\n    for (let link of links) {\n      if(link.startsWith('/')){ link = `https://www.doctolib.${extension}${link}` }\n      await context.enqueueRequest(link, userData , false);\n    }\n\n  }\n  context.log.info(`ending this page now`);\n  delete data.label\n  return data;\n}\n",
  "startUrls": [
    { "url": "https://www.doctolib.de/rheumatologie/deutschland?page=2" }
  ]
}
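
For anyone adapting the link-handling part of the pageFunction in the INPUT above, here is a minimal sketch of the same idea in plain JavaScript: turn the hrefs found on a search page into absolute URLs, then await each enqueue call in turn. The helper names toAbsoluteLinks and enqueueAll are hypothetical and not part of the Actor, and the enqueue argument only stands in for whatever function the Actor exposes (context.enqueueRequest in the INPUT above). A forEach with an async callback returns before its callbacks finish, which is why the sketch uses for...of.

```javascript
// Hypothetical helper (not part of the Actor): resolve hrefs against the URL
// of the page they were found on. The standard URL constructor keeps the
// page's own domain (.fr, .de, .it) for relative links and leaves absolute
// links unchanged.
function toAbsoluteLinks(hrefs, pageUrl) {
  return hrefs
    .filter((href) => typeof href === 'string' && href.length > 0)
    .map((href) => new URL(href, pageUrl).href);
}

// Hypothetical helper: await each enqueue call one after another so the
// caller knows every link has been queued when the returned promise resolves.
// `enqueue` is an assumption: any async function that accepts a URL string.
async function enqueueAll(links, enqueue) {
  let count = 0;
  for (const link of links) {
    await enqueue(link);
    count += 1;
  }
  return count;
}
```

Inside a page function this could be called as `await enqueueAll(toAbsoluteLinks(links, context.request.url), link => context.enqueueRequest(link, userData, false))`, assuming the Actor's enqueueRequest keeps the signature used in the INPUT above.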

guillim (anchor)

22 days ago

Guessing this worked, so I'm closing the issue. Feel free to reopen it if necessary.

Developer
Maintained by Community
Actor metrics
  • 10 monthly users
  • 4 stars
  • 89.8% runs succeeded
  • 7.6 hours response time
  • Created in Jul 2022
  • Modified 18 days ago