Sahibinden.com Web Scraper Documentation

Overview

This document provides comprehensive documentation for the Sahibinden.com web scraper built with Apify. The scraper is designed to extract car listings from Sahibinden.com while bypassing anti-bot measures, and store the data in a structured format in BaseRow to power an AI chatbot for used car price estimation.

Architecture

The solution consists of the following components:

  1. Apify Actor - A cloud-based web scraper that handles the extraction of car listings from Sahibinden.com
  2. Playwright with Stealth - Browser automation with anti-detection capabilities to bypass Cloudflare protection
  3. Residential Proxies - IP rotation to avoid blocking and simulate real user traffic
  4. BaseRow Integration - Data storage in a structured format for easy querying by the AI chatbot
  5. Scheduler - Automated regular scraping to keep data fresh

Technical Implementation

1. Anti-Bot Bypassing Techniques

The scraper implements several techniques to bypass Sahibinden.com's Cloudflare protection; a minimal launch sketch follows the list:

  • Residential Proxies: Uses Apify's RESIDENTIAL proxy group with Turkey country code to appear as legitimate Turkish users
  • Browser Fingerprinting: Implements realistic browser fingerprinting to avoid detection
  • Stealth Plugin: Uses playwright-stealth to modify browser behavior and evade detection
  • Human-like Behavior: Adds random delays, realistic mouse movements, and proper request headers
  • Non-headless Mode: Runs the browser in non-headless mode for a better Cloudflare bypass rate
  • Cookie Management: Sets and maintains cookies to simulate returning users
  • User-Agent Rotation: Rotates between realistic user agents for each request
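
A minimal launch sketch, assuming the Apify SDK v3 together with playwright-extra and the puppeteer-extra-plugin-stealth plugin (the Actor's actual launch code may differ); the user agent, delay, and start URL are illustrative:

import { Actor } from 'apify';
import { chromium } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// Register the stealth evasions with Playwright.
chromium.use(stealthPlugin());

await Actor.init();

// Residential proxies exiting in Turkey, as described above.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'TR',
});
const proxyUrl = new URL(await proxyConfiguration.newUrl());

const browser = await chromium.launch({
    headless: false, // non-headless mode improves the Cloudflare pass rate
    proxy: {
        server: `${proxyUrl.protocol}//${proxyUrl.host}`,
        username: decodeURIComponent(proxyUrl.username),
        password: decodeURIComponent(proxyUrl.password),
    },
});

const context = await browser.newContext({
    // In the real scraper the user agent is rotated per request.
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    locale: 'tr-TR',
});
const page = await context.newPage();

// Human-like pause before the first navigation.
await page.waitForTimeout(1000 + Math.random() * 2000);
await page.goto('https://www.sahibinden.com/kategori/vasita', { waitUntil: 'domcontentloaded' });
// ... crawl, then close the browser and call Actor.exit().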

2. Data Extraction

The scraper extracts comprehensive data from car listings including:

  • Basic information (ID, URL, title, price, location)
  • Vehicle specifications (make, model, year, fuel type, etc.)
  • Detailed attributes (interior/exterior features, safety features, etc.)
  • Technical specifications
  • Images
  • Seller information
  • Description and condition details

The data extraction is implemented in two main functions, sketched after the list:

  • handleCategoryPage() - Extracts listing URLs from category pages and handles pagination
  • handleDetailPage() - Extracts comprehensive data from individual car listing pages
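
Neither function is reproduced in full here; the sketch below only illustrates the split between them, assuming a Crawlee-style crawler context. The CSS selectors and field list are placeholders, not the Actor's real ones.

import { Actor } from 'apify';

// Category pages: collect detail-page URLs and follow pagination.
async function handleCategoryPage({ page, crawler }) {
    const listingUrls = await page.$$eval('a.listing-title', (links) => links.map((a) => a.href));
    await crawler.addRequests(listingUrls.map((url) => ({ url, label: 'DETAIL' })));

    // Queue the next results page if one exists.
    const nextHref = await page.$eval('a.next-page', (a) => a.href).catch(() => null);
    if (nextHref) await crawler.addRequests([{ url: nextHref, label: 'CATEGORY' }]);
}

// Detail pages: extract one structured record per listing.
async function handleDetailPage({ page, request }) {
    const record = {
        listing_id: await page.$eval('.listing-id', (el) => el.textContent.trim()).catch(() => null),
        url: request.url,
        title: await page.$eval('h1', (el) => el.textContent.trim()).catch(() => null),
        price: await page.$eval('.price', (el) => el.textContent.trim()).catch(() => null),
        scraped_at: new Date().toISOString(),
    };
    await Actor.pushData(record);
}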

3. BaseRow Integration

The BaseRow integration provides the following; a minimal API call is sketched after the list:

  • Automatic storage of scraped data in a structured format
  • Duplicate detection and handling
  • Field mapping between scraped data and BaseRow table structure
  • Batch processing for efficient data storage
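
A single record can be written with one authenticated POST. The sketch below follows the public BaseRow REST endpoint for creating rows; the helper name, field subset, and error handling are illustrative.

// Push one scraped record into BaseRow. Requires Node 18+ for global fetch.
async function saveToBaseRow(record, { tableId, apiToken }) {
    const url = `https://api.baserow.io/api/database/rows/table/${tableId}/?user_field_names=true`;
    const response = await fetch(url, {
        method: 'POST',
        headers: {
            Authorization: `Token ${apiToken}`,
            'Content-Type': 'application/json',
        },
        // Keys must match the field names of the table described below.
        body: JSON.stringify({
            listing_id: record.listing_id,
            url: record.url,
            title: record.title,
            price: record.price,
            scraped_at: new Date().toISOString(),
        }),
    });
    if (!response.ok) {
        throw new Error(`BaseRow insert failed: ${response.status} ${await response.text()}`);
    }
    // The real integration first looks up listing_id to avoid duplicate rows.
    return response.json();
}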

4. Automation

The scraper is configured for automatic scheduling with the following; the retry and rate-limiting options are sketched after the list:

  • Configurable run frequency (hourly, daily, weekly)
  • Performance monitoring
  • Error handling and retry mechanisms
  • Rate limiting to avoid overloading the target site
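
A sketch of these options, assuming a Crawlee PlaywrightCrawler; the concrete values are illustrative:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 1,          // one page at a time against the target site
    maxRequestsPerMinute: 10,   // rate limiting to avoid overloading Sahibinden.com
    maxRequestRetries: 3,       // failed requests are retried automatically
    requestHandlerTimeoutSecs: 120,
    requestHandler: async (ctx) => {
        // Route to handleCategoryPage() / handleDetailPage() based on ctx.request.label.
    },
});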

Setup Instructions

Prerequisites

  • Apify account with access to residential proxies
  • BaseRow account with a table set up for car listings
  • Node.js 16+ for local development (optional)

Apify Actor Setup

  1. Create a new Actor in your Apify account
  2. Upload the code from this repository to the Actor
  3. Configure Actor settings:
    • Memory: Minimum 4 GB recommended
    • Timeout: At least 30 minutes
    • Environment variables: None required

BaseRow Setup

  1. Create a new table in BaseRow with the following structure:
Field Name           Type          Description
listing_id           Text          Unique identifier from Sahibinden.com
url                  URL           Full URL of the listing
title                Text          Title of the car listing
price                Number        Numeric price value
price_currency       Text          Currency of the price (TL, EUR)
location             Text          Location information
description          Long text     Full description text
make                 Text          Car make/brand
model                Text          Car model
series               Text          Car series
year                 Text          Manufacturing year
fuel_type            Text          Type of fuel
transmission         Text          Transmission type
mileage              Text          Kilometer reading
body_type            Text          Body type
engine_power         Text          Engine power
engine_capacity      Text          Engine capacity
drive_type           Text          Drive type
doors                Text          Number of doors
color                Text          Car color
warranty             Text          Warranty information
damage_record        Text          Damage record status
plate_nationality    Text          Plate/nationality information
seller_type          Text          Seller type (dealer, individual)
trade_in             Text          Trade-in availability
condition            Text          Car condition
images               Long text     JSON string of image URLs
attributes           Long text     JSON string of attributes
technical_specs      Long text     JSON string of technical specs
scraped_at           Date & Time   When the data was scraped
last_updated         Date & Time   When the record was last updated
  2. Get your BaseRow API credentials:
    • API Token
    • Table ID
    • Database ID

Running the Scraper

Configuration Options

The Actor accepts the following input parameters:

{
  "startUrls": ["https://www.sahibinden.com/kategori/vasita"],
  "maxConcurrency": 1,
  "maxRequestsPerCrawl": 1000,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "countryCode": "TR"
  },
  "baseRowApiToken": "YOUR_BASEROW_API_TOKEN",
  "baseRowTableId": "YOUR_BASEROW_TABLE_ID",
  "baseRowDatabaseId": "YOUR_BASEROW_DATABASE_ID",
  "scheduleInterval": "daily",
  "startTime": "02:00"
}
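
Inside the Actor, this input would typically be read with the Apify SDK; a minimal sketch with illustrative defaults:

import { Actor } from 'apify';

await Actor.init();

const {
    startUrls = ['https://www.sahibinden.com/kategori/vasita'],
    maxConcurrency = 1,
    maxRequestsPerCrawl = 1000,
    proxyConfiguration,
    baseRowApiToken,
    baseRowTableId,
} = (await Actor.getInput()) ?? {};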

Scheduling

To set up automatic scheduling:

  1. Create a new Schedule in your Apify account
  2. Set the desired frequency (recommended: daily)
  3. Configure the Actor input as shown above
  4. Enable the Schedule

AI Chatbot Integration

The data stored in BaseRow can be used to power an AI chatbot for used car price estimation. The chatbot can:

  1. Take user input describing a car (e.g., "2017 Passat 3 parça boya 150bin km", i.e. a 2017 Passat with three repainted panels and 150,000 km)
  2. Query the BaseRow table for comparable listings
  3. Calculate an estimated price range based on similar vehicles
  4. Adjust the estimate based on factors like damage status, mileage, and year

Query Examples

To find comparable listings in BaseRow, run a query that is conceptually equivalent to:

SELECT * FROM cars
WHERE make = 'Volkswagen'
AND model = 'Passat'
AND year = '2017'
AND mileage BETWEEN 130000 AND 170000
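
BaseRow itself is queried over its REST API rather than SQL, so a chatbot backend would fetch rows and filter them. The sketch below uses client-side filtering and a median-based baseline estimate; the helper name, page size, and mileage window are assumptions, not the Actor's code (BaseRow also offers server-side filter parameters; see its API docs).

// Estimate a price from comparable listings stored in BaseRow. Requires Node 18+.
async function estimatePrice({ make, model, year, mileage }, { tableId, apiToken }) {
    const url = `https://api.baserow.io/api/database/rows/table/${tableId}/?user_field_names=true&size=200`;
    const res = await fetch(url, { headers: { Authorization: `Token ${apiToken}` } });
    const { results } = await res.json();

    // Comparable listings: same make/model/year, mileage within +/- 20,000 km.
    const comparables = results.filter((row) =>
        row.make === make
        && row.model === model
        && row.year === String(year)
        && Math.abs(parseInt(row.mileage, 10) - mileage) <= 20000);

    if (comparables.length === 0) return null;

    // Median price of the comparable set as the baseline estimate.
    const prices = comparables.map((row) => Number(row.price)).sort((a, b) => a - b);
    return prices[Math.floor(prices.length / 2)];
}

For the example above, estimatePrice({ make: 'Volkswagen', model: 'Passat', year: 2017, mileage: 150000 }, credentials) returns the median price of the matching rows, which the chatbot can then adjust for damage record, mileage, and condition.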

Troubleshooting

Common Issues

  1. 403 Forbidden Errors

    • Check that residential proxies are properly configured
    • Verify that the stealth plugin is working correctly
    • Try reducing concurrency to 1
    • Increase delays between requests
  2. Data Extraction Issues

    • Check if Sahibinden.com has changed their page structure
    • Update the selectors in the data extraction functions
    • Verify that the page is fully loaded before extraction
  3. BaseRow Integration Issues

    • Verify API credentials are correct
    • Check that the table structure matches the expected fields
    • Look for API rate limiting issues

Maintenance

To keep the scraper running smoothly:

  1. Regular Monitoring: Check the Actor runs for any errors or performance issues
  2. Updates: Update the code if Sahibinden.com changes their website structure
  3. Proxy Management: Ensure you have sufficient residential proxy credits
  4. Data Cleaning: Periodically clean old or irrelevant listings from BaseRow

Limitations

  • The scraper is designed specifically for Sahibinden.com and may not work on other websites
  • Cloudflare protection methods may change, requiring updates to the bypassing techniques
  • Very high volume scraping may still trigger rate limiting despite all precautions
  • Some listings may have incomplete data if the seller didn't provide all information
