Sahibinden.com Web Scraper Documentation
Overview
This document provides comprehensive documentation for the Sahibinden.com web scraper built with Apify. The scraper is designed to extract car listings from Sahibinden.com while bypassing anti-bot measures, and store the data in a structured format in BaseRow to power an AI chatbot for used car price estimation.
Architecture
The solution consists of the following components:
- Apify Actor - A cloud-based web scraper that handles the extraction of car listings from Sahibinden.com
- Playwright with Stealth - Browser automation with anti-detection capabilities to bypass Cloudflare protection
- Residential Proxies - IP rotation to avoid blocking and simulate real user traffic
- BaseRow Integration - Data storage in a structured format for easy querying by the AI chatbot
- Scheduler - Automated regular scraping to keep data fresh
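To make the data flow concrete, here is a minimal sketch of how these components could be wired together with the Apify SDK and Crawlee (a PlaywrightCrawler is assumed, since the documentation references Playwright). This is illustrative, not the Actor's published source; the ./routes.js module and its two handlers are hypothetical names taken from the Data Extraction section below.

```javascript
// main.js - illustrative wiring of proxy, crawler, and route handlers (not the Actor's actual source)
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
// Hypothetical module holding the handlers described under "Data Extraction" below.
import { handleCategoryPage, handleDetailPage } from './routes.js';

await Actor.init();
const input = (await Actor.getInput()) ?? {};

// Residential proxies with Turkish exit nodes (see "Anti-Bot Bypassing Techniques").
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'TR',
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxConcurrency: input.maxConcurrency ?? 1,
    maxRequestsPerCrawl: input.maxRequestsPerCrawl ?? 1000,
    launchContext: { launchOptions: { headless: false } }, // headed mode helps against Cloudflare
    async requestHandler(ctx) {
        if (ctx.request.label === 'DETAIL') await handleDetailPage(ctx);
        else await handleCategoryPage(ctx);
    },
});

await crawler.run(input.startUrls ?? ['https://www.sahibinden.com/kategori/vasita']);
await Actor.exit();
```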
Technical Implementation
1. Anti-Bot Bypassing Techniques
The scraper implements several techniques to bypass Sahibinden.com's Cloudflare protection:
- Residential Proxies: Uses Apify's RESIDENTIAL proxy group with Turkey country code to appear as legitimate Turkish users
- Browser Fingerprinting: Implements realistic browser fingerprinting to avoid detection
- Stealth Plugin: Uses playwright-stealth to modify browser behavior and evade detection
- Human-like Behavior: Adds random delays, realistic mouse movements, and proper request headers
- Non-headless Mode: Runs the browser in non-headless mode for better Cloudflare bypass
- Cookie Management: Sets and maintains cookies to simulate returning users
- User-Agent Rotation: Rotates between realistic user agents for each request
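A hedged sketch of how that combination might look in code, assuming Crawlee together with playwright-extra and the puppeteer-extra-plugin-stealth package (the usual stealth plugin in the Node.js ecosystem). The exact options this Actor uses are not shown in this document, so treat every value below as an example.

```javascript
// Sketch of the anti-detection setup, assuming Crawlee + playwright-extra.
import { PlaywrightCrawler } from 'crawlee';
import { chromium } from 'playwright-extra';
import stealth from 'puppeteer-extra-plugin-stealth';

chromium.use(stealth()); // the stealth plugin patches common headless-browser "tells"

const randomDelay = (min, max) =>
    new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: chromium,                  // playwright-extra build with stealth applied
        launchOptions: { headless: false },  // non-headless mode for a better Cloudflare pass rate
    },
    browserPoolOptions: { useFingerprints: true }, // realistic, rotating browser fingerprints
    preNavigationHooks: [
        async ({ page }) => {
            // Turkish locale headers and a human-like pause before each navigation.
            await page.setExtraHTTPHeaders({ 'Accept-Language': 'tr-TR,tr;q=0.9,en;q=0.8' });
            await randomDelay(1500, 4000);
        },
    ],
    async requestHandler({ page }) {
        // ... extraction happens here (see "Data Extraction").
    },
});
```

Crawlee's fingerprint pool also varies the user agent per browser session, which is one way to get the user-agent rotation described above.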
2. Data Extraction
The scraper extracts comprehensive data from car listings including:
- Basic information (ID, URL, title, price, location)
- Vehicle specifications (make, model, year, fuel type, etc.)
- Detailed attributes (interior/exterior features, safety features, etc.)
- Technical specifications
- Images
- Seller information
- Description and condition details
The data extraction is implemented in two main functions:
- handleCategoryPage() - Extracts listing URLs from category pages and handles pagination
- handleDetailPage() - Extracts comprehensive data from individual car listing pages
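A hedged skeleton of these two handlers, assuming Crawlee's Playwright crawling context. Every CSS selector below is a placeholder, since Sahibinden.com's real markup is not reproduced here and changes over time.

```javascript
// Illustrative skeletons of the two extraction functions (selectors are placeholders).
import { Actor } from 'apify';

export async function handleCategoryPage({ enqueueLinks }) {
    // Queue every listing found on the category page...
    await enqueueLinks({ selector: 'a.listing-title', label: 'DETAIL' }); // placeholder selector
    // ...and follow pagination so the whole category gets covered.
    await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });   // placeholder selector
}

export async function handleDetailPage({ page, request }) {
    // Read the visible fields of a single car listing.
    const listing = await page.evaluate(() => {
        const text = (sel) => document.querySelector(sel)?.textContent.trim() ?? null;
        return {
            title: text('h1'),
            price: text('.price'),       // placeholder selectors throughout
            location: text('.address'),
            description: text('#description'),
            listing_id: text('.ad-number'),
        };
    });
    // Push to the default dataset; the BaseRow client (next section) can consume it from there.
    await Actor.pushData({ url: request.url, scraped_at: new Date().toISOString(), ...listing });
}
```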
3. BaseRow Integration
The BaseRow integration provides:
- Automatic storage of scraped data in a structured format
- Duplicate detection and handling
- Field mapping between scraped data and BaseRow table structure
- Batch processing for efficient data storage
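A minimal sketch of the upsert logic, assuming BaseRow's hosted REST API (api.baserow.io) with user_field_names=true; the endpoint and filter parameter shapes follow BaseRow's public documentation but should be verified against your instance. Field mapping and batching are omitted for brevity.

```javascript
// Hedged sketch of a BaseRow upsert with duplicate detection by listing_id.
const BASE = 'https://api.baserow.io/api/database/rows/table';

const headers = (token) => ({
    Authorization: `Token ${token}`,
    'Content-Type': 'application/json',
});

// Duplicate detection: look up an existing row by its Sahibinden listing_id.
async function findRow(token, tableId, listingId) {
    const url = `${BASE}/${tableId}/?user_field_names=true&filter__listing_id__equal=${listingId}`;
    const res = await fetch(url, { headers: headers(token) });
    const data = await res.json();
    return data.results?.[0] ?? null;
}

// Create a new row, or update the existing one if the listing was seen before.
async function upsertRow(token, tableId, row) {
    const existing = await findRow(token, tableId, row.listing_id);
    const url = existing
        ? `${BASE}/${tableId}/${existing.id}/?user_field_names=true`
        : `${BASE}/${tableId}/?user_field_names=true`;
    const res = await fetch(url, {
        method: existing ? 'PATCH' : 'POST',
        headers: headers(token),
        body: JSON.stringify(existing ? { ...row, last_updated: new Date().toISOString() } : row),
    });
    if (!res.ok) throw new Error(`BaseRow request failed: ${res.status}`);
    return res.json();
}
```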
4. Automation
The scraper is configured for automatic scheduling with:
- Configurable run frequency (hourly, daily, weekly)
- Performance monitoring
- Error handling and retry mechanisms
- Rate limiting to avoid overloading the target site
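As a sketch, the retry and rate-limiting bullets map naturally onto Crawlee crawler options such as the following (option names per Crawlee's documentation; the values are illustrative, not the Actor's actual settings):

```javascript
// Illustrative retry / rate-limiting knobs for a Crawlee crawler.
export const crawlerOptions = {
    maxConcurrency: 1,             // a single browser session keeps the traffic pattern human-like
    maxRequestRetries: 3,          // blocked or failed requests are retried with a fresh session
    maxRequestsPerMinute: 20,      // soft rate limit so the target site is not overloaded
    requestHandlerTimeoutSecs: 120,
};
```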
Setup Instructions
Prerequisites
- Apify account with access to residential proxies
- BaseRow account with a table set up for car listings
- Node.js 16+ for local development (optional)
Apify Actor Setup
- Create a new Actor in your Apify account
- Upload the code from this repository to the Actor
- Configure Actor settings:
- Memory: Minimum 4 GB recommended
- Timeout: At least 30 minutes
- Environment variables: None required
BaseRow Setup
- Create a new table in BaseRow with the following structure (an example row payload matching these fields is shown after this list):
Field Name | Type | Description |
---|---|---|
listing_id | Text | Unique identifier from Sahibinden.com |
url | URL | Full URL of the listing |
title | Text | Title of the car listing |
price | Number | Numeric price value |
price_currency | Text | Currency of the price (TL, EUR) |
location | Text | Location information |
description | Long text | Full description text |
make | Text | Car make/brand |
model | Text | Car model |
series | Text | Car series |
year | Text | Manufacturing year |
fuel_type | Text | Type of fuel |
transmission | Text | Transmission type |
mileage | Text | Kilometer reading |
body_type | Text | Body type |
engine_power | Text | Engine power |
engine_capacity | Text | Engine capacity |
drive_type | Text | Drive type |
doors | Text | Number of doors |
color | Text | Car color |
warranty | Text | Warranty information |
damage_record | Text | Damage record status |
plate_nationality | Text | Plate/nationality information |
seller_type | Text | Seller type (dealer, individual) |
trade_in | Text | Trade-in availability |
condition | Text | Car condition |
images | Long text | JSON string of image URLs |
attributes | Long text | JSON string of attributes |
technical_specs | Long text | JSON string of technical specs |
scraped_at | Date & Time | When the data was scraped |
last_updated | Date & Time | When the record was last updated |
- Get your BaseRow API credentials:
- API Token
- Table ID
- Database ID
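For reference, a hedged example of a row payload whose keys match the table above; the values are invented for illustration, and images, attributes, and technical_specs are stored as JSON strings.

```javascript
// Illustrative row payload - only a subset of fields is shown, values are made up.
const exampleRow = {
    listing_id: '1154329876',
    url: 'https://www.sahibinden.com/ilan/...', // full listing URL
    title: '2017 Volkswagen Passat 1.6 TDI',
    price: 1250000,
    price_currency: 'TL',
    location: 'İstanbul / Kadıköy',
    make: 'Volkswagen',
    model: 'Passat',
    year: '2017',
    mileage: '150000',
    images: JSON.stringify(['https://example.com/photo1.jpg']),
    attributes: JSON.stringify({ sunroof: true, parkSensor: true }),
    technical_specs: JSON.stringify({ engine_capacity: '1598 cc' }),
    scraped_at: new Date().toISOString(),
    last_updated: new Date().toISOString(),
};
```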
Running the Scraper
Configuration Options
The Actor accepts the following input parameters:
```json
{
  "startUrls": ["https://www.sahibinden.com/kategori/vasita"],
  "maxConcurrency": 1,
  "maxRequestsPerCrawl": 1000,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"],
    "countryCode": "TR"
  },
  "baseRowApiToken": "YOUR_BASEROW_API_TOKEN",
  "baseRowTableId": "YOUR_BASEROW_TABLE_ID",
  "baseRowDatabaseId": "YOUR_BASEROW_DATABASE_ID",
  "scheduleInterval": "daily",
  "startTime": "02:00"
}
```
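A short sketch of how the Actor could read this input at runtime, assuming the Apify SDK and that its proxy helper accepts the proxy object in the shape shown above:

```javascript
// Illustrative input handling with the Apify SDK.
import { Actor } from 'apify';

await Actor.init();
const input = (await Actor.getInput()) ?? {};

// The proxy block above can be passed to the SDK's proxy helper.
const proxyConfiguration = await Actor.createProxyConfiguration(input.proxyConfiguration);

// baseRowApiToken / baseRowTableId / baseRowDatabaseId are handed to the BaseRow client,
// while scheduleInterval / startTime only matter for the Schedule described below.
```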
Scheduling
To set up automatic scheduling:
- Create a new Schedule in your Apify account
- Set the desired frequency (recommended: daily)
- Configure the Actor input as shown above
- Enable the Schedule
AI Chatbot Integration
The data stored in BaseRow can be used to power an AI chatbot for used car price estimation. The chatbot can:
- Take user input describing a car (e.g., "2017 Passat 3 parça boya 150bin km" - a 2017 Passat with 3 repainted panels and 150,000 km)
- Query the BaseRow table for comparable listings
- Calculate an estimated price range based on similar vehicles
- Adjust the estimate based on factors like damage status, mileage, and year
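A hedged sketch of the estimation step, assuming the comparable rows have already been fetched from BaseRow (see the query example in the next section). The interquartile band and the per-panel discount are illustrative heuristics, not the chatbot's actual model.

```javascript
// Illustrative estimator: interquartile price band with a simple damage discount.
function estimatePrice(comparables, { damagedPanels = 0 } = {}) {
    const prices = comparables
        .map((row) => Number(row.price))
        .filter((p) => Number.isFinite(p) && p > 0)
        .sort((a, b) => a - b);
    if (prices.length === 0) return null;

    // Quantile by index keeps the band robust against a few outlier listings.
    const q = (p) => prices[Math.floor((prices.length - 1) * p)];

    // Example heuristic: ~2% off per repainted/damaged panel, capped at 15%.
    const discount = 1 - Math.min(damagedPanels * 0.02, 0.15);

    return {
        low: Math.round(q(0.25) * discount),
        high: Math.round(q(0.75) * discount),
        sampleSize: prices.length,
    };
}

// Example: "2017 Passat 3 parça boya 150bin km" -> 3 repainted panels
// const estimate = estimatePrice(comparables, { damagedPanels: 3 });
```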
Query Examples
To find comparable listings in BaseRow, use queries like:
```sql
SELECT * FROM cars
WHERE make = 'Volkswagen'
  AND model = 'Passat'
  AND year = '2017'
  AND mileage BETWEEN 130000 AND 170000
```
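Since BaseRow itself is queried over its REST filter API rather than raw SQL, a hedged equivalent of the query above could look like this; filter names follow BaseRow's list-rows endpoint and should be verified against your instance, and the range filters assume the mileage field holds numeric text.

```javascript
// Hypothetical credentials read from the environment for this standalone example.
const apiToken = process.env.BASEROW_API_TOKEN;
const tableId = process.env.BASEROW_TABLE_ID;

const params = new URLSearchParams({
    user_field_names: 'true',
    filter_type: 'AND',                      // combine all filters with AND
    filter__make__equal: 'Volkswagen',
    filter__model__equal: 'Passat',
    filter__year__equal: '2017',
    filter__mileage__higher_than: '130000',  // mileage is stored as Text in the table above
    filter__mileage__lower_than: '170000',
});

const res = await fetch(
    `https://api.baserow.io/api/database/rows/table/${tableId}/?${params}`,
    { headers: { Authorization: `Token ${apiToken}` } },
);
const { results: comparables } = await res.json();
```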
Troubleshooting
Common Issues
- 403 Forbidden Errors
  - Check that residential proxies are properly configured
  - Verify that the stealth plugin is working correctly
  - Try reducing concurrency to 1
  - Increase delays between requests
- Data Extraction Issues
  - Check if Sahibinden.com has changed their page structure
  - Update the selectors in the data extraction functions
  - Verify that the page is fully loaded before extraction
- BaseRow Integration Issues
  - Verify API credentials are correct
  - Check that the table structure matches the expected fields
  - Look for API rate limiting issues
Maintenance
To keep the scraper running smoothly:
- Regular Monitoring: Check the Actor runs for any errors or performance issues
- Updates: Update the code if Sahibinden.com changes their website structure
- Proxy Management: Ensure you have sufficient residential proxy credits
- Data Cleaning: Periodically clean old or irrelevant listings from BaseRow
Limitations
- The scraper is designed specifically for Sahibinden.com and may not work on other websites
- Cloudflare protection methods may change, requiring updates to the bypassing techniques
- Very high volume scraping may still trigger rate limiting despite all precautions
- Some listings may have incomplete data if the seller didn't provide all information