🤖 LLM-Powered Web Scraper

An intelligent Apify Actor that uses Claude AI to automatically discover, test, and select the best Apify actors for your web scraping tasks. No manual configuration needed!

✨ Features

  • 🧠 AI-Powered Actor Discovery: Uses Claude AI to automatically find and test the best Apify actors for your target website
  • 🔄 Smart Retry Logic: Automatically adjusts parameters and retries failed attempts with different actors
  • 📊 Quality Assessment: Evaluates scraped data quality across multiple dimensions (completeness, relevance, structure, volume)
  • 🎯 Priority-Based Testing: Tests domain-specific actors first, then falls back to general-purpose ones
  • 📈 Real-time Progress: Tracks and reports scraping progress with detailed logging
  • 🔗 MCP Integration: Connects to Apify MCP Server for dynamic actor discovery and execution
  • ⚙️ Flexible Configuration: Extensive customization options for timeout, quality thresholds, and model selection

🚀 Quick Start

  1. Set up your Claude API key in the Actor input or as an environment variable
  2. Provide your target URL and describe what data you want to extract
  3. Run the Actor - it will automatically find and test the best scraping approach

Example Input

{
  "targetUrl": "https://example-ecommerce.com/products/",
  "extractionGoal": "Extract product information including title, price, description, and availability",
  "claudeApiKey": "sk-ant-api03-...",
  "maxActorAttempts": 5,
  "maxTimeMinutes": 20
}

📝 Input Configuration

Required Fields

  • targetUrl: The URL of the website you want to scrape
  • extractionGoal: Describe what data you want to extract from the website
  • claudeApiKey: Your Anthropic Claude API key for AI-powered analysis

Optional Configuration

  • maxActorAttempts (default: 10): Maximum number of different actors to try
  • maxRetriesPerActor (default: 3): Maximum retry attempts per actor
  • maxTimeMinutes (default: 30): Maximum total execution time in minutes
  • modelName (default: "claude-3-5-haiku-latest"): Claude model to use
  • debugMode (default: false): Enable detailed logging
  • preferSpecificActors (default: true): Prioritize domain-specific actors
  • minDataQualityScore (default: 70): Minimum quality score (0-100) to accept results
  • enableProxy (default: true): Use proxy for scraping requests

Available Claude Models

  • claude-3-5-haiku-latest - Fast & cost-effective (recommended)
  • claude-3-5-sonnet-latest - Balanced performance and quality
  • claude-3-opus-latest - Maximum quality (slower, more expensive)
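Full Input Example

For reference, here is a run input that spells out every optional field at its documented default (the required fields reuse the Example Input above). This is a sketch that assumes the apify-client Python package and an Apify API token; the actor ID <username>/llmscraper is only a placeholder for this Actor's real ID on the platform:

from apify_client import ApifyClient

# Field names and defaults are taken from the lists above; the API key is a placeholder.
run_input = {
    "targetUrl": "https://example-ecommerce.com/products/",
    "extractionGoal": "Extract product information including title, price, description, and availability",
    "claudeApiKey": "sk-ant-api03-...",
    "maxActorAttempts": 10,
    "maxRetriesPerActor": 3,
    "maxTimeMinutes": 30,
    "modelName": "claude-3-5-haiku-latest",
    "debugMode": False,
    "preferSpecificActors": True,
    "minDataQualityScore": 70,
    "enableProxy": True,
}

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.actor("<username>/llmscraper").call(run_input=run_input)  # placeholder actor ID
print(run["defaultDatasetId"])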

📊 Output

The Actor saves results to:

Dataset

Each scraped item with metadata:

{
  "url": "https://example.com",
  "data": {...},
  "quality_score": 0.85,
  "actor_used": "apify/web-scraper",
  "timestamp": "2025-07-24T11:30:00Z",
  "success": true,
  "extraction_goal": "Extract product information",
  "total_execution_time": 45.2,
  "attempts_made": 3
}

Key-Value Store

Summary information in SCRAPING_RESULT:

{
  "success": true,
  "quality_score": 0.85,
  "items_count": 25,
  "best_actor_id": "apify/web-scraper",
  "total_execution_time": 45.2,
  "attempts_made": 3,
  "progress_updates": [...],
  "actor_attempts": [...]
}
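
Once a run has finished, both stores can be read back programmatically. A minimal sketch using the apify-client Python package (the token and run ID are placeholders):

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.run("<RUN_ID>").get()  # placeholder run ID

# Scraped items land in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["quality_score"])

# The summary lives in the default key-value store under SCRAPING_RESULT
record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("SCRAPING_RESULT")
if record:
    print(record["value"]["items_count"], record["value"]["best_actor_id"])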

🔧 How It Works

  1. Actor Discovery: Connects to Apify MCP Server to discover available actors
  2. AI Analysis: Uses Claude to analyze the target website and select appropriate actors
  3. Smart Testing: Tests actors in priority order with intelligent parameter adjustment
  4. Quality Evaluation: Assesses data quality using multiple metrics
  5. Retry Logic: Automatically retries with different parameters if needed
  6. Result Selection: Returns the best results based on quality scores
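
In code terms, the flow above amounts to a nested attempt loop. The following is only an illustrative sketch with invented names; it is not the Actor's actual implementation:

from typing import Callable, Iterable, Optional

def select_best_result(
    candidates: Iterable[str],                    # actor IDs, already ranked by priority
    run_actor: Callable[[str, dict], list],       # runs one actor with the given parameters
    evaluate: Callable[[list], float],            # scores the returned items
    adjust_params: Callable[[dict, list], dict],  # tweaks parameters after a weak attempt
    initial_params: dict,
    max_actor_attempts: int = 10,
    max_retries_per_actor: int = 3,
    min_quality: float = 70.0,
) -> Optional[dict]:
    """Try candidate actors in priority order, retrying with adjusted parameters,
    and keep the highest-scoring result."""
    best = None
    for actor_id in list(candidates)[:max_actor_attempts]:
        params = dict(initial_params)
        for _ in range(max_retries_per_actor):
            items = run_actor(actor_id, params)
            score = evaluate(items)
            if best is None or score > best["quality_score"]:
                best = {"actor_used": actor_id, "items": items, "quality_score": score}
            if score >= min_quality:
                return best                        # good enough, stop early
            params = adjust_params(params, items)  # retry with adjusted parameters
    return best                                    # best effort across all attempts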

🏗️ Architecture

The Actor consists of several key components:

  • MCP Client (src/llmscraper/mcp/): Handles communication with Apify MCP Server
  • Claude Manager (src/llmscraper/claude/): Manages AI conversations and tool calls
  • LLM Scraper Actor (src/llmscraper/llm_scraper/): Main orchestration logic
  • Retry Logic (src/llmscraper/llm_scraper/retry_logic.py): Intelligent parameter adjustment
  • Quality Evaluator (src/llmscraper/llm_scraper/quality_evaluator.py): Data quality assessment
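
The four quality dimensions mentioned in the Features section (completeness, relevance, structure, volume) could be combined along these lines. The heuristics and weights below are invented for illustration and are not taken from quality_evaluator.py:

def score_items(items: list, extraction_goal: str) -> float:
    """Illustrative 0-100 quality score across completeness, relevance, structure and volume."""
    if not items:
        return 0.0
    goal_terms = {word.lower().strip(",.") for word in extraction_goal.split() if len(word) > 3}
    keys_per_item = [set(item.keys()) for item in items]
    all_keys = set().union(*keys_per_item)
    common_keys = set.intersection(*keys_per_item)
    # Completeness: average share of non-empty fields per item
    completeness = sum(
        sum(1 for value in item.values() if value not in (None, "", [])) / max(len(item), 1)
        for item in items
    ) / len(items)
    # Relevance: field names that echo the extraction goal wording
    relevance = len({key for key in all_keys if key.lower() in goal_terms}) / max(len(all_keys), 1)
    # Structure: how consistently the items share the same schema
    structure = len(common_keys) / max(len(all_keys), 1)
    # Volume: more items is better, saturating at 25
    volume = min(len(items) / 25, 1.0)
    return round(100 * (0.35 * completeness + 0.25 * relevance + 0.2 * structure + 0.2 * volume), 1)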

🔑 Environment Variables

  • ANTHROPIC_API_KEY: Your Anthropic Claude API key (alternative to input field)
  • APIFY_TOKEN: Automatically provided by Apify platform
  • MCP_SERVER_URL: Custom MCP server URL (optional)
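
Since the API key can arrive either way, a resolution helper might look like this (illustrative only, not the Actor's actual code):

import os

def resolve_claude_api_key(actor_input: dict) -> str:
    """Prefer the claudeApiKey input field, fall back to the ANTHROPIC_API_KEY environment variable."""
    key = actor_input.get("claudeApiKey") or os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise ValueError("Provide claudeApiKey in the input or set ANTHROPIC_API_KEY")
    return key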

⚡ Performance Tips

  1. Use Haiku Model: For most tasks, claude-3-5-haiku-latest provides the best speed/cost ratio
  2. Adjust Attempts: Reduce maxActorAttempts for faster results, increase for better coverage
  3. Quality Threshold: Lower minDataQualityScore if you're getting no results
  4. Time Limits: Set appropriate maxTimeMinutes based on your needs

🛠️ Development

Local Testing

# Install dependencies
pip install -r requirements.txt
# Or, using the project's virtual environment:
./venv/bin/pip install -r requirements.txt

# Set up environment
export ANTHROPIC_API_KEY=your_key_here

# Run the Actor locally
python3 main.py

# Or via the npm scripts:
npm run start        # uses system python3
npm run start:local  # uses the project's virtual environment

Project Structure

LLMScraper/
├── main.py                      # Actor entry point
├── src/llmscraper/
│   ├── mcp/                     # MCP client implementation
│   ├── claude/                  # Claude AI integration
│   ├── llm_scraper/             # Main scraper logic
│   │   ├── actor.py             # Main LLMScraperActor class
│   │   ├── models.py            # Input/output models
│   │   ├── retry_logic.py       # Intelligent retry logic
│   │   └── quality_evaluator.py # Data quality assessment
│   ├── scraping/                # Apify actor integrations
│   └── utils/                   # Configuration and utilities
├── .actor/
│   ├── actor.json               # Actor metadata
│   ├── input_schema.json        # Input validation schema
│   └── README.md                # This file
├── Dockerfile                   # Container configuration
├── requirements.txt             # Python dependencies
├── package.json                 # Node.js metadata
└── pyproject.toml               # Python packaging configuration

📚 API Reference

Main Function

import asyncio

from llmscraper.llm_scraper import LLMScraperActor, LLMScraperInput

# Create configuration
config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
)

# Run the scraper (run() is a coroutine, so call it from an async context)
async def main():
    scraper = LLMScraperActor(config)
    return await scraper.run(progress_callback=None)

result = asyncio.run(main())

Configuration

from llmscraper.llm_scraper.models import LLMScraperInput

config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
    max_actor_attempts=10,
    max_retries_per_actor=3,
    max_time_minutes=30,
    model_name="claude-3-5-haiku-latest",
    debug_mode=False,
    prefer_specific_actors=True,
    min_data_quality_score=0.7,  # note: the Python API expects 0.0-1.0; the input form uses 0-100
    enable_proxy=True,
)
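
If you are carrying a score over from the input form (0-100 scale) into the Python model (0.0-1.0 scale), divide by 100:

# Hypothetical conversion from the input form's 0-100 scale to the Python model's 0.0-1.0 scale.
form_score = 70
min_data_quality_score = form_score / 100  # 0.7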

📄 License

MIT License - see LICENSE file for details.

🆘 Support & Troubleshooting

Common Issues

  • API Key Issues: Ensure your Claude API key is valid and has sufficient credits
  • No Results Found: Try reducing minDataQualityScore or increasing maxActorAttempts
  • Timeout Errors: Increase maxTimeMinutes for complex websites
  • Quality Score Too Low: Adjust your extractionGoal to be more specific

Debugging

  • Enable debugMode: true for detailed logging
  • Check the Actor logs for step-by-step execution details
  • Verify the target URL is accessible and returns content
  • Monitor the progress updates in the key-value store

Performance Optimization

  • Use claude-3-5-haiku-latest for faster, cost-effective processing
  • Set appropriate maxActorAttempts based on your time/quality requirements
  • Enable preferSpecificActors to prioritize domain-specific solutions