🤖 LLM-Powered Web Scraper

An intelligent Apify Actor that uses Claude AI to automatically discover, test, and select the best Apify actors for your web scraping tasks. No manual configuration needed!

✨ Features

  • 🧠 AI-Powered Actor Discovery: Uses Claude AI to automatically find and test the best Apify actors for your target website
  • 🔄 Smart Retry Logic: Automatically adjusts parameters and retries failed attempts with different actors
  • 📊 Quality Assessment: Evaluates scraped data quality across multiple dimensions (completeness, relevance, structure, volume)
  • 🎯 Priority-Based Testing: Tests domain-specific actors first, then falls back to general-purpose ones
  • 📈 Real-time Progress: Tracks and reports scraping progress with detailed logging
  • 🔗 MCP Integration: Connects to Apify MCP Server for dynamic actor discovery and execution
  • ⚙️ Flexible Configuration: Extensive customization options for timeout, quality thresholds, and model selection

🚀 Quick Start

  1. Set up your Claude API key in the Actor input or as an environment variable
  2. Provide your target URL and describe what data you want to extract
  3. Run the Actor - it will automatically find and test the best scraping approach

Example Input

{
  "targetUrl": "https://example-ecommerce.com/products/",
  "extractionGoal": "Extract product information including title, price, description, and availability",
  "claudeApiKey": "sk-ant-api03-...",
  "maxActorAttempts": 5,
  "maxTimeMinutes": 20
}

📝 Input Configuration

Required Fields

  • targetUrl: The URL of the website you want to scrape
  • extractionGoal: Describe what data you want to extract from the website
  • claudeApiKey: Your Anthropic Claude API key for AI-powered analysis

Optional Configuration

  • maxActorAttempts (default: 10): Maximum number of different actors to try
  • maxRetriesPerActor (default: 3): Maximum retry attempts per actor
  • maxTimeMinutes (default: 30): Maximum total execution time in minutes
  • modelName (default: "claude-3-5-haiku-latest"): Claude model to use
  • debugMode (default: false): Enable detailed logging
  • preferSpecificActors (default: true): Prioritize domain-specific actors
  • minDataQualityScore (default: 70): Minimum quality score (0-100) to accept results
  • enableProxy (default: true): Use proxy for scraping requests

Available Claude Models

  • claude-3-5-haiku-latest - Fast & cost-effective (recommended)
  • claude-3-5-sonnet-latest - Balanced performance and quality
  • claude-3-opus-latest - Maximum quality (slower, more expensive)
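Full Input Example

For reference, here is a run input that spells out every optional field at its documented default (the required fields reuse the Example Input above). This is a sketch that assumes the apify-client Python package and an Apify API token; the actor ID <username>/llmscraper is only a placeholder for this Actor's real ID on the platform:

from apify_client import ApifyClient

# Field names and defaults are taken from the lists above; the API key is a placeholder.
run_input = {
    "targetUrl": "https://example-ecommerce.com/products/",
    "extractionGoal": "Extract product information including title, price, description, and availability",
    "claudeApiKey": "sk-ant-api03-...",
    "maxActorAttempts": 10,
    "maxRetriesPerActor": 3,
    "maxTimeMinutes": 30,
    "modelName": "claude-3-5-haiku-latest",
    "debugMode": False,
    "preferSpecificActors": True,
    "minDataQualityScore": 70,
    "enableProxy": True,
}

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.actor("<username>/llmscraper").call(run_input=run_input)  # placeholder actor ID
print(run["defaultDatasetId"])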

📊 Output

The Actor saves results to:

Dataset

Each scraped item with metadata:

{
  "url": "https://example.com",
  "data": {...},
  "quality_score": 0.85,
  "actor_used": "apify/web-scraper",
  "timestamp": "2025-07-24T11:30:00Z",
  "success": true,
  "extraction_goal": "Extract product information",
  "total_execution_time": 45.2,
  "attempts_made": 3
}

Key-Value Store

Summary information in SCRAPING_RESULT:

{
  "success": true,
  "quality_score": 0.85,
  "items_count": 25,
  "best_actor_id": "apify/web-scraper",
  "total_execution_time": 45.2,
  "attempts_made": 3,
  "progress_updates": [...],
  "actor_attempts": [...]
}
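
Once a run has finished, both stores can be read back programmatically. A minimal sketch using the apify-client Python package (the token and run ID are placeholders):

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.run("<RUN_ID>").get()  # placeholder run ID

# Scraped items land in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["quality_score"])

# The summary lives in the default key-value store under SCRAPING_RESULT
record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("SCRAPING_RESULT")
if record:
    print(record["value"]["items_count"], record["value"]["best_actor_id"])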

🔧 How It Works

  1. Actor Discovery: Connects to Apify MCP Server to discover available actors
  2. AI Analysis: Uses Claude to analyze the target website and select appropriate actors
  3. Smart Testing: Tests actors in priority order with intelligent parameter adjustment
  4. Quality Evaluation: Assesses data quality using multiple metrics
  5. Retry Logic: Automatically retries with different parameters if needed
  6. Result Selection: Returns the best results based on quality scores
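
In code terms, the flow above amounts to a nested attempt loop. The following is only an illustrative sketch with invented names; it is not the Actor's actual implementation:

from typing import Callable, Iterable, Optional

def select_best_result(
    candidates: Iterable[str],                    # actor IDs, already ranked by priority
    run_actor: Callable[[str, dict], list],       # runs one actor with the given parameters
    evaluate: Callable[[list], float],            # scores the returned items
    adjust_params: Callable[[dict, list], dict],  # tweaks parameters after a weak attempt
    initial_params: dict,
    max_actor_attempts: int = 10,
    max_retries_per_actor: int = 3,
    min_quality: float = 70.0,
) -> Optional[dict]:
    """Try candidate actors in priority order, retrying with adjusted parameters,
    and keep the highest-scoring result."""
    best = None
    for actor_id in list(candidates)[:max_actor_attempts]:
        params = dict(initial_params)
        for _ in range(max_retries_per_actor):
            items = run_actor(actor_id, params)
            score = evaluate(items)
            if best is None or score > best["quality_score"]:
                best = {"actor_used": actor_id, "items": items, "quality_score": score}
            if score >= min_quality:
                return best                        # good enough, stop early
            params = adjust_params(params, items)  # retry with adjusted parameters
    return best                                    # best effort across all attempts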

🏗️ Architecture

The Actor consists of several key components:

  • MCP Client (src/llmscraper/mcp/): Handles communication with Apify MCP Server
  • Claude Manager (src/llmscraper/claude/): Manages AI conversations and tool calls
  • LLM Scraper Actor (src/llmscraper/llm_scraper/): Main orchestration logic
  • Retry Logic (src/llmscraper/llm_scraper/retry_logic.py): Intelligent parameter adjustment
  • Quality Evaluator (src/llmscraper/llm_scraper/quality_evaluator.py): Data quality assessment
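
The four quality dimensions mentioned in the Features section (completeness, relevance, structure, volume) could be combined along these lines. The heuristics and weights below are invented for illustration and are not taken from quality_evaluator.py:

def score_items(items: list, extraction_goal: str) -> float:
    """Illustrative 0-100 quality score across completeness, relevance, structure and volume."""
    if not items:
        return 0.0
    goal_terms = {word.lower().strip(",.") for word in extraction_goal.split() if len(word) > 3}
    keys_per_item = [set(item.keys()) for item in items]
    all_keys = set().union(*keys_per_item)
    common_keys = set.intersection(*keys_per_item)
    # Completeness: average share of non-empty fields per item
    completeness = sum(
        sum(1 for value in item.values() if value not in (None, "", [])) / max(len(item), 1)
        for item in items
    ) / len(items)
    # Relevance: field names that echo the extraction goal wording
    relevance = len({key for key in all_keys if key.lower() in goal_terms}) / max(len(all_keys), 1)
    # Structure: how consistently the items share the same schema
    structure = len(common_keys) / max(len(all_keys), 1)
    # Volume: more items is better, saturating at 25
    volume = min(len(items) / 25, 1.0)
    return round(100 * (0.35 * completeness + 0.25 * relevance + 0.2 * structure + 0.2 * volume), 1)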

🔑 Environment Variables

  • ANTHROPIC_API_KEY: Your Anthropic Claude API key (alternative to input field)
  • APIFY_TOKEN: Automatically provided by Apify platform
  • MCP_SERVER_URL: Custom MCP server URL (optional)
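
Since the API key can arrive either way, a resolution helper might look like this (illustrative only, not the Actor's actual code):

import os

def resolve_claude_api_key(actor_input: dict) -> str:
    """Prefer the claudeApiKey input field, fall back to the ANTHROPIC_API_KEY environment variable."""
    key = actor_input.get("claudeApiKey") or os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise ValueError("Provide claudeApiKey in the input or set ANTHROPIC_API_KEY")
    return key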

⚡ Performance Tips

  1. Use Haiku Model: For most tasks, claude-3-5-haiku-latest provides the best speed/cost ratio
  2. Adjust Attempts: Reduce maxActorAttempts for faster results, increase for better coverage
  3. Quality Threshold: Lower minDataQualityScore if you're getting no results
  4. Time Limits: Set appropriate maxTimeMinutes based on your needs

🛠️ Development

Local Testing

# Install dependencies
pip install -r requirements.txt
# Or, using the project's virtual environment:
./venv/bin/pip install -r requirements.txt

# Set up environment
export ANTHROPIC_API_KEY=your_key_here

# Run the Actor locally
python3 main.py

# Or via the npm scripts:
npm run start        # uses system python3
npm run start:local  # uses the project's virtual environment

Project Structure

LLMScraper/
├── main.py                      # Actor entry point
├── src/llmscraper/
│   ├── mcp/                     # MCP client implementation
│   ├── claude/                  # Claude AI integration
│   ├── llm_scraper/             # Main scraper logic
│   │   ├── actor.py             # Main LLMScraperActor class
│   │   ├── models.py            # Input/output models
│   │   ├── retry_logic.py       # Intelligent retry logic
│   │   └── quality_evaluator.py # Data quality assessment
│   ├── scraping/                # Apify actor integrations
│   └── utils/                   # Configuration and utilities
├── .actor/
│   ├── actor.json               # Actor metadata
│   ├── input_schema.json        # Input validation schema
│   └── README.md                # This file
├── Dockerfile                   # Container configuration
├── requirements.txt             # Python dependencies
├── package.json                 # Node.js metadata
└── pyproject.toml               # Python packaging configuration

📚 API Reference

Main Function

import asyncio

from llmscraper.llm_scraper import LLMScraperActor, LLMScraperInput

# Create configuration
config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
)

# Run the scraper (run() is a coroutine, so call it from an async context)
async def main():
    scraper = LLMScraperActor(config)
    return await scraper.run(progress_callback=None)

result = asyncio.run(main())

Configuration

from llmscraper.llm_scraper.models import LLMScraperInput

config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
    max_actor_attempts=10,
    max_retries_per_actor=3,
    max_time_minutes=30,
    model_name="claude-3-5-haiku-latest",
    debug_mode=False,
    prefer_specific_actors=True,
    min_data_quality_score=0.7,  # note: the Python API expects 0.0-1.0; the input form uses 0-100
    enable_proxy=True,
)
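
If you are carrying a score over from the input form (0-100 scale) into the Python model (0.0-1.0 scale), divide by 100:

# Hypothetical conversion from the input form's 0-100 scale to the Python model's 0.0-1.0 scale.
form_score = 70
min_data_quality_score = form_score / 100  # 0.7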

📄 License

MIT License - see LICENSE file for details.

🆘 Support & Troubleshooting

Common Issues

  • API Key Issues: Ensure your Claude API key is valid and has sufficient credits
  • No Results Found: Try reducing minDataQualityScore or increasing maxActorAttempts
  • Timeout Errors: Increase maxTimeMinutes for complex websites
  • Quality Score Too Low: Adjust your extractionGoal to be more specific

Debugging

  • Enable debugMode: true for detailed logging
  • Check the Actor logs for step-by-step execution details
  • Verify the target URL is accessible and returns content
  • Monitor the progress updates in the key-value store

Performance Optimization

  • Use claude-3-5-haiku-latest for faster, cost-effective processing
  • Set appropriate maxActorAttempts based on your time/quality requirements
  • Enable preferSpecificActors to prioritize domain-specific solutions