Metadata Scraper

Pricing

$20.00/month + usage

Developed by

Louis Deconinck

Maintained by Community

Automatically scrapes metadata such as the title, description, heading, and article text from websites. The Actor crawls the start URLs, follows pagination links, and scrapes the metadata from each detail page it finds.

5.0 (3)

Last modified

a year ago

Automation

Real estate

Features

Scrapes metadata from specified websites
Handles pagination and detail pages
Extracts title, description, heading, and article content
Configurable start URLs and maximum requests per crawl
Skips the URLs listed in urlsToIgnore, so repeated runs do not scrape the same pages twice

Input

Be sure to use JSON mode for the input and not Manual mode. Here's an overview of the input parameters:

startUrls: An array of objects containing:
- url: The starting URL for the scrape
- scrapeUrlGlobs: An array of URL patterns for detail pages to scrape
- paginationUrlGlobs: An array of URL patterns for pagination pages (optional)
maxRequestsPerCrawl: Maximum number of requests per crawl (default: 100)
urlsToIgnore: An array of URLs to ignore when processing (optional)

Here's an example of the input data structure:

{
  "startUrls": [
    {
      "url": "https://roger-hannah.co.uk/property-search/?search_properties=1&tenure=&property_type%5B%5D=Development&property_type%5B%5D=Industrial&size_min=0&size_max=1000000",
      "scrapeUrlGlobs": ["https://roger-hannah.co.uk/properties/*"],
      "paginationUrlGlobs": []
    }
  ],
  "maxRequestsPerCrawl": 100,
  "urlsToIgnore": [
    "https://roger-hannah.co.uk/properties/development-site-with-potential-for-10-houses-planning-permission/",
    "https://roger-hannah.co.uk/properties/lower-mill-mill-street/"
  ]
}
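
As a rough illustration, the input object above could be assembled in Python before being pasted into JSON mode. This is only a sketch: the helper name build_input and the example.com URLs are hypothetical, and the check that at least one scrape pattern is present is an assumption rather than documented behavior of the Actor.

```python
import json


def build_input(start_url, scrape_globs, pagination_globs=None,
                max_requests=100, urls_to_ignore=None):
    """Assemble an input object with the fields described above.

    Hypothetical helper, not part of the Actor.
    """
    # Assumption: a start URL without any detail-page pattern is a mistake.
    if not scrape_globs:
        raise ValueError("scrapeUrlGlobs needs at least one pattern")
    return {
        "startUrls": [
            {
                "url": start_url,
                "scrapeUrlGlobs": list(scrape_globs),
                # Optional field: defaults to an empty list.
                "paginationUrlGlobs": list(pagination_globs or []),
            }
        ],
        "maxRequestsPerCrawl": max_requests,
        # Optional field: defaults to an empty list.
        "urlsToIgnore": list(urls_to_ignore or []),
    }


actor_input = build_input(
    "https://example.com/listings.html",
    ["https://example.com/listings/*.html"],
    pagination_globs=["https://example.com/listings/page-*.html"],
)
print(json.dumps(actor_input, indent=2))
```

The printed JSON has the same shape as the example above and can be pasted into the input editor in JSON mode.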

Using Glob Patterns

Glob patterns are used to match URLs. They are similar to regular expressions but simpler and less expressive. The Actor uses them to recognize detail pages (scrapeUrlGlobs) and pagination pages (paginationUrlGlobs).

Here are some common glob patterns used in URL matching:

*: Matches any number of characters (except /) Example: https://example.com/*.html matches all HTML files in the root directory
**: Matches any number of characters (including /) Example: https://example.com/**/*.jpg matches all JPG files in any subdirectory
?: Matches exactly one character Example: https://example.com/page?.html matches page1.html, pageA.html, etc.
[...]: Matches any one character in the brackets Example: https://example.com/file[123].txt matches file1.txt, file2.txt, file3.txt
[!...]: Matches any one character not in the brackets Example: https://example.com/img[!0-9].png matches imgA.png but not img1.png
{...}: Matches any of the comma-separated patterns Example: https://example.com/{blog,news}/*.html matches both blog and news HTML files

Examples in the context of web scraping:

https://example.com/products/*.html: Matches all product detail pages
https://example.com/category/*/page-*.html: Matches pagination pages in all categories
https://example.com/{2021,2022,2023}/**: Matches all pages from specific years
https://example.com/page/*: Matches all pages directly under /page/
https://example.com/page/**: Matches all pages under /page/ at any depth
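
To try out the wildcards listed above, the following sketch translates a glob into a Python regular expression. It is an approximation written for illustration only. The functions glob_to_regex and matches are hypothetical, and the Actor's own matcher may treat edge cases (trailing slashes, query strings, nested braces) differently. Note also that ? is a wildcard in these patterns even though it starts the query string in a URL.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob using the wildcards listed above into a regex (sketch)."""
    out = []
    i = 0
    brace_depth = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")          # ** may cross "/" boundaries
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")       # * stays inside one path segment
        elif ch == "?":
            out.append("[^/]")        # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            if body:
                # Keep ranges such as 0-9; escape only backslash and caret.
                body = body.replace("\\", "\\\\").replace("^", "\\^")
                out.append("[" + ("^" if negate else "") + body + "]")
                i = end + 1
                continue
            out.append(re.escape(ch))  # unmatched or empty "[" is literal
        elif ch == "{":
            out.append("(?:")
            brace_depth += 1
        elif ch == "}" and brace_depth:
            out.append(")")
            brace_depth -= 1
        elif ch == "," and brace_depth:
            out.append("|")           # alternatives inside {...}
        else:
            out.append(re.escape(ch))
        i += 1
    if brace_depth:
        raise ValueError("unbalanced '{' in pattern: " + pattern)
    return re.compile("".join(out))


def matches(pattern, url):
    """Return True if the whole URL matches the glob pattern."""
    return glob_to_regex(pattern).fullmatch(url) is not None


print(matches("https://example.com/*.html", "https://example.com/index.html"))
print(matches("https://example.com/*.html", "https://example.com/a/b.html"))
```

Checking a handful of real URLs from the target site against each pattern in this way is a cheap sanity test before starting a crawl.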

When writing glob patterns for scrapeUrlGlobs and paginationUrlGlobs, make sure they accurately reflect the URL structure of the website you're scraping, so that all relevant pages are captured.

Output

The Actor outputs the following data for each scraped detail page (a property listing in the example below):

url: The URL of the scraped page
title: The title of a detail page
description: The description of a detail page
heading: The main heading of a detail page
article: The content of a detail page

Here's an example of the output data structure:

{
  "url": "https://roger-hannah.co.uk/properties/bolton-street/",
  "title": "Bolton Street - Roger Hannah",
  "description": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl...",
  "heading": "Bolton Street",
  "article": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl..."
}
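
One way to combine the output with the urlsToIgnore input is to feed the URLs from an earlier run into the next run, so pages that were already scraped are skipped. The sketch below assumes the earlier results are available as a list of dictionaries shaped like the output above. The helper urls_to_ignore_from and the sample items are hypothetical.

```python
def urls_to_ignore_from(previous_items, already_ignored=()):
    """Merge earlier ignore entries with the URLs of previously scraped items.

    Hypothetical helper: keeps the first occurrence of each URL, in order.
    """
    seen = set()
    merged = []
    candidates = list(already_ignored) + [item.get("url") for item in previous_items]
    for url in candidates:
        # Skip items without a URL and anything already collected.
        if url and url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


# Illustrative items only, shaped like the output example above.
previous_items = [
    {"url": "https://example.com/properties/a.html", "title": "A"},
    {"url": "https://example.com/properties/b.html", "title": "B"},
    {"url": "https://example.com/properties/a.html", "title": "A"},
    {"title": "item without a url"},
]

next_ignore = urls_to_ignore_from(
    previous_items,
    already_ignored=["https://example.com/properties/old.html"],
)
print(next_ignore)
```

The resulting list can be placed in the urlsToIgnore field of the next run's input.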
