Kaggle Scraper

1 day trial then $19.99/month - No credit card required now

muhammetakkurtt/kaggle-scraper

Extracts dataset information from Kaggle for a user-defined search term. Collects dataset metadata, categories, usability ratings, and file information. The number of result pages to scrape is configurable. Suited to researchers and data scientists who want a quick overview of Kaggle datasets.

Kaggle Dataset Scraper

This project is an Apify actor designed to scrape dataset information from Kaggle. It collects details such as dataset name, owner, creator, categories, and more based on a given search term.

Features

  • Searches for datasets on Kaggle based on a given search term.
  • Collects comprehensive information about each dataset, including metadata, categories, and usability ratings.
  • Allows specifying the number of pages to scrape.
  • Saves the collected data to the Apify dataset.

Usage

  1. Run this actor in the Apify console.
  2. Provide the desired inputs:
    • searchTerm: The topic to search for datasets on Kaggle.
    • totalPages: The total number of pages to scrape (default: 3).

Example Input

{
  "searchTerm": "football",
  "totalPages": 5
}
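
As a minimal sketch, the input above could be built and checked in Python before a run is started. The helper names `build_input` and `run_actor` are hypothetical and not part of the actor. The rule that `totalPages` must be a positive integer is an assumption, since only the default of 3 is documented here. `run_actor` additionally assumes the separately installed `apify-client` package and a valid Apify API token.

```python
import json


def build_input(search_term: str, total_pages: int = 3) -> dict:
    """Build the actor input described above (hypothetical helper)."""
    term = search_term.strip()
    if not term:
        raise ValueError("searchTerm must not be empty")
    # Assumption: totalPages has to be a positive integer; only the
    # default value (3) is documented.
    if isinstance(total_pages, bool) or not isinstance(total_pages, int) or total_pages < 1:
        raise ValueError("totalPages must be a positive integer")
    return {"searchTerm": term, "totalPages": total_pages}


def run_actor(token: str, run_input: dict) -> list:
    """Start a run and return its dataset items.

    Assumes the third-party `apify-client` package is installed.
    """
    # Imported lazily so build_input() works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # call() starts the actor and waits for the run to finish.
    run = client.actor("muhammetakkurtt/kaggle-scraper").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Prints the same JSON as the example input.
    print(json.dumps(build_input("football", total_pages=5), indent=2))
```

Validating locally keeps obviously malformed input from consuming a run. The network call is kept in its own function so the validation can be used and tested without credentials.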

Output

The collected data is saved to the Apify dataset. The output data includes the following fields:

  • datasetUrl: URL of the dataset
  • ownerAvatarUrl: URL of the owner's avatar image
  • ownerName: Name of the dataset owner
  • ownerUrl: URL of the owner's profile
  • ownerUserId: User ID of the owner
  • ownerTier: Tier of the owner (e.g., CONTRIBUTOR)
  • creatorName: Name of the dataset creator
  • creatorUrl: URL of the creator's profile
  • creatorUserId: User ID of the creator
  • scriptCount: Number of scripts associated with the dataset
  • scriptsUrl: URL to view scripts related to the dataset
  • forumUrl: URL of the dataset's discussion forum
  • viewCount: Number of views
  • downloadCount: Number of downloads
  • dateCreated: Creation date of the dataset
  • dateUpdated: Last update date of the dataset
  • totalVotes: Total number of votes
  • datasetId: Unique identifier of the dataset
  • categories: List of categories the dataset belongs to, including:
    • id: Category ID
    • name: Category name
    • fullPath: Full path of the category
    • description: Description of the category
    • datasetCount: Number of datasets in the category
    • competitionCount: Number of competitions in the category
    • notebookCount: Number of notebooks in the category
    • modelCount: Number of models in the category
  • licenseName: Name of the dataset license
  • licenseShortName: Short name of the license
  • datasetSize: Size of the dataset
  • commonFileTypes: Types of files in the dataset, including:
    • fileType: Type of the file
    • count: Number of files of this type
    • totalSize: Total size of files of this type
  • downloadUrl: URL to download the dataset
  • usabilityRating: Various usability scores for the dataset, including:
    • score: Overall usability score
    • columnDescriptionScore: Score for column descriptions
    • coverImageScore: Score for cover image
    • fileDescriptionScore: Score for file descriptions
    • fileFormatScore: Score for file formats
    • overviewScore: Score for dataset overview
    • publicKernelScore: Score for public kernels
    • subtitleScore: Score for subtitle
    • tagScore: Score for tags
  • datasetSlug: Slug of the dataset
  • rank: Rank of the dataset
  • medalUrl: URL of the medal image (if any)
  • datasource: Additional metadata about the dataset source, including:
    • datasetId: Dataset ID
    • currentDatasetVersionId: Current version ID of the dataset
    • currentDatasetVersionNumber: Current version number of the dataset
    • type: Type of the datasource
    • diffType: Type of difference
    • title: Title of the dataset
    • overview: Overview of the dataset
    • thumbnailImageUrl: URL of the dataset thumbnail image
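
Because several of these fields are nested (`categories`, `commonFileTypes`, `usabilityRating`, `datasource`), it can help to flatten a record before analysis. The following is a rough sketch only: `summarize` and the keys of the row it returns are hypothetical names, and the sample record reuses a few values from the example output on this page.

```python
def summarize(item: dict) -> dict:
    """Flatten one output record into a compact row (hypothetical helper)."""
    # Nested objects may be missing or null, so fall back to empty containers.
    rating = item.get("usabilityRating") or {}
    source = item.get("datasource") or {}
    categories = item.get("categories") or []
    file_types = item.get("commonFileTypes") or []
    return {
        "title": source.get("title"),
        "url": item.get("datasetUrl"),
        "owner": item.get("ownerName"),
        "downloads": item.get("downloadCount", 0),
        "votes": item.get("totalVotes", 0),
        "usability": rating.get("score"),
        "categories": [c.get("name") for c in categories],
        "file_types": {f.get("fileType"): f.get("count") for f in file_types},
    }


if __name__ == "__main__":
    # Trimmed-down record using values from the example output.
    sample = {
        "datasetUrl": "https://www.kaggle.com/datasets/secareanualin/football-events",
        "ownerName": "Alin Secareanu",
        "downloadCount": 31376,
        "totalVotes": 708,
        "categories": [{"id": 2500, "name": "games"}, {"id": 2603, "name": "football"}],
        "commonFileTypes": [{"fileType": "DATASET_FILE_TYPE_CSV", "count": 2}],
        "usabilityRating": {"score": 0.7647059},
        "datasource": {"title": "Football Events"},
    }
    print(summarize(sample))
```

The `or {}` and `or []` fallbacks are a defensive choice; whether the actor ever omits these nested objects is not stated here.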

Example Output

{
  "datasetUrl": "https://www.kaggle.com/datasets/secareanualin/football-events",
  "ownerAvatarUrl": "https://storage.googleapis.com/kaggle-avatars/thumbnails/360904-fb.jpg",
  "ownerName": "Alin Secareanu",
  "ownerUrl": "https://www.kaggle.com/secareanualin",
  "ownerUserId": 360904,
  "ownerTier": "CONTRIBUTOR",
  "creatorName": "Alin Secareanu",
  "creatorUrl": "https://www.kaggle.com/secareanualin",
  "creatorUserId": 360904,
  "scriptCount": 111,
  "scriptsUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/kernels",
  "forumUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/discussion",
  "viewCount": 222949,
  "downloadCount": 31376,
  "dateCreated": "2017-01-25T01:19:19.890Z",
  "dateUpdated": "2017-01-25T01:19:19.907Z",
  "totalVotes": 708,
  "datasetId": 712,
  "categories": [
    {
      "id": 2200,
      "name": "arts and entertainment",
      "fullPath": "subject > arts and entertainment",
      "description": "Activities that holds the attention and interest of an audience, or gives pleasure and delight. It can be an idea or a task, but is more likely to be one of the activities or events that have developed over thousands of years specifically for the purpose of keeping an audience's attention.",
      "datasetCount": 51232,
      "competitionCount": 3,
      "notebookCount": 22807,
      "modelCount": 4
    },
    {
      "id": 2500,
      "name": "games",
      "fullPath": "subject > culture and humanities > games",
      "description": "One of the hallmarks of intelligence is the use of games and toys to occupy free time and develop intellectually. Often stored in Mom's basement.",
      "datasetCount": 7935,
      "competitionCount": 7,
      "notebookCount": 34503,
      "modelCount": 6
    },
    {
      "id": 2603,
      "name": "football",
      "fullPath": "subject > health and fitness > exercise > sports > football",
      "description": "Some call it association football, some call it soccer, most call it sport ball. Come analyze the teams and players of the beautiful game.",
      "datasetCount": 4085,
      "competitionCount": 13,
      "notebookCount": 970,
      "modelCount": 1
    }
  ],
  "licenseName": "Unknown",
  "licenseShortName": "Unknown",
  "datasetSize": "21.12 MB",
  "commonFileTypes": [
    {
      "fileType": "DATASET_FILE_TYPE_CSV",
      "count": 2,
      "totalSize": "174.44 MB"
    },
    {
      "fileType": "DATASET_FILE_TYPE_OTHER",
      "count": 1,
      "totalSize": "1.28 KB"
    }
  ],
  "downloadUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/download?datasetVersionNumber=1",
  "usabilityRating": {
    "score": 0.7647059,
    "columnDescriptionScore": 1,
    "coverImageScore": 1,
    "fileDescriptionScore": 1,
    "fileFormatScore": 1,
    "overviewScore": 1,
    "publicKernelScore": 1,
    "subtitleScore": 1,
    "tagScore": 1
  },
  "datasetSlug": "football-events",
  "rank": 1,
  "medalUrl": "https://www.kaggle.com/static/images/medals/datasets/goldl@2x.png",
  "datasource": {
    "datasetId": 712,
    "currentDatasetVersionId": 1336,
    "currentDatasetVersionNumber": 1,
    "type": "FILESET",
    "diffType": "VERSIONED",
    "title": "Football Events",
    "overview": "More than 900,000 events from 9,074 football games across Europe",
    "thumbnailImageUrl": "https://storage.googleapis.com/kaggle-datasets-images/712/1336/32bcaf498efc8122a07392edf71416c0/dataset-thumbnail.jpg"
  }
}

This example output shows the structured data of a single dataset. The actual output will be a list of similar objects for all processed datasets.
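
Since the result is a list of such objects, a common next step is to filter and rank them. The sketch below is one possible approach, not part of the actor: the helper names `top_datasets` and `to_csv`, the 0.5 usability threshold, and the chosen CSV columns are all assumptions made for illustration, and every sample record except `football-events` is invented.

```python
import csv
import io


def top_datasets(items, min_usability=0.5, limit=10):
    """Keep records at or above a usability score, most downloaded first."""
    kept = [
        it for it in items
        if ((it.get("usabilityRating") or {}).get("score") or 0) >= min_usability
    ]
    kept.sort(key=lambda it: it.get("downloadCount") or 0, reverse=True)
    return kept[:limit]


def to_csv(items, fields=("datasetSlug", "ownerName", "downloadCount", "totalVotes")):
    """Serialize selected top-level fields of each record to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for it in items:
        # Missing fields become empty cells.
        writer.writerow({f: it.get(f, "") for f in fields})
    return buf.getvalue()


if __name__ == "__main__":
    items = [
        # Values taken from the example output.
        {"datasetSlug": "football-events", "ownerName": "Alin Secareanu",
         "downloadCount": 31376, "totalVotes": 708,
         "usabilityRating": {"score": 0.7647059}},
        # Invented records, for illustration only.
        {"datasetSlug": "made-up-a", "ownerName": "Example Owner",
         "downloadCount": 50000, "totalVotes": 10,
         "usabilityRating": {"score": 0.3}},
        {"datasetSlug": "made-up-b", "ownerName": "Example Owner",
         "downloadCount": 40000, "totalVotes": 20,
         "usabilityRating": {"score": 0.9}},
    ]
    print(to_csv(top_datasets(items)))
```

With the sample data, `made-up-a` is dropped by the threshold and `made-up-b` is listed ahead of `football-events` because it has more downloads.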

Notes

  • The collected data is stored in the run's default Apify dataset.
Developer
Maintained by Community
  • Created in Oct 2024