Kaggle Scraper
1 day trial then $19.99/month - No credit card required now
Extracts dataset information from Kaggle based on user-defined search terms. Collects dataset metadata, categories, usability ratings, and file information. The number of result pages to scrape is configurable. Intended for researchers and data scientists who want a quick overview of Kaggle datasets.
Kaggle Dataset Scraper
This project is an Apify actor designed to scrape dataset information from Kaggle. It collects details such as dataset name, owner, creator, categories, and more based on a given search term.
Features
- Searches for datasets on Kaggle based on a given search term.
- Collects comprehensive information about each dataset, including metadata, categories, and usability ratings.
- Allows specifying the number of pages to scrape.
- Saves the collected data to the Apify dataset.
Usage
- Run this actor in the Apify console.
- Provide the desired inputs:
  - searchTerm: The topic to search for datasets on Kaggle.
  - totalPages: The total number of pages to scrape (default: 3).
Example Input
{
  "searchTerm": "football",
  "totalPages": 5
}
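As a minimal sketch, the input above could be assembled and checked in Python before it is pasted into the Apify console. The helper name `build_input` and its validation rules (non-empty search term, at least one page) are assumptions made for illustration; only the `searchTerm` and `totalPages` keys and the default of 3 come from this README.

```python
import json


def build_input(search_term: str, total_pages: int = 3) -> dict:
    """Return an input object with the two documented keys.

    The checks below are assumptions, not rules stated by the actor.
    """
    term = search_term.strip()
    if not term:
        raise ValueError("searchTerm must not be empty")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(total_pages, bool) or not isinstance(total_pages, int):
        raise ValueError("totalPages must be an integer")
    if total_pages < 1:
        raise ValueError("totalPages must be at least 1")
    return {"searchTerm": term, "totalPages": total_pages}


if __name__ == "__main__":
    # Prints JSON that can be pasted into the actor's input editor.
    print(json.dumps(build_input("football", 5), indent=2))
```

Leaving `total_pages` out falls back to the documented default of 3.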
Output
The collected data is saved to the Apify dataset. The output data includes the following fields:
- datasetUrl: URL of the dataset
- ownerAvatarUrl: URL of the owner's avatar image
- ownerName: Name of the dataset owner
- ownerUrl: URL of the owner's profile
- ownerUserId: User ID of the owner
- ownerTier: Tier of the owner (e.g., CONTRIBUTOR)
- creatorName: Name of the dataset creator
- creatorUrl: URL of the creator's profile
- creatorUserId: User ID of the creator
- scriptCount: Number of scripts associated with the dataset
- scriptsUrl: URL to view scripts related to the dataset
- forumUrl: URL of the dataset's discussion forum
- viewCount: Number of views
- downloadCount: Number of downloads
- dateCreated: Creation date of the dataset
- dateUpdated: Last update date of the dataset
- totalVotes: Total number of votes
- datasetId: Unique identifier of the dataset
- categories: List of categories the dataset belongs to, including:
  - id: Category ID
  - name: Category name
  - fullPath: Full path of the category
  - description: Description of the category
  - datasetCount: Number of datasets in the category
  - competitionCount: Number of competitions in the category
  - notebookCount: Number of notebooks in the category
  - modelCount: Number of models in the category
- licenseName: Name of the dataset license
- licenseShortName: Short name of the license
- datasetSize: Size of the dataset
- commonFileTypes: Types of files in the dataset, including:
  - fileType: Type of the file
  - count: Number of files of this type
  - totalSize: Total size of files of this type
- downloadUrl: URL to download the dataset
- usabilityRating: Various usability scores for the dataset, including:
  - score: Overall usability score
  - columnDescriptionScore: Score for column descriptions
  - coverImageScore: Score for cover image
  - fileDescriptionScore: Score for file descriptions
  - fileFormatScore: Score for file formats
  - overviewScore: Score for dataset overview
  - publicKernelScore: Score for public kernels
  - subtitleScore: Score for subtitle
  - tagScore: Score for tags
- datasetSlug: Slug of the dataset
- rank: Rank of the dataset
- medalUrl: URL of the medal image (if any)
- datasource: Additional metadata about the dataset source, including:
  - datasetId: Dataset ID
  - currentDatasetVersionId: Current version ID of the dataset
  - currentDatasetVersionNumber: Current version number of the dataset
  - type: Type of the datasource
  - diffType: Type of difference
  - title: Title of the dataset
  - overview: Overview of the dataset
  - thumbnailImageUrl: URL of the dataset thumbnail image
Example Output
{
  "datasetUrl": "https://www.kaggle.com/datasets/secareanualin/football-events",
  "ownerAvatarUrl": "https://storage.googleapis.com/kaggle-avatars/thumbnails/360904-fb.jpg",
  "ownerName": "Alin Secareanu",
  "ownerUrl": "https://www.kaggle.com/secareanualin",
  "ownerUserId": 360904,
  "ownerTier": "CONTRIBUTOR",
  "creatorName": "Alin Secareanu",
  "creatorUrl": "https://www.kaggle.com/secareanualin",
  "creatorUserId": 360904,
  "scriptCount": 111,
  "scriptsUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/kernels",
  "forumUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/discussion",
  "viewCount": 222949,
  "downloadCount": 31376,
  "dateCreated": "2017-01-25T01:19:19.890Z",
  "dateUpdated": "2017-01-25T01:19:19.907Z",
  "totalVotes": 708,
  "datasetId": 712,
  "categories": [
    {
      "id": 2200,
      "name": "arts and entertainment",
      "fullPath": "subject > arts and entertainment",
      "description": "Activities that holds the attention and interest of an audience, or gives pleasure and delight. It can be an idea or a task, but is more likely to be one of the activities or events that have developed over thousands of years specifically for the purpose of keeping an audience's attention.",
      "datasetCount": 51232,
      "competitionCount": 3,
      "notebookCount": 22807,
      "modelCount": 4
    },
    {
      "id": 2500,
      "name": "games",
      "fullPath": "subject > culture and humanities > games",
      "description": "One of the hallmarks of intelligence is the use of games and toys to occupy free time and develop intellectually. Often stored in Mom's basement.",
      "datasetCount": 7935,
      "competitionCount": 7,
      "notebookCount": 34503,
      "modelCount": 6
    },
    {
      "id": 2603,
      "name": "football",
      "fullPath": "subject > health and fitness > exercise > sports > football",
      "description": "Some call it association football, some call it soccer, most call it sport ball. \nCome analyze the teams and players of the beautiful game.",
      "datasetCount": 4085,
      "competitionCount": 13,
      "notebookCount": 970,
      "modelCount": 1
    }
  ],
  "licenseName": "Unknown",
  "licenseShortName": "Unknown",
  "datasetSize": "21.12 MB",
  "commonFileTypes": [
    {
      "fileType": "DATASET_FILE_TYPE_CSV",
      "count": 2,
      "totalSize": "174.44 MB"
    },
    {
      "fileType": "DATASET_FILE_TYPE_OTHER",
      "count": 1,
      "totalSize": "1.28 KB"
    }
  ],
  "downloadUrl": "https://www.kaggle.com/datasets/secareanualin/football-events/download?datasetVersionNumber=1",
  "usabilityRating": {
    "score": 0.7647059,
    "columnDescriptionScore": 1,
    "coverImageScore": 1,
    "fileDescriptionScore": 1,
    "fileFormatScore": 1,
    "overviewScore": 1,
    "publicKernelScore": 1,
    "subtitleScore": 1,
    "tagScore": 1
  },
  "datasetSlug": "football-events",
  "rank": 1,
  "medalUrl": "https://www.kaggle.com/static/images/medals/datasets/goldl@2x.png",
  "datasource": {
    "datasetId": 712,
    "currentDatasetVersionId": 1336,
    "currentDatasetVersionNumber": 1,
    "type": "FILESET",
    "diffType": "VERSIONED",
    "title": "Football Events",
    "overview": "More than 900,000 events from 9,074 football games across Europe",
    "thumbnailImageUrl": "https://storage.googleapis.com/kaggle-datasets-images/712/1336/32bcaf498efc8122a07392edf71416c0/dataset-thumbnail.jpg"
  }
}
This example output shows the structured data of a single dataset. The actual output will be a list of similar objects for all processed datasets.
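Because each record is nested, downstream code will often want a flatter view. The following is a rough sketch, not part of the actor: `parse_size`, `summarize`, and `SAMPLE` are hypothetical names, `SAMPLE` is a trimmed copy of the example record above, and the decimal size units (1 KB = 1000 bytes) are an assumption, since the README does not say how sizes are computed.

```python
import json

# Assumption: decimal units. The README does not state the base used.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

# Trimmed copy of the example record shown above.
SAMPLE = {
    "datasetUrl": "https://www.kaggle.com/datasets/secareanualin/football-events",
    "ownerName": "Alin Secareanu",
    "totalVotes": 708,
    "categories": [
        {"id": 2200, "name": "arts and entertainment"},
        {"id": 2500, "name": "games"},
        {"id": 2603, "name": "football"},
    ],
    "datasetSize": "21.12 MB",
    "usabilityRating": {"score": 0.7647059},
    "datasource": {"title": "Football Events"},
}


def parse_size(text: str) -> float:
    """Convert a string such as "21.12 MB" to an approximate byte count."""
    value, unit = text.split()
    return float(value) * _UNITS[unit.upper()]


def summarize(record: dict) -> dict:
    """Pick a few documented fields out of one output record."""
    return {
        "title": record["datasource"]["title"],
        "url": record["datasetUrl"],
        "owner": record["ownerName"],
        "votes": record["totalVotes"],
        "usability": record["usabilityRating"]["score"],
        "categories": [c["name"] for c in record.get("categories", [])],
        "size_bytes": parse_size(record["datasetSize"]),
    }


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE), indent=2))
```

Applied to the full output list, `[summarize(r) for r in records]` gives one flat row per dataset, which is easier to sort or load into a table.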
Notes
- The collected data is stored in Apify’s default data store.
- Created in Oct 2024
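Once a run has finished, the stored records can be read back over the Apify API. The sketch below uses the public dataset items endpoint with the standard library only. The function names are hypothetical, and the dataset ID and API token are values you would take from your own run and account; neither appears in this README.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

API_BASE = "https://api.apify.com/v2"


def items_url(dataset_id: str, limit: Optional[int] = None) -> str:
    """Build the URL for reading a dataset's items as JSON."""
    params = {"format": "json", "clean": "true"}
    if limit is not None:
        params["limit"] = str(limit)
    safe_id = urllib.parse.quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{safe_id}/items?{urllib.parse.urlencode(params)}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download the items. Needs network access and a valid API token."""
    request = urllib.request.Request(
        items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_items("<your-dataset-id>", "<your-token>")` should return a list of records shaped like the example output.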