Bluesky Jetstream Scraper

Pricing

$25.00/month + usage

Developed by

june

Maintained by Community

Bluesky Jetstream Scraper collects posts from Bluesky's Jetstream API. Filter by hashtags, usernames, or languages to gather targeted data. Output can include media attachments, user profiles, and reply context. Suited to social research, trend analysis, and content monitoring on the platform.


🌊 Bluesky Jetstream Scraper

The Bluesky Jetstream Scraper is an Apify Actor that collects real-time data from the Bluesky social network for analysis. It reads from Jetstream, a JSON stream derived from the AT Protocol (ATProto) firehose. You can filter posts by several criteria and control which fields appear in the output.

🔄 Jetstream vs. Crawling: This scraper uses Bluesky's Jetstream (firehose) API, which delivers a continuous stream of real-time data directly from Bluesky's servers. Traditional crawling issues many API requests to gather posts, which runs into rate limits and uses more resources. Jetstream instead exposes the full stream of content as it is created, without the limits of polling individual endpoints. This makes it well suited to large-scale data collection, trend analysis, and real-time monitoring.

⚠️ Real-Time vs. Historical: The Jetstream approach captures content as it is posted and cannot access posts from the past, so it is not suitable for historical data collection or for analyzing posts over extended periods of time. If you need content from specific time periods in the past, use a different method, such as the Bluesky Query API (with appropriate rate limiting).

📣 Platform Notice: Bluesky and its API infrastructure are still evolving. API specifications, data formats, and endpoints may change over time. We strive to keep this scraper up to date with platform changes, but occasional updates may be needed to maintain compatibility as the Bluesky ecosystem develops.


📋 Input Schema Parameters

This section describes in detail how each input parameter affects the behavior of the scraper and the resulting output.

🔍 Filtering Parameters

hashtags

  • Type: Array of strings
  • Description: A list of hashtags to filter posts by (without the # symbol)
  • Behavior: The scraper will only collect posts that contain at least one of the specified hashtags. When multiple hashtags are provided, posts matching ANY of these hashtags will be included (OR logic).
  • Example: If you set ["apify", "scraping"], the output will include all posts containing either #apify OR #scraping.

usernames

  • Type: Array of strings
  • Description: A list of Bluesky usernames to filter posts by (will be resolved to DIDs for efficient filtering)
  • Behavior: The scraper will only collect posts authored by the specified users. When multiple usernames are provided, posts from ANY of these users will be included (OR logic).
  • Example: If you set ["user1.bsky.social", "user2.bsky.social"], the output will include all posts from either user1 OR user2.
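
The description above mentions that usernames are resolved to DIDs. A minimal sketch of what such a lookup could look like, assuming the public AT Protocol method com.atproto.identity.resolveHandle served from public.api.bsky.app. The host choice and the helper names build_resolve_url and resolve_handle are illustrative assumptions; this README does not show the Actor's own resolution code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Bluesky AppView host. The Actor may use another one.
RESOLVE_ENDPOINT = (
    "https://public.api.bsky.app/xrpc/com.atproto.identity.resolveHandle"
)


def build_resolve_url(handle: str) -> str:
    """Build the lookup URL for one handle (a leading '@' is stripped)."""
    query = urllib.parse.urlencode({"handle": handle.lstrip("@")})
    return f"{RESOLVE_ENDPOINT}?{query}"


def resolve_handle(handle: str, timeout: float = 10.0) -> str:
    """Return the DID for a handle. This performs a network request."""
    with urllib.request.urlopen(build_resolve_url(handle), timeout=timeout) as resp:
        return json.load(resp)["did"]
```

Resolving once up front means the stream can be matched on the author's DID, which stays stable even if the user later changes their handle.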

languages

  • Type: Array of strings
  • Description: Languages to filter posts by (multiple selection allowed)
  • Behavior: The scraper will only collect posts in the specified languages. When multiple languages are provided, posts in ANY of these languages will be included (OR logic). If a post doesn't have a language field, the scraper can auto-detect its language (if detectLanguage is enabled).
  • Example: If you set ["en", "pt"] (English and Portuguese), the output will include all posts in either English OR Portuguese.

wantedCollections

  • Type: Array of strings
  • Description: Specific Bluesky collections to filter from Jetstream (defaults to feed posts)
  • Behavior: Controls what types of content are collected from the Bluesky firehose. Options include:
    • app.bsky.feed.post: Regular posts
    • app.bsky.feed.like: Like interactions
    • app.bsky.feed.repost: Repost interactions
    • app.bsky.graph.follow: Follow relationships
    • app.bsky.graph.block: Block relationships
    • app.bsky.actor.profile: Profile updates
  • Example: If you set ["app.bsky.feed.post", "app.bsky.feed.repost"], the output will include both original posts AND reposts.

📊 Content Inclusion Parameters

includeMedia

  • Type: Boolean
  • Description: Whether to include URLs for media attachments
  • Behavior: When set to true, the output will include media URLs from posts. When set to false, media URLs will be excluded: the mediaUrl and mediaThumbnailUrl fields will be empty, hasMedia will be false, and mediaCount will be 0.
  • Example: If set to false with a language filter of ["pt"], the output will include Portuguese-language posts but without any media URLs or media-related fields populated.

includeImages

  • Type: Boolean
  • Description: Whether to include URLs for images in the output
  • Behavior: When set to true, the output will include image URLs from posts. When set to false, image URLs will be excluded: the imageUrl field will be empty and hasImages will be false.
  • Example: If set to false, posts with images will still be included in the output, but image URLs won't be extracted or included in the result fields.

includeReplies

  • Type: Boolean
  • Description: Whether to include reply information in collected posts
  • Behavior: When set to true, the output will include information about which posts are replies, and to which posts they are replying. When set to false, this information will be excluded.
  • Example: If set to true, posts that are replies will have isReply set to true, along with replyToRoot and replyToParent fields containing the URIs of the root and parent posts.
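
To illustrate how the reply fields named above could be derived, here is a small sketch. It assumes the post record follows the app.bsky.feed.post shape, where a reply carries a reply object with root and parent references that each hold a uri. The function name reply_fields, and the choice to return no reply fields at all when includeReplies is false, are assumptions rather than the Actor's confirmed behavior.

```python
from typing import Any


def reply_fields(record: dict[str, Any], include_replies: bool = True) -> dict[str, Any]:
    """Derive isReply / replyToRoot / replyToParent from a post record."""
    if not include_replies:
        # Assumption: reply information is simply left out of the output.
        return {}
    reply = record.get("reply") or {}
    return {
        "isReply": bool(reply),
        "replyToRoot": reply.get("root", {}).get("uri"),
        "replyToParent": reply.get("parent", {}).get("uri"),
    }
```

For a top-level post, isReply comes out false and both URI fields are None.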

🗣️ Language Settings

detectLanguage

  • Type: Boolean
  • Description: Whether to automatically detect the language of posts that don't specify one
  • Behavior: When set to true, the scraper will use language detection to determine the language of posts that don't include language metadata. This is particularly useful when filtering by language. When set to false, posts without language metadata will not match any language filter.
  • Example: If filtering for Japanese posts and this is set to true, posts without explicit language metadata might still be included if they contain Japanese text.
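
The interaction between the languages filter and detectLanguage can be sketched as follows. This assumes the post record carries its declared languages in a langs list, as app.bsky.feed.post records do, and that tags are compared by primary subtag (so "pt-BR" matches "pt"). The detector argument is a hypothetical stand-in for whatever language-detection library the Actor uses, which this README does not name.

```python
from typing import Callable, Optional


def _primary(tag: str) -> str:
    """Reduce a language tag such as 'pt-BR' to its primary subtag 'pt'."""
    return tag.split("-")[0].lower()


def matches_language(
    record: dict,
    wanted: list[str],
    detect_language: bool = False,
    detector: Optional[Callable[[str], Optional[str]]] = None,
) -> bool:
    """Return True if the record passes the language filter."""
    if not wanted:
        return True  # no language filter configured
    wanted_set = {_primary(w) for w in wanted}
    langs = record.get("langs") or []
    if langs:
        return any(_primary(lang) in wanted_set for lang in langs)
    if detect_language and detector is not None:
        guess = detector(record.get("text", ""))
        return guess is not None and _primary(guess) in wanted_set
    return False  # no metadata and no detection: cannot match
```

Detection runs only when the record has no language metadata, which mirrors the behavior described above.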

👤 User Profile Settings

enrichUserProfiles

  • Type: Boolean
  • Description: Whether to fetch additional user profile information for post authors
  • Behavior: When set to true, the output will include extended information about post authors, such as their description, follower/following counts, post counts, and avatar URLs. When set to false, only basic author information (DID, handle, name) will be included.
  • Example: If set to true, each post in the output will include additional fields like authorDescription, authorFollowersCount, etc.

⏱️ Data Collection Parameters

maxPosts

  • Type: Integer
  • Description: Maximum number of posts to collect (0 for unlimited)
  • Behavior: Controls how many posts will be collected before the scraper stops. Setting it to 0 means the scraper will continue until the time limit is reached.
  • Example: If set to 100, the scraper will stop after collecting 100 posts that match the filter criteria.

timeLimit

  • Type: Integer
  • Description: Maximum time to run the scraper in minutes
  • Behavior: Controls how long the scraper will run before stopping, regardless of how many posts have been collected.
  • Example: If set to 30, the scraper will stop after 30 minutes, even if it hasn't reached the maxPosts limit.
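
The two stop conditions combine so that whichever limit is hit first ends the run. A minimal sketch of that check, where the function name should_stop and the use of a monotonic clock are illustrative assumptions:

```python
import time
from typing import Optional


def should_stop(
    collected: int,
    started_at: float,
    max_posts: int,
    time_limit_minutes: int,
    now: Optional[float] = None,
) -> bool:
    """Return True once maxPosts or timeLimit has been reached."""
    now = time.monotonic() if now is None else now
    # maxPosts == 0 means "unlimited", so only the time limit applies.
    if max_posts > 0 and collected >= max_posts:
        return True
    return (now - started_at) >= time_limit_minutes * 60
```

Passing now explicitly makes the check easy to exercise without waiting for real time to elapse.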

🔌 Connection Settings

region

  • Type: String enum ("us-east" or "us-west")
  • Description: Region for the Jetstream server
  • Behavior: Controls which regional Bluesky Jetstream server the scraper connects to. This can affect latency and potentially the volume of data received.
  • Example: If you're collecting data from the US West Coast, selecting us-west might provide lower latency.

instance

  • Type: Integer (1 or 2)
  • Description: Instance number for the Jetstream server
  • Behavior: Selects which specific Jetstream instance to connect to within the selected region.
  • Example: If experiencing connection issues with instance 1, switching to instance 2 might help.
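
As a rough illustration of how region, instance, and wantedCollections could translate into a connection, the sketch below builds a Jetstream WebSocket URL. The host pattern follows the naming of Bluesky's public Jetstream instances (for example jetstream1.us-east.bsky.network), and wantedCollections and wantedDids are Jetstream query parameters. Whether the Actor maps its inputs exactly this way is an assumption, and build_jetstream_url is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode


def build_jetstream_url(
    region: str = "us-east",
    instance: int = 1,
    wanted_collections: Optional[list[str]] = None,
    wanted_dids: Optional[list[str]] = None,
) -> str:
    """Build a Jetstream subscribe URL from the connection settings."""
    if region not in ("us-east", "us-west"):
        raise ValueError(f"unsupported region: {region!r}")
    if instance not in (1, 2):
        raise ValueError(f"unsupported instance: {instance!r}")
    # Default to feed posts, matching the documented wantedCollections default.
    collections = wanted_collections or ["app.bsky.feed.post"]
    params = [("wantedCollections", c) for c in collections]
    params += [("wantedDids", d) for d in (wanted_dids or [])]
    host = f"jetstream{instance}.{region}.bsky.network"
    return f"wss://{host}/subscribe?{urlencode(params)}"
```

Repeating the query parameter once per value lets a single connection request several collections at the same time.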

autoReconnect

  • Type: Boolean
  • Description: Whether to automatically reconnect if the connection is lost
  • Behavior: When set to true, the scraper will attempt to reconnect to Jetstream if the connection drops. When set to false, the scraper will terminate on connection loss.
  • Example: For long-running data collection jobs, setting this to true helps ensure continuous data collection despite temporary network issues.

maxRetries

  • Type: Integer
  • Description: Maximum number of reconnection attempts
  • Behavior: Controls how many times the scraper will try to reconnect before giving up.
  • Example: If set to 5, the scraper will make up to 5 reconnection attempts before terminating.
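
The interplay of autoReconnect and maxRetries can be sketched as a retry loop. The exponential backoff between attempts is an assumption added for illustration, since this README does not say how long the scraper waits between attempts. run_with_reconnect and its connect callback are hypothetical names.

```python
import time
from typing import Callable


def run_with_reconnect(
    connect: Callable[[], None],
    auto_reconnect: bool = True,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it returns normally.

    Returns the number of reconnection attempts that were needed.
    Re-raises ConnectionError when reconnecting is disabled or exhausted.
    """
    retries = 0
    while True:
        try:
            connect()
            return retries
        except ConnectionError:
            if not auto_reconnect or retries >= max_retries:
                raise
            # Assumed policy: wait 1s, 2s, 4s, ... between attempts.
            sleep(base_delay * 2 ** retries)
            retries += 1
```

Capping the attempts keeps a persistent outage from turning into an endless stream of connection requests.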

⚙️ Advanced Settings

saveCheckpoints

  • Type: Boolean
  • Description: Whether to periodically save collected data to prevent loss on errors
  • Behavior: When set to true, the scraper will periodically save collected data to disk, allowing recovery from a checkpoint if the process is interrupted.
  • Example: If set to true and the scraper crashes after collecting 400 posts, you might be able to recover 350 of them from the last checkpoint.
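
A minimal sketch of periodic checkpointing, assuming a local JSON file as the checkpoint target. The file path, the write-then-rename approach, and the helper names are illustrative assumptions, since the Actor's actual storage mechanism is not described here.

```python
import json
import os
import tempfile


def save_checkpoint(posts: list, path: str) -> None:
    """Write collected posts to a JSON file, replacing it atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(posts, handle, ensure_ascii=False)
    # os.replace swaps the file in one step, so a crash mid-write
    # cannot leave a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> list:
    """Return the posts from the last checkpoint, or an empty list."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Posts collected after the most recent save are the ones lost in a crash, which is why the example above recovers 350 of 400.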

proxy

  • Type: Object
  • Description: Proxy configuration for the scraper
  • Behavior: Controls whether and how the scraper uses Apify proxies for connections.
  • Example: Setting useApifyProxy to true allows the scraper to use Apify's proxy infrastructure, which can help avoid rate limiting.

debugMode

  • Type: Boolean
  • Description: Whether to enable detailed logging for troubleshooting
  • Behavior: When set to true, the scraper will output more detailed logs about its operation, which can help diagnose issues.
  • Example: If you're not seeing the expected output, setting this to true can provide insights into what's happening.

verboseDebug

  • Type: Boolean
  • Description: Whether to enable extremely detailed logging for message format diagnostics
  • Behavior: When set to true, the scraper will output extremely detailed logs, including raw message contents. This generates large log files.
  • Example: Useful only for advanced debugging when developing or modifying the scraper.

🎨 Customizing Output Format

The scraper allows you to customize the data fields included in the output through several parameters:

Field Selection Controls

These parameters control which data fields are included in the output:

  • includeMedia: Controls whether media URLs and related fields are included
  • includeImages: Controls whether image URLs and related fields are included
  • includeReplies: Controls whether reply information fields are included
  • enrichUserProfiles: Controls whether extended author profile fields are included

Output Format Options

On the Apify platform, you can download your dataset in several formats:

  1. JSON: The default format with complete data structure
  2. CSV: Tabular format suitable for spreadsheet applications
  3. Excel: Direct Excel file download
  4. RSS: For feed readers
  5. HTML: For web viewing

To change the download format:

  1. Navigate to the "Storage" tab in your Apify account
  2. Select the dataset from your actor run
  3. Click the "Download" dropdown menu
  4. Choose your preferred format

For customized data processing, you can also use the Apify API to retrieve the data programmatically in your preferred format.
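
For programmatic retrieval, the Apify API serves dataset items at a per-dataset endpoint that accepts a format query parameter. The sketch below only builds the request URL. The dataset ID and token are placeholders you would take from your own run and account, and dataset_items_url is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Formats accepted by the dataset items endpoint.
SUPPORTED_FORMATS = {"json", "jsonl", "csv", "xlsx", "xml", "html", "rss"}


def dataset_items_url(
    dataset_id: str, fmt: str = "json", token: Optional[str] = None
) -> str:
    """Build the URL for downloading a dataset's items in a given format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    params = {"format": fmt}
    if token:
        params["token"] = token  # needed for datasets that are not public
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client. The Excel download listed above corresponds to format=xlsx.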


🔄 Combining Filters

When multiple filter types are used together (hashtags, usernames, languages), the scraper applies AND logic between different filter types:

  • If you set both hashtags and languages, posts must match BOTH criteria (contain one of the hashtags AND be in one of the languages).
  • If you set both usernames and languages, posts must be authored by one of the specified users AND be in one of the specified languages.
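
The rule (OR within one filter type, AND across filter types) can be written as a short predicate. This is a sketch of the logic only. The post keys hashtags, authorHandle, and langs are hypothetical field names chosen for the example, not the Actor's documented output schema.

```python
from typing import Any, Iterable


def _normalize(values: Iterable[str]) -> set[str]:
    """Lowercase values and strip a leading '#' or '@' for comparison."""
    return {v.lower().lstrip("#@") for v in values}


def matches_filters(
    post: dict[str, Any],
    hashtags: Iterable[str] = (),
    usernames: Iterable[str] = (),
    languages: Iterable[str] = (),
) -> bool:
    """OR within each filter type, AND across types; empty filters match all."""
    wanted_tags = _normalize(hashtags)
    wanted_users = _normalize(usernames)
    wanted_langs = _normalize(languages)

    tag_ok = not wanted_tags or bool(wanted_tags & _normalize(post.get("hashtags", [])))
    user_ok = not wanted_users or post.get("authorHandle", "").lower() in wanted_users
    lang_ok = not wanted_langs or bool(wanted_langs & _normalize(post.get("langs", [])))
    return tag_ok and user_ok and lang_ok
```

A filter type that is left empty always passes, so leaving every filter empty accepts every post.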

⚪ Default Behavior (No Filters)

When no filter options (hashtags, usernames, languages) are selected:

  • The scraper will collect all posts from the Bluesky Jetstream without any filtering, since every post automatically matches
  • The only limits will be the maxPosts parameter and/or the timeLimit parameter
  • You'll get a diverse, unfiltered stream of Bluesky content
  • Other inclusion settings like includeMedia and includeImages will still be applied
  • Collection types will be limited to what's specified in wantedCollections (defaults to feed posts)

This approach is useful for general data collection when you want to analyze Bluesky content as a whole rather than specific topics, users, or languages.


📝 Example Scenarios
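
One possible scenario, assembled from the example values used in the parameter descriptions above: collect up to 100 English or Portuguese posts tagged #apify or #scraping within 30 minutes, with reply context and without media URLs. The parameter names come from the input schema. The specific combination of values is only an illustration, written as a Python dict and serialized to the JSON you would paste into the Actor's input.

```python
import json

# Illustrative input; adjust the values to your own use case.
actor_input = {
    "hashtags": ["apify", "scraping"],
    "languages": ["en", "pt"],
    "wantedCollections": ["app.bsky.feed.post"],
    "includeMedia": False,
    "includeReplies": True,
    "detectLanguage": True,
    "maxPosts": 100,
    "timeLimit": 30,
    "region": "us-east",
    "instance": 1,
    "autoReconnect": True,
    "maxRetries": 5,
}

input_json = json.dumps(actor_input, indent=2)
print(input_json)
```

Because hashtags and languages are both set, a post has to match one of the hashtags and one of the languages to be collected.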


🤝 Bluesky Firehose Scraping Etiquette

When using the Bluesky Jetstream (firehose), it's important to follow these ethical guidelines and best practices:

📜 Official Guidelines

  • Respect the Terms of Service: Always adhere to Bluesky's official Terms of Service and API Usage Guidelines.
  • Attribution: When publishing research or analysis based on Bluesky data, properly attribute the source.
  • Privacy Awareness: Though the data is publicly available, be mindful that users may not expect their content to be analyzed at scale.

🔧 Technical Best Practices

  • Rate Limiting: The scraper already implements rate limiting, but be cautious about running multiple instances simultaneously.
  • Efficient Filtering: Use the filtering options to collect only the data you need rather than scraping everything.
  • Connection Management: Use the autoReconnect and maxRetries settings responsibly to avoid creating excessive connection attempts.
  • Data Storage: Handle collected data securely and in compliance with relevant privacy regulations like GDPR.

🔍 Responsible Usage

  • Research Purpose: Clearly define your research or business purpose before collecting data.
  • Minimize Collection: Only collect the data fields necessary for your analysis.
  • Respect Boundaries: Avoid excessive scraping that might impact the platform's performance.
  • Consider Opt-Out: When presenting results, consider providing ways for users to opt out of having their content included.
  • Data Protection: Comply with applicable data protection laws in your jurisdiction.
  • User Privacy: Even though posts are public, respect user privacy by anonymizing data when possible.
  • Terms Changes: Regularly check for updates to Bluesky's terms as the platform is evolving.

Following these guidelines ensures ethical use of the Bluesky firehose while maintaining a positive relationship with the platform and its community.

Pricing

  • Pricing model: Rental. To use this Actor, you have to pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for the Apify platform usage.
  • Free trial: 1 day
  • Price: $25.00