TikTok Scraper
4 days trial then $45.00/month - No credit card required now
Extract data from TikTok videos, hashtags, and users. Use URLs or search queries to scrape TikTok profiles, hashtags, posts, URLs, shares, followers, hearts, names, video, and music-related data. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.
Using the ApifyClient SDK, we are seeing an issue in which the value of the total field returned by the dataset.listItems() method does not match the number of posts actually returned for paginated results.
The following actor options:
    const actorOptions = {
      profiles: [userId],
      resultsPerPage: maxPosts,
      excludePinnedPosts: true,
      oldestPostDate: undefined,
      proxyCountryCode: "None",
      waitSecs: 60 * 60, // 1 hour
    };
    const apifyRunResult = await apifyClient
      .actor(apifyActor)
      .call(actorOptions);
    const dataset = await apifyClient.dataset(
      apifyRunResult.defaultDatasetId,
    );
    const rawResponse = await dataset.listItems({
      limit, // page size
      offset, // item offset
    });
    ...
    const {items, count, total} = rawResponse;
This produces inconsistent results. Sometimes the total value reflects the actual count, and sometimes it reports a number like 30 even though more than 30 items are returned in total.
The total value is important for determining when pagination has completed.
Hi! Thanks for your patience.
I've tried to reproduce it but couldn't; I always get total equal to the exact number of dataset items. For reference, here is my reproduction:
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({
      token: 'mytoken',
    });

    const actorOptions = {
      profiles: ["muslim"],
      resultsPerPage: 1000,
      excludePinnedPosts: true,
      oldestPostDate: undefined,
      proxyCountryCode: "None",
      waitSecs: 60 * 60, // 1 hour
    };
    const apifyRunResult = await client
      .actor('GdWCkxBtKWOsKjdch')
      .call(actorOptions);
    const dataset = client.dataset(
      apifyRunResult.defaultDatasetId,
    );

    const itemCount = await dataset.get().then((response) => response.itemCount);
    for (let i = 0; i < itemCount; i += 30) {
      const rawResponses = await dataset.listItems({
        limit: 30, // page size
        offset: i, // item offset
      });
      if (rawResponses.total !== itemCount) {
        process.exit(1);
      }
      console.log(rawResponses.total);
    }
Could you send me your workflow's exact code (with tokens redacted) so I can check further? It could be that, right after the run finishes, the dataset count hasn't had time to update, so this would be a rare race condition.
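If the count really does lag behind the end of the run, one possible guard is to poll the dataset metadata until itemCount stops changing before trusting it. The following is only a sketch under that assumption: sleep and waitForStableItemCount are hypothetical helper names, the interval and attempt limits are arbitrary, and dataset is assumed to be any object whose get() resolves to an object with an itemCount field, as in the reproduction above.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls dataset.get() until itemCount is unchanged for `stableReads`
// consecutive reads, or until `maxAttempts` reads have been made.
// Assumption: dataset.get() resolves to an object with an itemCount field.
async function waitForStableItemCount(
  dataset,
  { intervalMs = 2000, stableReads = 2, maxAttempts = 10 } = {},
) {
  let last;
  let streak = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { itemCount } = await dataset.get();
    // Extend the streak when the count repeats, otherwise restart it.
    streak = itemCount === last ? streak + 1 : 1;
    last = itemCount;
    if (streak >= stableReads) return itemCount;
    await sleep(intervalMs);
  }
  // Best effort: the count never settled within maxAttempts reads.
  return last;
}
```

A stable reading is not proof that the count is final; it only reduces the chance of reading it mid-update.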
Hi, thanks for looking into this. I wasn't aware of the approach you used to get the itemCount; that could be a useful workaround for us.
Here is the code we use to paginate the results. As you can see, we took another approach to work around the issue we are having with the total property. However, the problem we were having previously is that the total value would sometimes change across multiple listItems() calls.
    export const paginateFeedDataset = async (dataset, maxPosts, log, stats) => {
      let hasMore = true;
      const limit = 20; // apify api page size
      let offset = 0; // the offset of the next page
      let total = 0;
      let totalItemsFetched = 0;
      const items = [];

      while (hasMore) {
        const rawResponse = await dataset.listItems({
          limit, // page size
          offset, // item offset
        });
        const {items: pageItems, count = 0} = rawResponse;

        // not currently in use
        total = rawResponse.total;

        log.info(`Fetched batch of ${pageItems.length} posts from Apify.`);

        totalItemsFetched += count;
        offset = totalItemsFetched;
        items.push(...pageItems);

        hasMore = totalItemsFetched < maxPosts && count > 0;
      }
      // Check if account is private
      if (items.length === 1) {
        if (items[0]?.authorMeta?.privateAccount) {
          const msg = `${items[0].authorMeta.name} is a private account.`;
          // log.error(msg);
          throw new Error(msg);
        }
      }
      stats.count("total", totalItemsFetched);
      return items;
    };
The initial call to the Apify client is made using this snippet. I am not able to post the entirety of our implementation, but let me know if there is any other key info you need:
    const actorOptions = {
      profiles: [userId],
      resultsPerPage: maxPosts,
      excludePinnedPosts: false,
      oldestPostDate: dateString,
      proxyCountryCode: "None",
      waitSecs: 60 * 60 * 24, // 24 hours
    };

    // get the user's feed. apify will return the history of maxQueries posts, starting at most recent.
    const apifyRunResult = await apifyClient
      .actor(apifyActor)
      .call(actorOptions);
Hi! Again, thanks for your patience!
I've managed to reproduce it and forwarded it to our tooling team. In the meantime, I'd recommend using the endpoint you said you could use as a workaround and, if you can, adding a delay before first querying it (like sleep(5000) to wait for 5 seconds) so that the servers have time to update the count.
UPD: the tooling team said there is a known lag in the update of this count and recommends waiting 10 seconds before querying it for now.
UPD2: The platform team said that "Updates to dataset object are throttled in API. So dataset stats may change even after actor has finished its run." and apparently they won't fix it. So I'd recommend not relying on the total count too much or, again, waiting ~10 seconds before querying for it.
Let me know if it's still a problem for you by reopening the issue.
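The two recommendations can be combined in one helper: wait before the first query, then page through the dataset without consulting total at all, stopping on the first empty page. This is a minimal sketch, not an official client feature: sleep and listAllItems are hypothetical names, the 10-second default comes from the advice above, and dataset is assumed to expose listItems({ limit, offset }) resolving to an object with an items array.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetches up to `maxItems` items page by page, ignoring `total` entirely.
// Assumption: dataset.listItems({ limit, offset }) resolves to { items: [...] }.
async function listAllItems(
  dataset,
  { pageSize = 20, maxItems = Infinity, settleMs = 10_000 } = {},
) {
  // Give the platform time to settle before the first query.
  await sleep(settleMs);

  const items = [];
  let offset = 0;
  while (items.length < maxItems) {
    // Never request more than is still needed to reach maxItems.
    const limit = Math.min(pageSize, maxItems - items.length);
    const page = await dataset.listItems({ limit, offset });
    if (page.items.length === 0) break; // an empty page marks the end
    items.push(...page.items);
    offset += page.items.length;
  }
  return items;
}
```

Stopping on an empty page costs one extra request compared with stopping on a short page, but it does not depend on every page being full.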
Actor Metrics
1k monthly users
134 stars
>99% runs succeeded
5.8 days response time
Created in Sep 2021
Modified 8 days ago