Linkedin Posts Informations Scraper

3 days trial then $30.00/month - No credit card required now

saswave/linkedin-posts-informations-parser

Scrape LinkedIn posts from LinkedIn post search results, a post URL, or a LinkedIn member profile. Supports advanced LinkedIn search filters. Extract post data at scale.

OP

Broken UTF-8 encoding

Closed

openjoy opened this issue
a month ago

Hello. We found another issue, and it's a bit weird: the text encoding of post contents is sometimes wrong. Some Unicode characters are represented incorrectly, mostly emojis but also some punctuation marks and even non-breaking spaces. Not every Unicode character is affected, though; some emojis, for example, come through correctly. We tried several tools and libraries to fix the encoding, without success. The broken characters are already present in the Apify datasets, so our guess is that the issue is somewhere in the actor (or the surrounding libraries/infrastructure). Could you please take a look?

Example input:

{
  "cookies": [...],
  "days_since_post": 14,
  "max_posts": 0,
  "url_search": "https://www.linkedin.com/in/danielmoka/"
}

Example post: https://www.linkedin.com/posts/danielmoka_50-off-black-friday-deal-on-learning-activity-7267795633724407808-epJX

What was scraped (copied from APIFY web UI but we see the same picture from other tools): What you’ll get: • 4+ hours of hands-on 𝐯𝐢𝐝𝐞𝐨 𝐭𝐮𝐭𝐨𝐫𝐢𝐚𝐥𝐬 on TDD • A 𝐓𝐃𝐃 𝐞-𝐛𝐨𝐨𝐤 packed with 10+ years of experience • Pro tips on mastering 𝐭𝐞𝐬𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐫𝐞𝐟𝐚𝐜𝐭𝐨𝐫𝐢𝐧𝐠 • 3 𝐫𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝 projects written in C#/.NET

saswave

Added to our to-do list for tomorrow morning. We have noticed, though, that text saved in an Apify dataset doesn't always have the same encoding as what we print to the logs before saving (if that makes sense to you).

We will check whether the problem is in our code or on the Apify side when we push to storage.

OP

openjoy

a month ago

Thank you for the quick response, as always. I understand that the encoding can get broken in various places. Admittedly, we haven't checked the run logs to investigate. In the end, of course, we just want to grab a dataset file, so I hope this can be fixed. I don't think we had this issue with other actors, but that could be because the results there stayed within the ASCII charset.

If it's any help, one low-level example of broken encoding: This character "♻️" was encoded as C3 A2 C2 99 C2 BB C3 AF C2 B8 C2 8F instead of E2 99 BB EF B8 8F. My not very educated guess is that this could be double-encoding but I don't think this explains all the symptoms.
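The byte sequences quoted above can be reproduced with a short sketch. It assumes the intermediate misreading was Latin-1 (ISO-8859-1), which is an inference from the bytes and not something the actor's code confirms; the helper name hex_bytes is made up for illustration.

```python
# The character from the report: U+267B followed by variation selector U+FE0F.
original = "\u267b\ufe0f"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# Correct UTF-8 encoding of the character.
correct = original.encode("utf-8")

# Assumed failure mode: the UTF-8 bytes are misread as Latin-1
# (one character per byte) and the result is encoded as UTF-8 again.
double_encoded = correct.decode("latin-1").encode("utf-8")

print(hex_bytes(correct))         # E2 99 BB EF B8 8F
print(hex_bytes(double_encoded))  # C3 A2 C2 99 C2 BB C3 AF C2 B8 C2 8F
```

Both lines match the sequences in the report. A Windows-1252 misreading would not, since it maps byte 0x99 to U+2122, which encodes as E2 84 A2.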

OP

openjoy

a month ago

I just checked and it's indeed double-encoding. We can probably do some post-processing on our side but it's better to fix the root cause of course.
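A minimal sketch of what such post-processing might look like, assuming the double encoding went through Latin-1 and that only some characters in a post are affected. The function names are hypothetical, and this is not the actor's code. It repairs only runs of characters in the U+0080..U+00FF range that decode cleanly as UTF-8 and leaves everything else alone.

```python
import re

# A UTF-8 sequence misread as Latin-1 becomes a run of characters in
# U+0080..U+00FF, so only such runs are candidates for repair.
_SUSPECT_RUN = re.compile("[\u0080-\u00ff]+")


def _repair_run(match: re.Match) -> str:
    run = match.group(0)
    try:
        # Undo the assumed Latin-1 misreading, then decode as UTF-8.
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 once reversed: most likely genuine text
        # such as "é" or a real non-breaking space, so keep it.
        return run


def fix_double_encoding(text: str) -> str:
    """Repair double-encoded runs; correctly encoded text is untouched."""
    return _SUSPECT_RUN.sub(_repair_run, text)


# Example: one broken emoji, one intact emoji, one genuine Latin-1 letter.
broken = "\u267b\ufe0f".encode("utf-8").decode("latin-1")
mixed = f"recycle {broken} ok \U0001F600 caf\u00e9"
print(fix_double_encoding(mixed) == "recycle \u267b\ufe0f ok \U0001F600 caf\u00e9")
```

The sketch has known limits: a genuine Latin-1 character directly adjacent to a broken run prevents that run from being repaired, and genuine text that happens to look like mojibake (for example a literal "Ã©") would be rewritten.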

saswave

We have updated the actor; please give it a try.

The problem was probably related to the way we were decoding LinkedIn text content. We removed the decoding step and now return what LinkedIn returns.

OP

openjoy

a month ago

Thank you for looking into it. I tried re-running one of the tasks with the latest build (0.0.119) and, unfortunately, the issue still persists, at about the same level as before.

saswave

If that didn't help, there isn't much more we can do.

At this point we return the content exactly as LinkedIn returns it to us.

I know the encoding in the dataset is not always the same as the data we initially wanted to save (we faced this kind of issue with another actor).

Do you want us to handle Unicode cleaning (i.e. drop out-of-scope characters and emojis)?

OP

openjoy

a month ago

My assumption was that this could be a configuration issue in the HTTP client, the headless browser (if that's how this works), the crawling library, etc. The content returned from LinkedIn to the browser seems to be correct (I double-checked the binary representation), so if you're saying the response was incorrect, it could be worth investigating the differences in how the data is requested. Or, like you said, if the issue is with the dataset, then maybe the Apify devs can suggest something.
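To illustrate the kind of client-side configuration issue meant here: some HTTP clients are known to fall back to ISO-8859-1 when a text response declares no charset, which would turn UTF-8 bytes into exactly this sort of mojibake. Whether the actor's stack does this is unknown. The sketch below is a hypothetical decoding helper (decode_body and declared_charset are made-up names) that uses the declared charset when there is one and otherwise assumes UTF-8.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def decode_body(raw: bytes, content_type: Optional[str]) -> str:
    """Decode a response body, assuming UTF-8 when no charset is declared."""
    charset = declared_charset(content_type) or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 as well.
        return raw.decode("utf-8", errors="replace")


body = "\u267b\ufe0f".encode("utf-8")
print(decode_body(body, "application/json") == "\u267b\ufe0f")
```

The UTF-8 fallback is a design choice for this sketch, not a universal rule; it fits JSON APIs, where UTF-8 is the expected encoding.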

In any case, while this is not ideal, we can implement an ugly workaround on our side. Please don't filter out the content, as the data is still there. Also, I was wrong to call the encoding "broken". It's perfectly valid UTF-8, just with some double encoding here and there. Filtering it would be as difficult as fixing the data.
