Hacker News
Show HN: I scraped 3.2B TikTok profiles and 9B posts to build this search engine (app.seeksocial.io)
22 points by IWantAllTheData 708 days ago
B = Billion

Aiming to scrape 1 trillion TikTok posts by the end of this year.

7 comments

Wow, this is pretty amazing... I imagine your proxy bill must be pretty huge! I always wondered how companies like Clearview scraped Instagram etc. at scale. Do you add a user to a queue, get all of that user's posts, then add everyone they follow and everyone following them to the queue, and repeat? With Twitter, I know from experience that you can predict what the next snowflake IDs will be, so in theory you could enumerate the whole site. If I recall correctly, TikTok has a similar ID scheme, but I think people weren't able to figure out what some of the last bits represented.
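To make the queue idea in this question concrete, here is a minimal sketch of a breadth-first crawl over a follower graph. It is not the OP's confirmed method: `FAKE_GRAPH`, `fetch_profile`, and `crawl` are hypothetical names, and the "fetch" is an in-memory lookup rather than a network request.

```python
from collections import deque

# Toy stand-in for a social graph: user -> (posts, following, followers).
# Hypothetical data; a real crawler would fetch this over the network.
FAKE_GRAPH = {
    "alice": (["a1", "a2"], ["bob"], ["carol"]),
    "bob": (["b1"], ["alice", "dave"], []),
    "carol": ([], ["alice"], []),
    "dave": (["d1"], [], ["bob"]),
}


def fetch_profile(user):
    """Hypothetical fetcher: returns (posts, following, followers) for a user."""
    return FAKE_GRAPH.get(user, ([], [], []))


def crawl(seed, max_users=1000):
    """Breadth-first crawl: visit a user, keep their posts, enqueue neighbours."""
    queue = deque([seed])
    seen = {seed}  # users already queued, so nobody is visited twice
    posts_by_user = {}
    while queue and len(posts_by_user) < max_users:
        user = queue.popleft()
        posts, following, followers = fetch_profile(user)
        posts_by_user[user] = posts
        for neighbour in following + followers:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return posts_by_user


result = crawl("alice")
```

The `seen` set is what keeps the loop from revisiting accounts that follow each other, and `max_users` caps the crawl so it terminates on a large graph.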
Thank you. Yes, the proxy bill is quite high!

Yes, you are right: the ID enumeration method doesn't work with TikTok.
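For context on the Twitter snowflake scheme mentioned in the question, here is a small sketch of the publicly documented layout: a 41-bit millisecond timestamp (offset from a custom epoch), 5 bits of datacenter ID, 5 bits of worker ID, and a 12-bit sequence number. The helper names `make_snowflake` and `decode_snowflake` are made up for illustration, and nothing here describes TikTok's ID format.

```python
TWITTER_EPOCH_MS = 1288834974657  # custom epoch used by Twitter snowflake IDs


def make_snowflake(timestamp_ms, datacenter, worker, sequence):
    """Pack the four fields into a single 64-bit integer ID."""
    return (
        ((timestamp_ms - TWITTER_EPOCH_MS) << 22)
        | (datacenter << 17)
        | (worker << 12)
        | sequence
    )


def decode_snowflake(snowflake):
    """Split a snowflake ID back into its fields."""
    return {
        "timestamp_ms": (snowflake >> 22) + TWITTER_EPOCH_MS,
        "datacenter": (snowflake >> 17) & 0x1F,  # 5 bits
        "worker": (snowflake >> 12) & 0x1F,      # 5 bits
        "sequence": snowflake & 0xFFF,           # 12 bits
    }


example_id = make_snowflake(1700000000000, datacenter=3, worker=7, sequence=42)
fields = decode_snowflake(example_id)
```

Because the timestamp occupies the high bits, IDs sort by creation time, and the remaining 22 bits form a comparatively small space for any given millisecond. That is the property that makes guessing nearby IDs conceivable when the low bits are understood.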

Filtering works pretty well, and the design is well executed. For those who complain about scraping, if this is public data, as the OP mentioned, then I don't see how it is different from Google.

A clean search would benefit the creators and give them more visibility. The first thing I had in mind was searching by keyword and number of followers to see which creators would suit a startup looking to use influencer marketing. Imagine doing that manually on TikTok or through Google.

I appreciate the positive comment. Thanks!

Fun fact: I did the UI design, frontend/backend web development, database, servers, scraping and everything else by myself. It took me a few months, but I guess it's worth it.

Maybe this is because I don't use TikTok, but I don't know what "Eng. Rate (%)" means, so I'm not sure that's something you should appreciate.
Thanks for letting me know. I will choose better wording.

Eng. Rate (%) is Engagement Rate, a common metric marketers look at when hiring an influencer.
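As a rough illustration of the metric, one common definition divides total interactions by follower count. The site's exact formula isn't stated in this thread, and other definitions divide by views instead, so treat the function and numbers below as assumptions.

```python
def engagement_rate(likes, comments, shares, followers):
    """Engagement rate as a percentage of followers (one common definition)."""
    if followers <= 0:
        return 0.0  # avoid dividing by zero for empty or invalid accounts
    return 100 * (likes + comments + shares) / followers


# Hypothetical account: 1,000 total interactions on 20,000 followers.
rate = engagement_rate(likes=800, comments=150, shares=50, followers=20000)
print(f"{rate:.1f}%")  # 5.0%
```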

This is awesome! Quick questions:

How do you handle the massive data scale and keep it all up-to-date? Have you faced any ethical challenges with scraping so much public data?

Could you share how you managed to scrape so much data? I'm especially curious about how you got around rate limits etc.
I use proxies.
Pretty cool. I always wanted to do a large-scale scraping project like this but never really got around to it.

What's your tech stack?

All the scrapers are in Python, but I'm slowly porting them to Golang, which I find much faster.

The website is Next.js.

I'm here for this, if only to highlight the amount of data leaked willingly. That said, is there a breach of the terms of service in doing this?
All the data was scraped without creating a TikTok account, i.e. it is public data. Google.com has scraped more TikTok profiles and videos than I have.

In fact, one of the ways I did it was by scraping Google search results themselves.

It may not be quite that simple; see hiQ Labs v. LinkedIn[1].

[1] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

That's not about public data. Reference [7] talked about creating fake accounts to scrape.

> hiQ had prevailed on the Computer Fraud and Abuse Act (CFAA) “unauthorized access” issue related to public website data but was facing a ruling that it had breached LinkedIn’s User Agreement due to its scraping and creation of fake accounts (subject to its equitable defenses).

https://natlawreview.com/article/hiq-and-linkedin-reach-prop...

So much useful stuff in the comments. Thank you, will give this a read as well.
Will check it out, thanks. I'm always ready to take down the site if TikTok reaches out to me.
There will be some backlash for this; it can seem nefarious when presented this way. That said, can we see your work?
I didn't even think this could be controversial before reading the comments. I always assume what I post on social media is public.

Do others have different assumptions? Why would people get angry at a random person collecting data they published?

Yes, I did receive some backlash, which is understandable. However, this is a very useful tool for content creators to find out what's trending on TikTok. It is also used by brands to analyze market sentiment about them.

I can't share the specifics, but I can explain the methodology in private. My contact details are on the website.