by liamzebedee
2714 days ago
Wow, your product is super cool! I was going to post this in an Ask HN, but maybe you would like to share instead: what is it like to architect a crawler for social media sites? With Twitter Music shut down, I was looking back on their acquihire of WeAreHunted, a music ranking service which at its core was a crawler that indexed torrents/tumblr/soundcloud to find what's up and coming. As I was pondering this, I was thinking about how they would normalise the data. My main question is: how much difficulty do you encounter indexing a social site? I can imagine Tumblr, Facebook, and other sites have a plurality of new content appearing at arbitrary intervals (posts, comments, etc.), and I don't imagine there are RSS feeds to diff here. So how would it function?
|
Basically, you have a link-rank algorithm build out your crawl frontier, and then an incremental ranking algorithm re-ranks the frontier as new links come in.
The problem is that latency is a major factor.
Many of the top social news posts start with some random user who witnesses something very interesting, and then the post shoots up rapidly.
So our goal basically has to be to index anything that's not spam and has the potential to become massive.
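To make the frontier idea concrete, here is a minimal sketch of a crawl frontier whose ranking is updated incrementally. It is not the actual system: the `CrawlFrontier` class, its method names, and the scoring rule (each inbound link adds a fixed `weight`) are all assumptions chosen for illustration, standing in for whatever link-rank signal a real crawler would use. Re-ranking is done lazily: a URL that gains a link gets a fresh heap entry, and outdated entries are skipped when popped.

```python
import heapq
import itertools


class CrawlFrontier:
    """Priority queue of URLs whose scores can be bumped incrementally."""

    def __init__(self):
        self._heap = []       # entries: (-score, tiebreak, url)
        self._score = {}      # current score of each pending URL
        self._crawled = set()
        self._counter = itertools.count()

    def add_link(self, url, weight=1.0):
        """Record one inbound link to `url`, raising its priority."""
        if url in self._crawled:
            return
        new_score = self._score.get(url, 0.0) + weight
        self._score[url] = new_score
        # Lazy re-rank: push a fresh entry; stale ones are skipped in pop().
        heapq.heappush(self._heap, (-new_score, next(self._counter), url))

    def pop(self):
        """Return the highest-scoring uncrawled URL, or None if empty."""
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if self._score.get(url) == -neg_score:
                del self._score[url]
                self._crawled.add(url)
                return url
        return None


frontier = CrawlFrontier()
frontier.add_link("https://example.com/a")
frontier.add_link("https://example.com/b")
frontier.add_link("https://example.com/b")  # second link bumps b above a
first = frontier.pop()
```

Because a bump is just another heap push, a post that suddenly picks up links moves to the front without rebuilding the queue, which matters when latency is the main constraint.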
Additionally, a lot of the old-school Google architecture applies. A lot of our infra is devoted to solving problems that would be insanely expensive to build out in the cloud.
We keep re-running the math: purchasing our infra on Amazon Web Services would cost something like 150-250k per month, but we're doing it for about 12-15k per month.
It's definitely fun to have access to this much content though.
Additionally, our customers are brilliant and we get to work with the CTOs of some very cool companies which is always fun!