| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by maxdemarzi 5503 days ago

" To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, unique that set of followers, and then count the number of uniques. It's an intense computation that potentially involves thousands of database calls and tens of millions of follower records."

Or you could use a Graph DB to solve a Graph problem.

URL -> tweeted_by -> users -> followed_by -> users

Try that on Neo4j.

1 comments

nathanmarz 5503 days ago

To do that query on Neo4j, you would need to store in memory on one machine the entire Twitter social graph, all the people who tweeted every URL ever tweeted on Twitter, and then do the computation on a single thread. Neo4j can't handle that scale.

The reach computation on Storm does everything in parallel (across however many machines you need to scale the computation) and gets data using distributed key/value databases (Riak, in our case).

link

herdrick 5503 days ago

Nathan, we'd love to hear your postmortem on BackType's experience with Neo4J, and how Sphinx is turning out.

link

nathanmarz 5503 days ago

We used Neo4j over a year ago, and it was pretty unstable when we used it. The database files were getting corrupted pretty frequently (a few times a week), so it just didn't work out for us. Ultimately it was for a small feature, so rather than continue to struggle with Neo4j we just reimplemented the feature using Sphinx. Like I said, that was a long time ago and Neo4j may have gotten a lot better since then.

link

herdrick 5503 days ago

OK, thanks! That's valuable info.

link