by mbroecheler 5058 days ago
Hey,

- The data we used was crawled by Kwak et al. in 2009. We wanted a real social network dataset for the experiment, and that was the largest and most useful one we could find. Other than de-duplication, we did not modify the dataset, so the statistics reported in their paper still hold: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153....

- Do you mean: what is the overhead of pre-computing the stream edges rather than collecting the relevant streams at query time? You are right that this requires a significant amount of storage space. However, as you also pointed out, this data goes cold quickly and sits on disk only (i.e. it does not take up space in the valuable cache). What makes this efficient is the time-based vertex-centric index we build for the stream edges, which lets us quickly pull out the most recent tweets for any user. If we had to compute those at query time, we would have to traverse to each person followed, get their 10 most recent tweets, and then merge those in memory. That would be significantly more expensive, and since stream reading is probably the most frequent activity on Twitter, pre-computing it saves a lot of time at the expense of inexpensive disk storage.
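
To make the trade-off concrete, here is a rough in-memory sketch of the two read paths. It is only an illustration: the class `ToyStreamStore` and all of its method names are made up for this example and are not the API of the graph database, and plain Python lists stand in for the time-ordered stream edges and their index. It also assumes tweets arrive in increasing timestamp order.

```python
import heapq
from collections import defaultdict
from itertools import islice


class ToyStreamStore:
    """Hypothetical toy model; lists stand in for time-ordered edges."""

    def __init__(self):
        self.follows = defaultdict(set)    # user -> users they follow
        self.followers = defaultdict(set)  # user -> users following them
        self.tweets = defaultdict(list)    # author -> [(ts, text)], oldest first
        self.streams = defaultdict(list)   # reader -> [(ts, author, text)], oldest first

    def follow(self, user, target):
        self.follows[user].add(target)
        self.followers[target].add(user)

    def tweet(self, author, ts, text):
        # Assumption: ts is increasing, so appends keep every list time-ordered.
        self.tweets[author].append((ts, text))
        # Pre-computation: one "stream edge" per follower, paid at write time.
        for reader in self.followers[author]:
            self.streams[reader].append((ts, author, text))

    def read_precomputed(self, user, k=10):
        # Cheap read: the k newest entries are the tail of one ordered list.
        return list(reversed(self.streams[user][-k:]))

    def read_at_query_time(self, user, k=10):
        # Expensive read: visit every followed user, take their k newest
        # tweets (newest first), then merge all of those lists in memory.
        per_author = [
            [(ts, author, text) for ts, text in reversed(self.tweets[author][-k:])]
            for author in self.follows[user]
        ]
        return list(islice(heapq.merge(*per_author, reverse=True), k))


store = ToyStreamStore()
store.follow("alice", "bob")
store.follow("alice", "carol")
store.tweet("bob", 1, "b1")
store.tweet("carol", 2, "c1")
store.tweet("bob", 3, "b2")
newest_two = store.read_precomputed("alice", k=2)
```

Both read methods return the same entries here, but `read_at_query_time` touches every followed user on each read, while `read_precomputed` touches a single list. The price is the extra storage written in `tweet`, one entry per follower.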