by matt4711 5404 days ago
As someone who downloaded the corpus, I can tell you that's not how it works. What you download from NIST is a Java-based Twitter HTML crawler and a list of tweet IDs; the tweets themselves you have to fetch directly from Twitter.

It took me a week or more to download the complete 16 million tweets.

Another problem with the corpus is that, at the time I downloaded the tweets, around 2% of them were no longer available from Twitter because users had deleted their accounts. The longer you wait, the more tweets are going to be unavailable.
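
To put the figures quoted above in perspective, here is a rough back-of-the-envelope sketch. It assumes exactly 7 days of downloading (the comment says "a week or more", so the real average rate was at most this) and takes "around 2%" as exactly 2%; the variable names are only illustrative.

```python
# Back-of-the-envelope numbers taken from the comment above.
TOTAL_TWEETS = 16_000_000      # corpus size quoted in the comment
DOWNLOAD_DAYS = 7              # assumption: "a week or more" taken as exactly 7 days
UNAVAILABLE_FRACTION = 0.02    # assumption: "around 2%" taken as exactly 2%

seconds = DOWNLOAD_DAYS * 24 * 60 * 60
tweets_per_second = TOTAL_TWEETS / seconds
missing_tweets = round(TOTAL_TWEETS * UNAVAILABLE_FRACTION)

print(f"average fetch rate: {tweets_per_second:.1f} tweets/s")
print(f"tweets already unavailable: {missing_tweets:,}")
```

Under those assumptions the crawl averages roughly 26 tweets per second, and about 320,000 tweets were already missing at download time.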

1 comment

It is worse. "in particular you agree ... to delete tweets that are marked deleted in the future"

Because of that, I do not see how anybody can even think about using this dataset for research.

If you keep all data, you are in breach of the license. If you do not, you are guaranteeing that you cannot reproduce your results in the future.

Also, there is the practical side. I would guess it takes a week to check for deleted tweets, too, so how are you going to comply with that clause?
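
A minimal sketch of what complying with that clause might look like on a local copy, and why it hurts reproducibility. Everything here is hypothetical: the tweet IDs and texts are made up, `prune_deleted` is an illustrative helper, and how you would actually obtain the list of deleted IDs is not specified anywhere in the thread.

```python
def prune_deleted(corpus, deleted_ids):
    """Return a copy of `corpus` without the tweets whose IDs are marked deleted."""
    deleted = set(deleted_ids)
    return {tid: text for tid, text in corpus.items() if tid not in deleted}


# Hypothetical local corpus: tweet ID -> tweet text.
corpus = {101: "first tweet", 102: "second tweet", 103: "third tweet"}

# Hypothetical outcome of re-checking every ID against Twitter.
deleted_ids = [102]

pruned = prune_deleted(corpus, deleted_ids)

print(sorted(pruned))         # IDs that may still be kept under the license
print(len(corpus) - len(pruned), "tweet(s) removed")
```

Once the pruned copy replaces the original, any experiment that was run on the full corpus can no longer be rerun on identical data, which is exactly the conflict described above.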