It's good to see that in this day and age the government has still mastered the tried and true "print, sign, scan, email, then download via FTP" approach to file downloads.
As someone who downloaded the corpus, I can tell you that's not how it works. What you download from NIST is a Java-based Twitter HTML crawler and a list of tweet IDs; the tweets themselves you have to download directly from Twitter.
It took me a week or more to download the complete 16 million tweets.
Another problem with the corpus is that (at the time I downloaded the tweets) around 2% of the tweets in it were no longer available from Twitter because users had deleted their accounts. The longer you wait, the more tweets are going to be unavailable.
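For anyone wondering what that workflow looks like in practice, here is a rough Python sketch of the "ID list in, tweets out" loop. It is not the NIST crawler (that one is Java and scrapes HTML); `hydrate`, `missing_fraction`, `required_rate`, and the `fetch` callback are hypothetical names made up for illustration, and `fetch` stands in for whatever actually retrieves a tweet and returns `None` when it is gone.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def hydrate(
    tweet_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch each tweet by ID and separate the ones that are gone.

    `fetch` is a hypothetical callback: it returns the tweet text,
    or None if the tweet is no longer available.
    """
    found: Dict[str, str] = {}
    missing: List[str] = []
    for tweet_id in tweet_ids:
        text = fetch(tweet_id)
        if text is None:
            missing.append(tweet_id)  # deleted tweet or deleted account
        else:
            found[tweet_id] = text
    return found, missing


def missing_fraction(found: Dict[str, str], missing: List[str]) -> float:
    """Share of the ID list that could not be retrieved."""
    total = len(found) + len(missing)
    return len(missing) / total if total else 0.0


def required_rate(total_tweets: int, days: float) -> float:
    """Average tweets per second needed to finish in `days` days."""
    return total_tweets / (days * 24 * 60 * 60)
```

As a sanity check on the timing, `required_rate(16_000_000, 7)` works out to roughly 26 tweets per second sustained for the whole week, which gives a feel for why the download takes so long.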