| HN Mirror



	by tessro 5403 days ago
	It's good to see that in this day and age the government has still mastered the tried and true "print, sign, scan, email, then download via FTP" approach to file downloads.

2 comments

matt4711 5403 days ago

As someone who downloaded the corpus, I can tell you that's not how it works. What you download from NIST is a Java crawler for Twitter's HTML pages and a list of tweet IDs, which you then have to fetch directly from Twitter yourself.

It took me a week or more to download the complete 16 million tweets.

Another problem with the corpus is that (at the time I downloaded the tweets) around 2% of the tweets in it were no longer available from Twitter because users had deleted their accounts. The longer you wait, the more tweets are going to be unavailable.
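
To make the workflow concrete, here is a minimal sketch of fetching tweets from an ID list while keeping track of the ones that are gone. This is not the NIST crawler: `rehydrate`, `missing_fraction`, and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever actually retrieves a tweet, assumed here to return `None` when the tweet is unavailable.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def rehydrate(
    tweet_ids: Iterable[int],
    fetch: Callable[[int], Optional[str]],
) -> Tuple[Dict[int, str], List[int]]:
    """Fetch each tweet by ID; collect the IDs that can no longer be retrieved.

    `fetch` is a placeholder for the real download step (an assumption);
    it is expected to return the tweet text, or None if the tweet is gone.
    """
    found: Dict[int, str] = {}
    missing: List[int] = []
    for tweet_id in tweet_ids:
        text = fetch(tweet_id)
        if text is None:
            missing.append(tweet_id)
        else:
            found[tweet_id] = text
    return found, missing


def missing_fraction(found: Dict[int, str], missing: List[int]) -> float:
    """Share of requested tweets that were unavailable (0.0 for an empty list)."""
    total = len(found) + len(missing)
    return len(missing) / total if total else 0.0
```

Logging the missing IDs along with the download date at least lets two people compare which subset of the corpus each of them actually has.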


Someone 5403 days ago

It is worse. "in particular you agree ... to delete tweets that are marked deleted in the future"

Because of that, I do not see how anybody can even think about using this dataset for research.

If you keep all data, you are in breach of the license. If you do not, you are guaranteeing that you cannot reproduce your results in the future.

Also, there is the practical side. I would guess it takes a week to check for deleted tweets, too, so how are you going to comply with that clause?
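
For a rough sense of scale, here is a small sketch of what a compliance pass might look like, plus the request rate implied by the figures above (16 million tweets, one week). The names `purge_deleted`, `required_rate`, and `still_available` are hypothetical, and `still_available` is a stand-in for whatever check against Twitter you would actually run.

```python
from typing import Callable, Dict, List

SECONDS_PER_DAY = 86_400


def purge_deleted(
    corpus: Dict[int, str],
    still_available: Callable[[int], bool],
) -> List[int]:
    """Remove tweets that are no longer available; return the removed IDs.

    `still_available` is a placeholder for the real availability check
    (an assumption). The corpus is modified in place.
    """
    removed = [tweet_id for tweet_id in corpus if not still_available(tweet_id)]
    for tweet_id in removed:
        del corpus[tweet_id]
    return removed


def required_rate(n_tweets: int, days: float) -> float:
    """Average checks per second needed to cover n_tweets in the given days."""
    return n_tweets / (days * SECONDS_PER_DAY)
```

Under those assumptions, `required_rate(16_000_000, 7)` comes out at roughly 26 checks per second, sustained for the whole week, for every single pass over the corpus.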


Luyt 5403 days ago

I miss the 'fax' step ;-)
