by dave_sullivan 3411 days ago
This is awesome and much needed.

Just throwing it out there, but would you consider making a dump of the data you scraped that could be used by data scientists? Maybe as a torrent or something like that? Data about movies and what people say about them could form the basis of a lot of NLP projects.

What other big datasets are there for forum post text data? The reddit dataset most immediately comes to mind, and I've also seen a similar one for HN comments. Any others?

ArchiveTeam's web archives will be available to everyone without restrictions or profit as usual.
Well, thanks to your comment, I just found out ArchiveTeam exists.

Thank you!

Where can you download the boards?
Yeah that's definitely something I could do. Are there any proposed projects you know of that could benefit from this type of data?
Everything from simple sentiment analysis, to archive.org, to another mirror. I hope that does not discourage you from releasing the data.

Edit: I see the other comment about ArchiveTeam already collecting and releasing this data for free in an open format. I think that will be a good first source as well.

I think the priority was to re-enable discussion of new movies and TV shows, which would be useful moving forward. But maybe they could make an API.
I'd love to explore how this data could be used to enhance recommender systems.
AFAIK, Jinni is doing exactly that.