by Smerity
3552 days ago
This is the tactic used by the CNN / Daily Mail dataset[1] released by Google DeepMind. In that situation you want everyone to have the same, original file, but the contents at a URL may be updated or disappear. So the original URLs are recorded, but the pages themselves are retrieved from the Wayback Machine. Even then, a few of the pages are still essentially lost, or at least I was never able to retrieve them.

Kyunghyun Cho hosts a copy of the processed data on his site at NYU, which may be less likely to receive legal requests than Google or a similar commercial company hosting it.

Distribution of datasets is going to be a continued pain point for machine learning in the future. No one wants to be the legal guinea pig for exactly what constitutes fair use in ML.

[1]: https://github.com/deepmind/rc-data
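To make the "record the original URL, fetch from the Wayback Machine" idea concrete, here is a minimal Python sketch. It is not the rc-data scripts: the function names, the snapshot timestamp, and the idea of shipping a SHA-256 per page in a manifest are all my own assumptions for illustration. The only external fact it relies on is the Wayback Machine's `https://web.archive.org/web/<timestamp>/<url>` address scheme.

```python
import hashlib
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a snapshot of original_url.

    timestamp is a YYYYMMDDhhmmss string. The archive serves the capture
    closest to that time, so this is only as stable as the archive itself.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_manifest(data: bytes, expected_sha256: str) -> bool:
    """Check downloaded bytes against the digest recorded in a (hypothetical) manifest."""
    return sha256_hex(data) == expected_sha256.lower()


def fetch_snapshot(original_url: str, timestamp: str, timeout: float = 30.0) -> bytes:
    """Download the archived copy of a page. Raises on network or HTTP errors."""
    with urllib.request.urlopen(wayback_url(original_url, timestamp), timeout=timeout) as resp:
        return resp.read()
```

A dataset release would then ship only the list of (URL, timestamp, digest) entries, and each user would call `fetch_snapshot` and `matches_manifest` locally. Any page that fails the check or cannot be fetched is one of the "essentially lost" ones.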
P.S.: Also, archive.org retroactively removes (or delists?) site content when blocked by robots.txt, as domain parkers often do, so I could see that causing trouble down the line. https://archive.org/post/1019415/retroactive-robotstxt-remov...