by smcin 557 days ago
> Nothing stopping you from releasing the raw dataset and calling it a success!

Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. Do you just want sentiment on a specific topic (e.g. vaccination, German energy supplies, German govt approval), or quantitative predictions? Start with something easy.

> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.

Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?

> and yeah, the web scraping part is still the worst.

Sounds wrong. OP, fix your scraping (unless it was anti-AI heuristics that kept breaking it, which I doubt since it's Tagesschau). But Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
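
To make the RSS suggestion concrete, here is a rough sketch of reading a feed and keeping only items inside a date range like the "Apr-Aug 2022" example, instead of hunting for links by hand. It assumes a plain RSS 2.0 layout (item/title/link/pubDate); the sample XML, the example.org links, and the function names are made up for illustration, and I have not checked what Tagesschau's feeds actually contain.

```python
import xml.etree.ElementTree as ET
from datetime import date
from email.utils import parsedate_to_datetime

# Hypothetical sample in standard RSS 2.0 shape; a real feed's text
# (fetched separately) would be passed in the same way.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Energy supply debate continues</title>
      <link>https://example.org/news/energy-101</link>
      <pubDate>Tue, 12 Apr 2022 08:30:00 +0200</pubDate>
    </item>
    <item>
      <title>Vaccination campaign update</title>
      <link>https://example.org/news/vaccination-202</link>
      <pubDate>Mon, 03 Jan 2022 17:00:00 +0100</pubDate>
    </item>
    <item>
      <title>Government approval poll</title>
      <link>https://example.org/news/poll-303</link>
      <pubDate>Fri, 15 Jul 2022 12:00:00 +0200</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of dicts (title, link, published) for each <item>."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 format, which email.utils parses.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


def filter_by_date(items, start, end):
    """Keep items whose publication date falls within [start, end]."""
    return [
        it for it in items
        if it["published"] is not None
        and start <= it["published"].date() <= end
    ]


if __name__ == "__main__":
    in_range = filter_by_date(parse_items(SAMPLE_RSS),
                              date(2022, 4, 1), date(2022, 8, 31))
    for it in in_range:
        print(it["published"].date(), it["title"], it["link"])
```

One caveat: RSS feeds usually only carry recent items, so this would help with ongoing collection rather than backfilling old dates.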

Compare to: Kaggle Datasets "10k German News Articles for topic classification", Schabus, Skowron, Trapp, SIGIR 2017 [https://www.kaggle.com/datasets/abhishek/10k-german-news-art...]


I'll put a shoutout for https://zenodo.org/ and https://figshare.com/ as places to put your data, where you'll get a DOI and can let someone that's not a company look after hosting and backing it up. Zenodo is hosted as long as CERN is around (is the promise) and figshare is backed by the CLOCKSS archive (multiple geographically distributed universities).
Right.

Google acquired Kaggle in 2017, and Appen acquired Figure Eight (formerly CrowdFlower) in 2019. Both used to be open-source-friendly places to post datasets for useful comments/analyses/crowdsourced hacking, generally without heavy and restrictive license terms. (There is also still the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/.) Kaggle may still be; just beware of the following:

Kaggle at some point began silently disappearing some (commercial) datasets from useful old competitions (such as dunnhumby's Shopping Challenge 2011 [0], even though it was anonymized and only had three features). So you can't rely on the more commercial datasets being around to cite and for replicability.

Also, according to [1] "you can be banned on Kaggle without any warnings or reasons, all your kernels and datasets will became inaccessible even for downloading for yourself and support will not answer you for weeks (if ever)". Usually, IME, what I'd heard is that bans are on (AI-based) suspicion of cheating (or using multiple accounts to bypass submission limits, or collusion between teams on submissions), or post-2018 gaming and account-warming/transfer to boost rankings. But the AI might produce false positives, and it's reportedly nearly impossible to reach live human support.

Kaggle added DOIs in 2019 [2], at least for academic datasets, though not by default.

[0]: https://www.kaggle.com/c/dunnhumbychallenge

[1]: https://www.reddit.com/r/kaggle/comments/essuk1/reminder_you...

[2]: https://www.kaggle.com/discussions/product-feedback/108594