Hacker News
by plaidfuji 557 days ago
I’m not sure I would call this a failure. It’s more something you tried out of curiosity and abandoned, which happens to literally everyone. “Failed” to me would imply there was something fundamentally broken about the approach or the dataset, or that the unrealized result had an actual negative impact. It’s very hard to finish long-running side projects that aren’t generating income or attention, or that aren’t driven by some quasi-pathological obsession. The fact that you even blogged about it and made the HN front page qualifies as a success in my book.

> If I would have finished the project, this dataset would then have been released and used for a number of analyses using Python.

Nothing stopping you from releasing the raw dataset and calling it a success!

> Back then, I would have trained a specialised model (or used a pretrained specialised model) but since LLMs made so much progress during the runtime of this project from 2020-Q1 to 2024-Q4, I would now rather consider a foundational model wrapped as an AI agent instead; for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.

I actually just started (and subsequently --abandoned-- paused) my own news analysis side project leveraging LLMs for consolidation/aggregation, and yeah, the web scraping part is still the worst. And I’ve had the same thought that feeding raw HTML to the LLM might be an easier way of parsing web objects now. The problem is that most sites are wise to scraping efforts, and it’s not so much a matter of finding the right element as bypassing the weird click-thru screens, tricking the site into thinking you’re on a real browser, etc…
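
For the "finding the right element" half of the problem, here is a minimal sketch of how I might shrink a page down to a short list of candidate links before handing anything to a model. Everything in it is an assumption for illustration: the names `LinkCollector` and `candidate_links` are hypothetical, the HTML fragment is made up rather than taken from any real site, and it does nothing about click-thru screens or bot detection.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def candidate_links(html, keyword):
    """Return links whose anchor text mentions keyword (case-insensitive)."""
    parser = LinkCollector()
    parser.feed(html)
    return [(h, t) for h, t in parser.links if keyword.lower() in t.lower()]


# Made-up page fragment, for illustration only.
SAMPLE_HTML = """
<ul>
  <li><a href="/sendung/a.html">Wetter</a></li>
  <li><a href="/sendung/b.html">tagesschau 20:00 Uhr</a></li>
  <li><a href="/impressum">Impressum</a></li>
</ul>
"""

print(candidate_links(SAMPLE_HTML, "20:00"))
```

The idea is only that a short list of (href, text) pairs is a much smaller prompt than a full page of raw HTML; whether a model then picks the right one reliably is something I haven't verified.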

2 comments

Personally, I think it's helpful to feel disappointment and insufficiency when those emotions pop up. They are the voices of certain preferences, needs, and/or desires that work to enrich our lives. Recontextualizing the world into some kind of positive success story can often gaslight those emotions out of existence, which can, paradoxically, be self-sabotaging.

The piece reads to me like a direct and honest confrontation with failure. It means the author thinks they can do better and is working to identify unhelpful subconscious patterns and overcome them.

Personally, I found the author's laser focus on "data science projects" intriguing. I have a tendency to immediately go meta which biases towards eliding detail; however, even if overly narrow, the author's focus does end up precipitating out concrete, actionable hypotheses for improvement.

Bravo, IMHO.

> Nothing stopping you from releasing the raw dataset and calling it a success!

Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. Do you just want to get sentiment on a specific topic (e.g. vaccination, German energy supplies, German govt approval), or quantitative predictions? Start with something easy.

> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.

Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?

> and yeah, the web scraping part is still the worst.

Sounds wrong. OP, fix your scraping (unless it was anti-AI heuristics that kept breaking it, which I doubt since it's Tagesschau). Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
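
To make the RSS suggestion concrete, here is a rough sketch that parses RSS 2.0 text with the Python standard library and keeps items inside a date range. Assumptions, all mine: the feed follows plain RSS 2.0 with `title`, `link`, and `pubDate` per item; the sample feed below is made up and is not real Tagesschau output; `parse_rss_items` and `items_in_range` are hypothetical helper names; fetching the actual feed from the page linked above is left out.

```python
import xml.etree.ElementTree as ET
from datetime import date
from email.utils import parsedate_to_datetime


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, published) from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils parses.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


def items_in_range(items, start, end):
    """Keep items published between start and end (inclusive dates)."""
    return [
        it for it in items
        if it["published"] is not None
        and start <= it["published"].date() <= end
    ]


# Made-up feed content, for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Energy supply update</title>
      <link>https://example.org/a</link>
      <pubDate>Mon, 04 Apr 2022 20:00:00 +0200</pubDate>
    </item>
    <item>
      <title>Older story</title>
      <link>https://example.org/b</link>
      <pubDate>Sat, 01 Jan 2022 20:00:00 +0100</pubDate>
    </item>
  </channel>
</rss>"""

all_items = parse_rss_items(SAMPLE_FEED)
spring = items_in_range(all_items, date(2022, 4, 1), date(2022, 8, 31))
print([it["title"] for it in spring])
```

One caveat: a feed generally only carries recent items, so this helps with collecting going forward, not with backfilling 2020.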

Compare to: the Kaggle dataset "10k German News Articles for topic classification", from Schabus, Skowron, Trapp, SIGIR 2017 [https://www.kaggle.com/datasets/abhishek/10k-german-news-art...]

I'll put a shoutout for https://zenodo.org/ and https://figshare.com/ as places to put your data, where you'll get a DOI and can let someone that's not a company look after hosting and backing it up. Zenodo is hosted as long as CERN is around (is the promise) and figshare is backed by the CLOCKSS archive (multiple geographically distributed universities).

Right.

Google acquired Kaggle in 2017, and Appen acquired Figure Eight (formerly CrowdFlower) in 2019. Both used to be open-source-friendly places to post datasets for useful comments, analyses, and crowdsourced hacking, generally without heavy and restrictive license terms. (There is also still the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/.) Kaggle may still be, just beware of the following:

Kaggle at some point began silently disappearing some (commercial) datasets from useful old competitions (such as dunnhumby's Shopping Challenge 2011 [0], even though it was anonymized and only had three features). So you can't rely on the more commercial datasets staying available for citation and replicability.

Also, according to [1] "you can be banned on Kaggle without any warnings or reasons, all your kernels and datasets will became inaccessible even for downloading for yourself and support will not answer you for weeks (if ever)". Usually, IME, I'd heard it's on (AI-based) suspicion of cheating (using multiple accounts to bypass submission limits, or collusion between teams on submissions) or, post-2018, gaming and account-warming/transfer to boost rankings. But the AI might produce false positives, and it's reportedly nearly impossible to reach live human support.

Kaggle added DOIs in 2019 [2], at least for academic datasets, though not by default.

[0]: https://www.kaggle.com/c/dunnhumbychallenge

[1]: https://www.reddit.com/r/kaggle/comments/essuk1/reminder_you...

[2]: https://www.kaggle.com/discussions/product-feedback/108594