by dankwizard 557 days ago
I don't speak the language, so maybe what you're scraping isn't in this list, but why do it manually when they seem to have comprehensive RSS feeds? [1]

Automating this part should have been day 1.

[1] https://www.tagesschau.de/infoservices/rssfeeds
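For what it's worth, pulling headlines out of an RSS feed takes only a few lines. This is a rough sketch using just the Python standard library, assuming the feeds are plain RSS 2.0; `parse_rss_items`, `fetch_feed` and `SAMPLE_FEED` are made-up names for illustration, and the actual feed URL would have to be picked from the page in [1].

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict (title, link, published) per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


def fetch_feed(url, timeout=10):
    """Download a feed and parse it. The URL is whatever feed you choose (assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_rss_items(resp.read())


# Hypothetical sample feed, for illustration only (not real site data).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a</link>
      <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/b</link>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["published"], entry["title"], entry["link"])
```

Run something like this on a schedule and the manual step goes away; the summarization part can then work from the collected links.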

1 comment

That's what I just concluded. I think the OP was oversold on the idea of using AI to do scraping, NLP and summarization, all in one go.
Best practice (for many reasons) is to keep scraping (and OCR) as its own stage and store both the raw text or raw HTML/JS and the parsed intermediate result (cleaned scraped text or HTML, with all the useless parts/tags removed). That intermediate result is then the input to the rest of the pipeline. You really want to separate those, both to minimize costs and to prevent breakage when the site format changes, anti-scraping heuristics change, etc. And not exposing garbage tags to the AI saves you time/money.
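To make the separation concrete, here is a minimal sketch of the "store raw, store cleaned" stage, standard library only. Everything in it is an assumption for illustration: the `SKIP_TAGS` list, the `raw/` and `clean/` directory layout, and the `clean_html` / `store_page` names. A real pipeline would tune the tag list per site and probably use a proper HTML library.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path

# Tags whose contents are treated as noise (an assumed list; adjust per site).
SKIP_TAGS = {"script", "style", "nav", "footer", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def clean_html(raw_html):
    """Return the visible text of a page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.chunks)


def store_page(url, raw_html, root):
    """Write the untouched HTML and the cleaned text side by side under root."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    raw_path = Path(root) / "raw" / f"{key}.html"
    clean_path = Path(root) / "clean" / f"{key}.txt"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(raw_html, encoding="utf-8")
    clean_path.write_text(clean_html(raw_html), encoding="utf-8")
    return raw_path, clean_path
```

The point of keeping the raw copy is that when the site layout changes you only rerun `clean_html` over what is already on disk, with no re-scraping, and the AI step only ever reads the files under `clean/`.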