Hacker News
by smcin 557 days ago
That's what I just concluded. I think the OP was oversold on the idea of using AI to do scraping, NLP and summarization, all in one go.

Best practice (for many reasons) is to separate scraping (and OCR) from the rest of the pipeline, and to store both the raw text or raw HTML/JS and the parsed intermediate result (cleaned scraped text or HTML, with all the useless parts/tags removed). That intermediate result is then the input to the rest of the pipeline. You really want to keep those stages separate, both to minimize costs and to prevent breakage when the site format changes, anti-scraping heuristics change, etc. And not exposing garbage tags to the AI saves you time and money.
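
A rough sketch of what that separation could look like, using only the Python standard library. Everything here is an assumption for illustration: the names `clean_html` and `store_page`, the `SKIP_TAGS` list, and the on-disk layout (`raw.html`, `clean.txt`, `meta.json`) are hypothetical, and the actual fetching step is left out on purpose so the storage stage can be rerun or tested without touching the site.

```python
import hashlib
import json
from html.parser import HTMLParser
from pathlib import Path

# Tags whose contents are treated as noise for downstream NLP.
# This list is an assumption; tune it per site.
SKIP_TAGS = {"script", "style", "noscript", "svg", "nav", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring everything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any skipped tag.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(raw_html: str) -> str:
    """Parsed intermediate result: visible text only, one chunk per line."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.chunks)


def store_page(url: str, raw_html: str, root: Path) -> Path:
    """Persist the raw HTML and the cleaned text side by side.

    The raw copy lets you re-parse later if the cleaning rules change;
    the cleaned copy is what the rest of the pipeline (NLP, LLM) reads.
    """
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    page_dir = root / key
    page_dir.mkdir(parents=True, exist_ok=True)
    (page_dir / "raw.html").write_text(raw_html, encoding="utf-8")
    (page_dir / "clean.txt").write_text(clean_html(raw_html), encoding="utf-8")
    (page_dir / "meta.json").write_text(json.dumps({"url": url}), encoding="utf-8")
    return page_dir
```

The summarization/AI stage would then read only `clean.txt`, so a site redesign means fixing the cleaning rules and re-running them over the stored `raw.html`, not re-scraping or paying to re-process everything.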