by namuorg 660 days ago
For structured content (e.g. lists of items, simple tables), you really don’t need LLMs.

I recently built a web scraper that works automatically on any website [0]. The initial version used AI, but I found that heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).
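
The comment does not say which heuristics are used, so the following is only a guess at the general flavor, not the commenter's method: find the parent element whose direct children most often share the same tag and class, and treat those children as the "list of items". The function name `find_repeated_items` and the `min_items` threshold are made up for illustration, and the sketch assumes well-formed markup so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeated_items(markup, min_items=3):
    """Return the largest group of sibling elements sharing tag and class.

    Hypothetical heuristic: repeated siblings with the same (tag, class)
    signature are likely to be the rows of a list or table.
    Assumes `markup` is well-formed XML/XHTML.
    """
    root = ET.fromstring(markup)
    best = []
    for parent in root.iter():
        children = list(parent)
        if not children:
            continue
        # Count how often each (tag, class) signature occurs among siblings.
        counts = Counter((c.tag, c.get("class")) for c in children)
        signature, n = counts.most_common(1)[0]
        if n >= min_items and n > len(best):
            best = [c for c in children if (c.tag, c.get("class")) == signature]
    return best


if __name__ == "__main__":
    page = (
        "<body><nav><a>Home</a><a>About</a></nav>"
        "<ul><li class='item'>A</li><li class='item'>B</li>"
        "<li class='item'>C</li><li class='ad'>x</li></ul></body>"
    )
    for item in find_repeated_items(page):
        print("".join(item.itertext()))
```

A real page would need a forgiving HTML parser and more signals (position, attributes other than class), but the point is that no model call is involved.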

For most websites, the non-AI approach works incredibly well so I’d make sure AI is really necessary (e.g. data is unstructured, need to derive or format the output based on the page data) before incorporating it.

[0] https://easyscraper.com

2 comments

The LLM is resistant to website updates that would break normal scraping

If you do as the author did and ask it to generate xPaths, you only need the LLM once: use the xPaths it generated for regular scraping, fall back to the LLM to update them once they break, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data is in an unexpected format.
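
A minimal sketch of that three-tier fallback, under some assumptions: `regenerate_xpaths` stands in for whatever LLM call produces new xPaths, `alert_human` for whatever paging or email hook you have, and both are hypothetical callables passed in by the caller. The page is assumed to be well-formed so the standard library parser can handle it; real HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional, Tuple


def extract(page: str, xpaths: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply each xPath; return None if any field is missing or empty."""
    root = ET.fromstring(page)
    record = {}
    for field, path in xpaths.items():
        node = root.find(path)
        text = (node.text or "").strip() if node is not None else ""
        if not text:
            return None
        record[field] = text
    return record


def scrape_with_fallback(
    page: str,
    xpaths: Dict[str, str],
    regenerate_xpaths: Callable[[str], Dict[str, str]],  # hypothetical LLM call
    alert_human: Callable[[str], None],                  # hypothetical alert hook
) -> Tuple[Optional[Dict[str, str]], Dict[str, str]]:
    """Return (record, xpaths_to_cache); record is None if every tier failed."""
    # Tier 1: cheap path, reuse the cached xPaths.
    record = extract(page, xpaths)
    if record is not None:
        return record, xpaths
    # Tier 2: cached xPaths broke, ask the LLM for new ones once.
    fresh = regenerate_xpaths(page)
    record = extract(page, fresh)
    if record is not None:
        return record, fresh
    # Tier 3: still no data, hand off to a person.
    alert_human("scraper produced no data even after regenerating xPaths")
    return None, xpaths
```

The caller stores the returned xPaths, so the LLM is only invoked on the runs where the cached ones stop matching.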

XPath can be based on content, not only positions
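
To make that concrete, here is a small sketch comparing a positional path with a content-based one when a row is inserted above the target. The table markup and field names are invented for the example. It uses Python's standard library, which supports only a subset of XPath; with a full XPath 1.0 engine you could also write something like `//tr[th[contains(., 'Price')]]/td`.

```python
import xml.etree.ElementTree as ET

# Invented markup: the "new" layout adds a row above the ones we care about.
OLD = """<table>
<tr><th>Name</th><td>Widget</td></tr>
<tr><th>Price</th><td>9.99</td></tr>
</table>"""

NEW = """<table>
<tr><th>SKU</th><td>W-1</td></tr>
<tr><th>Name</th><td>Widget</td></tr>
<tr><th>Price</th><td>9.99</td></tr>
</table>"""

POSITIONAL = "./tr[2]/td"          # "the cell in the second row"
BY_CONTENT = "./tr[th='Price']/td"  # "the cell in the row labelled Price"


def first_text(markup, path):
    """Return the text of the first match, or None if nothing matches."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


if __name__ == "__main__":
    for label, markup in (("old", OLD), ("new", NEW)):
        print(label, first_text(markup, POSITIONAL), first_text(markup, BY_CONTENT))
```

The positional path silently returns the wrong cell after the layout change, while the content-based one keeps pointing at the price.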
I normally use query selectors for scraping; I'm not sure whether that would work better.
Unless the update is splitting cells.

> Turns out, a simple table from Wikipedia (Human development index) breaks the model because rows with repeated values are merged

Nah, still correct :-) that would break the regular scraping as well

> The LLM is resistant to website updates that would break normal scraping

This is absolutely true, but it does have to be weighed against the performance benefits of something that doesn't require invoking an LLM to operate.

If the cost of updating some xPath things every now and then is relatively low (which I guess means "your target site is not actively & deliberately obfuscating their website specifically to stop people scraping it"), running a basic xPath scraper would be maybe multiple orders of magnitude more efficient.

Using LLMs to monitor the changes and generate new xPaths is an awesome idea though - it takes the expensive part of the process and (hopefully) automates it away, so you get the benefits of both worlds.

Yours is the first one I've seen that lets you scrape by directly selecting what to scrape. I always wondered why there was no tool doing that.
I've seen another website like this with this feature on Hacker News, but it was in a retrospective. These websites have a nasty habit of ceasing operations.
It needs to be a library.