by namuorg 660 days ago

For structured content (e.g. lists of items, simple tables), you really don’t need LLMs.

I recently built a web scraper to automatically work on any website [0] and built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).
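
To illustrate what an attribute-and-position heuristic can look like (a minimal sketch, not necessarily how the scraper above works), the snippet below picks the largest group of sibling elements that share a tag and `class`. The name `find_repeated_items` and the sample markup are made up for the example, and the standard-library parser used here only accepts well-formed markup, so real-world HTML would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeated_items(root):
    """Return the largest group of siblings sharing the same tag and class."""
    best = []
    for parent in root.iter():
        # Signature of each direct child: (tag name, class attribute or None).
        counts = Counter((child.tag, child.get("class")) for child in parent)
        if not counts:
            continue
        signature, n = counts.most_common(1)[0]
        # A "list" needs at least two look-alike siblings; keep the biggest group.
        if n >= 2 and n > len(best):
            best = [c for c in parent if (c.tag, c.get("class")) == signature]
    return best


# Hypothetical sample page: a two-link nav and a three-item list.
SAMPLE = """
<html><body>
  <nav><a href="/">Home</a><a href="/about">About</a></nav>
  <ul>
    <li class="item">Apple</li>
    <li class="item">Banana</li>
    <li class="item">Cherry</li>
  </ul>
</body></html>
"""

items = find_repeated_items(ET.fromstring(SAMPLE.strip()))
print([item.text for item in items])
```

No model call is involved, which is where the speed and cost advantage would come from.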

For most websites, the non-AI approach works incredibly well, so I’d make sure AI is really necessary (e.g. the data is unstructured, or the output has to be derived or reformatted from the page data) before incorporating it.

[0] https://easyscraper.com

2 comments

sebstefan 659 days ago

The LLM is resistant to website updates that would break normal scraping

If you do as the author did and ask it to generate XPaths, you can call it once and then use the XPaths it generated for regular scraping. Once they break, fall back to the LLM to update the XPaths, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data is in an unexpected format.
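
A rough sketch of that fallback chain, under some assumptions: `regenerate_xpath` and `alert` are hypothetical callables (the first stands in for the LLM call, the second for whatever notifies a person), "broken" is taken to mean "the XPath returned no text", and the standard-library parser is used, which only handles well-formed markup and a subset of XPath.

```python
import xml.etree.ElementTree as ET


def extract(page, xpath):
    """Return the stripped, non-empty text of every node matching xpath."""
    root = ET.fromstring(page)
    try:
        nodes = root.findall(xpath)
    except (SyntaxError, KeyError):
        # Treat an unsupported or malformed path like a path that matched nothing.
        return []
    return [n.text.strip() for n in nodes if n.text and n.text.strip()]


def scrape_with_fallback(page, xpath, regenerate_xpath, alert):
    """Try the cached XPath, then a regenerated one, then alert a human.

    Returns (values, xpath_to_cache_for_next_time).
    """
    values = extract(page, xpath)
    if values:
        return values, xpath  # cheap path: no LLM involved
    new_xpath = regenerate_xpath(page)  # hypothetical LLM call
    values = extract(page, new_xpath)
    if values:
        return values, new_xpath
    alert(f"no data from {xpath!r} or regenerated {new_xpath!r}")
    return [], xpath


# Demo with a stub in place of the LLM: the site renamed its price element.
page_v2 = '<html><body><span class="cost">12</span></body></html>'
alerts = []
values, cached = scrape_with_fallback(
    page_v2,
    ".//div[@class='price']",               # stale XPath
    lambda page: ".//span[@class='cost']",  # stubbed "LLM"
    alerts.append,
)
print(values, cached, alerts)
```

A format check on the extracted values could be added next to the emptiness check to cover the "unexpected format" case.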

is_true 659 days ago

XPath can be based on content, not only positions.
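
For example (a small sketch with made-up markup, using the XPath subset supported by Python's standard library), a positional path picks up the wrong cell once a row is inserted, while a path keyed on the header text still finds the price:

```python
import xml.etree.ElementTree as ET

# Hypothetical product table, before and after a redesign that inserts a row.
BEFORE = (
    "<table>"
    "<tr><th>Name</th><td>Widget</td></tr>"
    "<tr><th>Price</th><td>10</td></tr>"
    "</table>"
)
AFTER = (
    "<table>"
    "<tr><th>Name</th><td>Widget</td></tr>"
    "<tr><th>SKU</th><td>A1</td></tr>"
    "<tr><th>Price</th><td>10</td></tr>"
    "</table>"
)

BY_POSITION = "./tr[2]/td"          # second row, whatever it contains
BY_CONTENT = "./tr[th='Price']/td"  # the row whose header cell reads "Price"


def first_text(page, xpath):
    """Return the text of the first node matching xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return None if node is None else node.text


for label, page in (("before", BEFORE), ("after", AFTER)):
    print(label, first_text(page, BY_POSITION), first_text(page, BY_CONTENT))
```

Full XPath engines also offer `text()` and `contains()` for looser matching.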

sebstefan 659 days ago

I normally use query selectors for scraping, so I'm not sure if that'd work better.

melenaboija 659 days ago

Unless the update is splitting cells.

> Turns out, a simple table from Wikipedia (Human development index) breaks the model because rows with repeated values are merged

sebstefan 659 days ago

Nah, still correct :-) that would break the regular scraping as well

> The LLM is resistant to website updates that would break normal scraping

trog 659 days ago

This is absolutely true, but it does have to be weighed against the performance benefits of something that doesn't require invoking an LLM to operate.

If the cost of updating some XPaths every now and then is relatively low (which I guess means "your target site is not actively & deliberately obfuscating its website specifically to stop people scraping it"), running a basic XPath scraper would be maybe multiple orders of magnitude more efficient.

Using LLMs to monitor the changes and generate new XPaths is an awesome idea though - it takes the expensive part of the process and (hopefully) automates it away, so you get the best of both worlds.

poulpy123 659 days ago

Yours is the first one I've seen that lets you scrape by directly selecting what to scrape. I always wondered why there was no tool doing that.

sebstefan 659 days ago

I've seen another website with this feature on Hacker News, but it was in a retrospective. These websites have a nasty habit of ceasing operations.

echelon 659 days ago

It needs to be a library.
