Hacker News
by TheTaytay 411 days ago
Does anyone know of a scraper that uses LLMs/natural language to build a deterministic, robust script that I can use to scrape the same site in the future? All of the natural language extractors I’ve seen so far need an LLM every time, but that seems unnecessary…
3 comments

llm-scraper [1] does a decent job but it's still a bit fragile. The biggest problem I have is all the React CSS-in-JS libraries that use hashes in their class names, which the LLM isn't smart enough to ignore.

[1] https://github.com/mishushakov/llm-scraper
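
For what it's worth, one way to work around the hashed class names is to strip them before the HTML ever reaches the LLM, so that generated selectors can only lean on the stable names. A minimal sketch, assuming the hashes follow a few common conventions (Emotion/MUI-style `css-…`, styled-components-style `sc-…`, JSS-style `jss42`). The patterns and the helper names `is_generated_class` and `stable_classes` are my own illustration, not part of llm-scraper, and the heuristic will miss some libraries and occasionally flag a hand-written name:

```python
import re

# Heuristic patterns for machine-generated class names.
# These are assumptions about common CSS-in-JS output, not an exhaustive list.
GENERATED_CLASS_PATTERNS = [
    re.compile(r"^css-[a-z0-9]{5,}(?:-[\w-]+)?$"),  # e.g. css-1q2w3e4 or css-1q2w3e4-MuiButton-root
    re.compile(r"^sc-[A-Za-z0-9]{5,}$"),            # e.g. sc-bdVaJa
    re.compile(r"^jss\d+$"),                        # e.g. jss42
]


def is_generated_class(name: str) -> bool:
    """Return True if the class name looks like a build-time hash."""
    return any(pattern.match(name) for pattern in GENERATED_CLASS_PATTERNS)


def stable_classes(class_attr: str) -> list[str]:
    """Keep only the class names that are likely to survive a redeploy."""
    return [name for name in class_attr.split() if not is_generated_class(name)]


if __name__ == "__main__":
    print(stable_classes("event-card css-1q2w3e4 sc-bdVaJa jss42"))
```

Running this over every `class` attribute before prompting leaves the LLM with things like `event-card`, tag names, and attributes to build selectors from, which tends to survive rebuilds better than a hash would.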

What have you had success doing with this? Curious to test it
I mostly use it to aggregate event calendars for all the concert/sport/etc. venues, meetups, and clubs in my area, plus some other scraping tasks. I host a little wrapper around llm-scraper on a DigitalOcean droplet that I call from Val.town scripts.

I only check most places once a week, so there I use the LLM to do the scraping. In the few cases where I have to scrape thousands of pages very frequently, I use the more deterministic script it generates instead.
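
Roughly, the split between the two modes can be sketched like this. It is only an illustration of the rule of thumb: the `Site` fields, the `choose_strategy` name, and the 100-pages-per-week cutoff are all made up for the example, not numbers from my actual setup:

```python
from dataclasses import dataclass


@dataclass
class Site:
    url: str
    pages_per_run: int   # how many pages one scrape touches
    runs_per_week: int   # how often the scrape is repeated


def choose_strategy(site: Site, max_llm_pages_per_week: int = 100) -> str:
    """Pick the LLM for low-volume sites and the generated script otherwise.

    The threshold is a hypothetical budget, chosen only for illustration.
    """
    weekly_pages = site.pages_per_run * site.runs_per_week
    if weekly_pages <= max_llm_pages_per_week:
        return "llm"
    return "generated_script"


if __name__ == "__main__":
    venue = Site("https://example.com/events", pages_per_run=3, runs_per_week=1)
    listings = Site("https://example.com/listings", pages_per_run=2000, runs_per_week=14)
    print(choose_strategy(venue), choose_strategy(listings))
```

The idea is that the LLM path tolerates layout changes at a per-page cost, while the generated script is cheap per page but needs regenerating when the site changes.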

Oh, I'm interested in doing something similar. Is it hard to do?
Great thanks!
Nice! Thanks!
We’ve built one internally using browser-use to generate Playwright code.

Works OK. Not as automated as I’d like.

They are all quite bad.