Hacker News
by captainmuon 1084 days ago
Neat, I was just looking for something like this today. I think I'll give it a spin.

Does anybody here have experience with metadata extraction using LLMs? I've been thinking about it recently and wonder if just making a big prompt and putting that into OpenGPT or even ChatGPT is really the way to go, or if there is a "cleverer" way. Maybe you could train specifically for certain fields, or use the LLM in a different way (like you can use the embeddings directly to do similarity search)?

Another idea: if you have a lot of similar HTML documents, don't ask the LLM for the metadata itself, but for CSS selectors that match the elements holding the metadata fields, assuming it can deal with HTML and the data is in there verbatim. Then you should be able to get much more consistent results, since the same selectors are applied to every document.
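
For what it's worth, here is a minimal sketch of the selector idea, assuming the LLM has already returned one simple selector per field. The `SELECTORS` mapping and the sample page are made up for illustration. The matcher only understands `tag`, `.class`, `#id`, `tag.class`, and `tag#id`, and it uses Python's standard `html.parser` to stay dependency-free; a real setup would probably use a full CSS selector engine.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_simple_selector(selector):
    """Split 'tag.class', 'tag#id', '.class' or '#id' into (tag, id, class)."""
    tag, elem_id, cls = selector, None, None
    if "#" in selector:
        tag, elem_id = selector.split("#", 1)
    elif "." in selector:
        tag, cls = selector.split(".", 1)
    return tag or None, elem_id, cls


class SelectorExtractor(HTMLParser):
    """Collects the text of the first element matching each simple selector."""

    def __init__(self, selectors):
        super().__init__()
        self.rules = {f: parse_simple_selector(s) for f, s in selectors.items()}
        self.results = {}
        self.active = None  # field currently being captured, if any
        self.depth = 0      # nesting depth inside the captured element
        self.chunks = []

    @staticmethod
    def _matches(rule, tag, attrs):
        want_tag, want_id, want_cls = rule
        if want_tag and want_tag != tag:
            return False
        if want_id and attrs.get("id") != want_id:
            return False
        if want_cls and want_cls not in (attrs.get("class") or "").split():
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.active is not None:
            self.depth += 1
            return
        attrs = dict(attrs)
        for field, rule in self.rules.items():
            if field not in self.results and self._matches(rule, tag, attrs):
                self.active, self.depth, self.chunks = field, 1, []
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.active is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace in the captured text.
            self.results[self.active] = " ".join("".join(self.chunks).split())
            self.active = None

    def handle_data(self, data):
        if self.active is not None:
            self.chunks.append(data)


def extract(html, selectors):
    parser = SelectorExtractor(selectors)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical selectors, as an LLM might propose them for one page template.
SELECTORS = {"title": "h1.title", "author": "span#author", "date": "time"}

PAGE = ('<article><h1 class="title">LLMs for <em>scraping</em></h1>'
        '<span id="author">Ada</span><time>2023-05-01</time></article>')

print(extract(PAGE, SELECTORS))
```

The sketch assumes well-formed HTML with properly closed tags. The point is that once the selectors exist, every similar page runs through the same deterministic code, with no LLM in the loop.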

2 comments

We're using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using an LLM for every data extraction, as most comparable tools do, is expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.

Try it out https://kadoa.com
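
To make the "generate once, regenerate on change" idea concrete, here is a rough sketch. It is not a description of how Kadoa works. `fake_llm_generate` is a stand-in for a real LLM call, and the "generated scraper" is just a dict of regexes (field name to pattern with one capture group), which is an assumption chosen to keep the example short.

```python
import re
from typing import Callable, Dict, Optional

Extractor = Dict[str, str]  # field name -> regex with exactly one capture group


def run_extractor(extractor: Extractor, html: str) -> Optional[dict]:
    """Apply the cached patterns; return None if any field cannot be found."""
    record = {}
    for field, pattern in extractor.items():
        match = re.search(pattern, html, re.S)
        if match is None:
            return None
        record[field] = match.group(1).strip()
    return record


class AdaptiveScraper:
    """Reuses a generated extractor and regenerates it only when it stops matching."""

    def __init__(self, generate: Callable[[str], Extractor]):
        self.generate = generate  # hypothetical LLM-backed generator
        self.extractor: Optional[Extractor] = None
        self.llm_calls = 0

    def scrape(self, html: str) -> Optional[dict]:
        if self.extractor is not None:
            record = run_extractor(self.extractor, html)
            if record is not None:
                return record  # cheap path: no LLM involved
        # First page, or the layout changed: ask the generator for new patterns.
        self.extractor = self.generate(html)
        self.llm_calls += 1
        return run_extractor(self.extractor, html)


def fake_llm_generate(sample_html: str) -> Extractor:
    """Stand-in for an LLM call; returns hard-coded patterns for two layouts."""
    if 'class="price"' in sample_html:
        return {"price": r'<span class="price">(.*?)</span>'}
    return {"price": r'<div data-role="price">(.*?)</div>'}


scraper = AdaptiveScraper(fake_llm_generate)
pages = [
    '<span class="price">10</span>',
    '<span class="price">12</span>',
    '<div data-role="price">15</div>',  # simulated site redesign
]
records = [scraper.scrape(page) for page in pages]
print(records, "LLM calls:", scraper.llm_calls)
```

Here three pages cost two generator calls: one up front and one after the simulated redesign. Treating a missing field as a layout change is the simplest possible validation check; anything production-grade would need a stricter one.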

That is a very thought-provoking use case and optimization for LLMs, thanks for sharing.
I gave it some CSS paths extracted from devtools, plus some sample elements with the data that needed extracting, and had it write a Beautiful Soup + regex routine to do the extraction. It worked fine, and it's thousands of times faster than calling the LLM for every page.
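
A sketch of what the regex half of such a routine might look like, assuming the element text has already been pulled out via the CSS path. The field formats (a `SKU:` label, a currency symbol before the amount) and the sample strings are invented for illustration, not taken from the routine described above.

```python
import re

# Currency symbol, optional whitespace, then digits with optional
# thousands separators and an optional decimal part.
PRICE_RE = re.compile(r"(?P<currency>[$€£])\s*(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)")
SKU_RE = re.compile(r"\bSKU[:\s]+(?P<sku>[A-Z0-9-]+)")


def parse_listing(text):
    """Pull a SKU and a numeric price out of one element's text, if present."""
    record = {}
    sku = SKU_RE.search(text)
    if sku:
        record["sku"] = sku.group("sku")
    price = PRICE_RE.search(text)
    if price:
        record["currency"] = price.group("currency")
        record["price"] = float(price.group("amount").replace(",", ""))
    return record


samples = [
    "Widget Pro | SKU: WP-200 | $1,299.00 (in stock)",
    "Gadget | SKU: G-7 | € 45",
]
for text in samples:
    print(parse_listing(text))
```

Fields that don't match are simply left out of the record, which makes it easy to spot elements the patterns don't cover yet.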