| HN Mirror


by hubraumhugo 659 days ago

We've been working on AI-automated web scraping at Kadoa[0], and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a solution that stays cost-effective at scale.

Here is what we ended up with:

- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Calling an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and then adapt it when a website changes is highly efficient.

- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.

- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate the data quality.
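
To make the extraction point concrete, here is a rough sketch of the "generate selectors once, reuse them, regenerate only when the page changes" idea. It is not our production code: `generate_selectors` is a hypothetical stand-in for the LLM codegen call (hard-coded here), the sample pages are made up, and it uses the standard library's limited XPath support on well-formed markup rather than a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up pages: V2 simulates a site redesign that renames the title element.
PAGE_V1 = "<html><body><h1 class='title'>Widget</h1><span class='price'>9.99</span></body></html>"
PAGE_V2 = "<html><body><h2 class='name'>Widget</h2><span class='price'>9.99</span></body></html>"

stats = {"llm_calls": 0}  # counts how often the (expensive) codegen step runs


def generate_selectors(html):
    """Hypothetical stand-in for the LLM codegen step.

    A real system would send the page to a model and get selectors back;
    here the answer is hard-coded so the sketch runs offline.
    """
    stats["llm_calls"] += 1
    title_path = ".//h1[@class='title']" if "<h1" in html else ".//h2[@class='name']"
    return {"title": title_path, "price": ".//span[@class='price']"}


def extract(html, selectors):
    """Cheap, deterministic extraction: apply the stored XPath expressions."""
    root = ET.fromstring(html)
    record = {}
    for field, path in selectors.items():
        node = root.find(path)
        record[field] = node.text.strip() if node is not None and node.text else None
    return record


def scrape(html, cache):
    """Reuse cached selectors; call the generator only when a field goes missing."""
    if "selectors" not in cache:
        cache["selectors"] = generate_selectors(html)
    record = extract(html, cache["selectors"])
    if None in record.values():
        # The page layout probably changed: regenerate once and retry.
        cache["selectors"] = generate_selectors(html)
        record = extract(html, cache["selectors"])
    return record
```

Calling `scrape(PAGE_V1, cache)` repeatedly with the same `cache` dict triggers the generator only on the first call; passing `PAGE_V2` afterwards triggers exactly one regeneration. The per-page cost is a selector lookup, and the model is only paid for when something breaks.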

We quickly realized that doing this for a few data sources with low complexity is one thing; doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.

[0] https://kadoa.com

1 comment

artembugara 658 days ago

this! I've been following Kadoa since its very first days. Great team.
