by hubraumhugo 1180 days ago

Exactly, semantically understanding the website structure is only one challenge of many with web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)

* Handling large data volumes

* Managing proxy infrastructure

* Elements of RPA to automate scraping tasks like pagination, login, and form-filling

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.

Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)
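To make the data accuracy point concrete, here is a minimal sketch of one possible hallucination check: flag any extracted field whose value does not appear verbatim in the page text. This is only an illustration, not a description of how Kadoa works; the function name `find_ungrounded_fields` and the sample record are hypothetical.

```python
def find_ungrounded_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the names of fields whose values are not found in the page text.

    Hypothetical helper: a verbatim-match check is a crude heuristic and will
    also flag values that the extractor legitimately reformatted.
    """
    # Collapse whitespace and lowercase so layout differences do not matter.
    normalized_page = " ".join(page_text.split()).lower()
    missing = []
    for field, value in record.items():
        normalized_value = " ".join(str(value).split()).lower()
        # Empty values are treated as ungrounded, since "" is a substring of anything.
        if not normalized_value or normalized_value not in normalized_page:
            missing.append(field)
    return missing


# Made-up example data: the "sku" value never appears on the page.
page_text = "Acme Widget\n  Price: $19.99\nIn stock"
record = {"name": "Acme Widget", "price": "$19.99", "sku": "AW-1234"}
ungrounded = find_ungrounded_fields(record, page_text)
print(ungrounded)
```

A check like this only catches invented values. It says nothing about values that exist on the page but were assigned to the wrong field, which is part of why accuracy at scale is hard.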

1 comment

ec109685 1178 days ago

It's frustrating that the only option to learn more is to book a demo, and that things like the API documentation are dead ends: https://www.kadoa.com/kadoa-api

The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?