by hubraumhugo
1180 days ago
Exactly, semantically understanding the website structure is only one challenge of many with web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)
* Handling large data volumes
* Managing proxy infrastructure
* Elements of RPA to automate scraping tasks like pagination, login, and form-filling

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps. Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)
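To make the "avoiding hallucination" point concrete, here is a minimal sketch of one possible accuracy check: flag any extracted field whose value does not appear verbatim in the page text. This is an illustration only, not how Kadoa or any particular tool works; the function names, the sample page, and the sample record are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(record: dict[str, str], page_text: str) -> list[str]:
    """Return the names of fields whose value is not found verbatim in the page text."""
    haystack = normalize(page_text)
    return [name for name, value in record.items() if normalize(value) not in haystack]


# Hypothetical page text and a hypothetical LLM-extracted record.
page_text = "Acme Widget\n  Price: $19.99   In stock"
record = {"name": "Acme Widget", "price": "$19.99", "sku": "AW-1001"}

print(ungrounded_fields(record, page_text))  # → ['sku']
```

A substring check like this only catches values that were invented outright. It would not catch a real value assigned to the wrong field, and it would wrongly flag values the extractor legitimately reformatted (dates, currencies), so in practice it would be one check among several.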
The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?