by shopvaccer 999 days ago
What kind of websites? Do you mean social media sites that are obfuscated to prevent scraping? I suppose it would have to be quite reliable.

I don't know how relevant this is, but I was thinking you could probably use some sort of AI to enhance OCR and convert written documents into a semantic format like HTML or LaTeX. That would let you scrape information from books, and written books still hold a lot of untapped knowledge.

It seems like the demand for web scraping is to create datasets for ML training. And now you are using AI for scraping, so it is sort of a self-improving cycle.

1 comment

Not specifically social media sites. Getting past the scraping prevention would be difficult, and there are already a lot of companies working on scraping popular social media sites.

Interesting idea. We're definitely looking into coupling OCR and LLMs, but not for that particular case. I think raw language models with a good workflow are typically good enough to extract structured data from things like books.

ML training is definitely one area where we can see this being useful. General data aggregation across a large industry (clothing, retail, etc.) is something we want to look into, as are RPA-style workflows involving multi-click actions across a variety of sites.