
by stuartaxelowen 1180 days ago

In my experience, the hard part is not extracting data from websites but observing and implementing the actual structure of the site (e.g. iTunes categories have apps, which have reviews, and so on) and making your scraper intelligent enough to use that structure to gather the freshest data efficiently.
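To make the structure point concrete, here is a minimal sketch, assuming a site shaped like the category/app/review example. The `Node` type, the `REVISIT_SECONDS` intervals, and `crawl_order` are all hypothetical names invented for illustration, and the intervals are guesses, not measured values. The idea is only that knowing which kind of page you are looking at lets you decide what to refetch first.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    kind: str  # "category", "app", or "review" (illustrative kinds)
    key: str   # hypothetical unique identifier for the page
    children: List["Node"] = field(default_factory=list)


# Assumed revisit intervals in seconds: deeper levels change more often.
REVISIT_SECONDS = {"category": 86_400, "app": 3_600, "review": 600}


def crawl_order(root: Node, last_fetched: Dict[str, float], now: float) -> List[str]:
    """Return keys of pages that are due for a refetch, most overdue first."""
    due = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Pages never fetched before count as infinitely old.
        age = now - last_fetched.get(node.key, float("-inf"))
        overdue = age - REVISIT_SECONDS[node.kind]
        if overdue >= 0:
            due.append((-overdue, node.key))
        stack.extend(node.children)
    # Sort by how overdue a page is, breaking ties by key.
    return [key for _, key in sorted(due)]


if __name__ == "__main__":
    reviews = Node("review", "chess/reviews")
    app = Node("app", "chess", [reviews])
    root = Node("category", "games", [app])
    now = 100_000.0
    seen = {"games": now - 100, "chess": now - 7_200, "chess/reviews": now - 700}
    print(crawl_order(root, seen, now))
```

A real scraper would also have to discover the children by fetching the parent page; this sketch takes the tree as given and only shows the scheduling decision.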

There is definitely a place for LLMs in solving this problem: taking over from the human in interpreting the business goals and the data to gather, alongside the data available on the web. My experiments, though, have shown that this is hard, due to limited LLM context length and the difficulty of distilling messy data. Still, I'm very excited to keep pushing and see where things go :)
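One common workaround for the context limit is to split scraped text into pieces that fit a budget before sending each piece to the model. The sketch below assumes whitespace-separated words are an acceptable stand-in for tokens, which is only a rough approximation; `chunk_by_budget` and `max_words` are hypothetical names, not part of any library.

```python
from typing import List


def chunk_by_budget(blocks: List[str], max_words: int) -> List[str]:
    """Pack text blocks into chunks of at most max_words words each.

    Words are a crude proxy for tokens (an assumption of this sketch).
    Blocks longer than the budget are split into budget-sized pieces.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    chunks: List[str] = []
    current: List[str] = []
    for block in blocks:
        words = block.split()
        for start in range(0, len(words), max_words):
            piece = words[start:start + max_words]
            # Close the current chunk if this piece would overflow it.
            if len(current) + len(piece) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_budget(["a b c", "d e", "f g h i j"], max_words=4):
        print(chunk)
```

This keeps each block whole where it can, which matters for messy pages: cutting in the middle of a record tends to make the distillation step harder, not easier.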

Note: I build https://www.thoughtvector.io/pointscrape/ to solve very-large-scale web-data gathering problems like these.

1 comment

krsdcbl 1180 days ago

Context limitations are an issue here, but this is definitely a use case where LLMs can shine, while other methods will quickly fail or need to be highly specific to their target.

Structuring and categorising unknown content and its taxonomies works astonishingly well with minimal configuration, and it used to be an extremely difficult problem.
