by stuartaxelowen
1180 days ago
In my experience, the hard part is not extracting data from websites but observing and implementing the actual structure of the site (e.g. iTunes categories have apps, which have reviews, etc.), and making your scraper intelligent enough to use that structure to gather the freshest data efficiently. There is definitely a place for LLMs in solving this problem: taking over from the human in interpreting the business goals and the data to gather, along with the data available on the web. My experiments, though, have shown that this is a hard problem because of limited LLM context length and the difficulty of distilling messy data. Still, I'm very excited to keep pushing and to see where things go :) Note: I build https://www.thoughtvector.io/pointscrape/ to solve very-large-scale web-data gathering problems like these.
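To make the "use the structure to get the freshest data" point concrete, here is a minimal sketch, not how any particular scraper works. It assumes a three-level hierarchy (category, app, review) and a caller-supplied function that returns each node's children with the time they were last fetched. `CHILD_KIND`, `Task`, `crawl`, `FAKE_SITE` and `fake_fetch_children` are all made-up names for illustration.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical site layout: each level names the level beneath it.
CHILD_KIND = {"category": "app", "app": "review"}


@dataclass(order=True)
class Task:
    last_fetched: float               # older timestamp = staler = popped first
    kind: str = field(compare=False)
    key: str = field(compare=False)


def crawl(seeds, fetch_children, budget):
    """Visit up to `budget` nodes, stalest first, descending the hierarchy."""
    queue = list(seeds)
    heapq.heapify(queue)
    visited = []
    while queue and len(visited) < budget:
        task = heapq.heappop(queue)
        visited.append((task.kind, task.key))
        child_kind = CHILD_KIND.get(task.kind)
        if child_kind is None:
            continue  # leaf (a review): nothing below it to expand
        for key, last_fetched in fetch_children(task.kind, task.key):
            heapq.heappush(queue, Task(last_fetched, child_kind, key))
    return visited


# Stand-in for real network calls: (kind, key) -> [(child key, last fetched)].
FAKE_SITE = {
    ("category", "games"): [("chess", 5.0), ("sudoku", 1.0)],
    ("app", "chess"): [("review-1", 2.0)],
    ("app", "sudoku"): [("review-2", 0.0), ("review-3", 3.0)],
}


def fake_fetch_children(kind, key):
    return FAKE_SITE.get((kind, key), [])


order = crawl([Task(0.0, "category", "games")], fake_fetch_children, budget=4)
```

With a budget of four requests, the crawl spends them on the stalest branch (sudoku and its reviews) and never touches the recently fetched chess app. A real system would also need rate limiting, retries and persistence, which this leaves out.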
Structuring and categorising unknown content and its taxonomies works astonishingly well with minimal configuration, and it used to be an extremely difficult problem.