Personally, this feels like the direction scraping should move in: from defining how to extract to defining what to extract. But we're nowhere near that (yet). A few other thoughts from someone who did his best to implement something similar:

1) I'm afraid this is not even close to cost-effective yet: one CSS rule vs. a whole LLM. A first step could be moving the LLM to the client side, reducing costs and latency.

2) As with every other LLM-based approach so far, this will just hallucinate results if it's not able to scrape the desired information.

3) I feel that providing the model with a few examples could be highly beneficial, e.g. /person1.html -> name: Peter, /person2.html -> name: Janet. When doing this, I tried my best at defining meaningful interfaces.

4) Scraping has more edge cases than one can imagine. One example is nested lists or dicts, or mixes thereof. See the test cases in my repo. This is where many libraries/services already fail.

If anyone wants to check out my (statistical) attempt to automatically build a scraper by defining just the desired results: https://github.com/lorey/mlscraper
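To make point 2 a bit more concrete, here is a minimal sketch of one possible guard against hallucinated results: check that every scraped value actually occurs in the page's text, walking nested dicts and lists as in point 4. This is not how mlscraper or any LLM-based scraper works; the function names (`visible_text`, `leaf_values`, `ungrounded_values`) are made up for illustration, and it uses only Python's standard library.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects all text nodes of an HTML document (including script/style text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def _normalize(text):
    # Collapse whitespace and lowercase so formatting differences do not matter.
    return re.sub(r"\s+", " ", text).strip().lower()


def visible_text(page_html):
    """Return the page's text nodes joined by spaces and normalized."""
    collector = _TextCollector()
    collector.feed(page_html)
    collector.close()
    # Limitation: a word split across tags (e.g. Pe<b>ter</b>) becomes "pe ter".
    return _normalize(" ".join(collector.chunks))


def leaf_values(result):
    """Yield every scalar in an arbitrarily nested mix of dicts and lists."""
    if isinstance(result, dict):
        for value in result.values():
            yield from leaf_values(value)
    elif isinstance(result, (list, tuple)):
        for item in result:
            yield from leaf_values(item)
    elif result is not None:
        yield str(result)


def ungrounded_values(result, page_html):
    """Return the scraped values that do not appear anywhere in the page text."""
    text = visible_text(page_html)
    return [v for v in leaf_values(result) if _normalize(v) not in text]


if __name__ == "__main__":
    page = "<div><h1>Peter</h1><ul><li>Python</li><li>Go</li></ul></div>"
    scraped = {"name": "Peter", "skills": ["Python", "Rust"], "address": {"city": "Berlin"}}
    print(ungrounded_values(scraped, page))  # ['Rust', 'Berlin']
```

A substring check like this only catches values that are invented outright. It won't notice a real value assigned to the wrong field, and it will flag legitimately reformatted values (dates, prices), so treat it as a cheap first filter at best.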
Regarding 3 & 4:
Definitely take a look at the existing examples in the docs. I was particularly surprised at how well it handled nested dicts etc. (Not to say that there aren't tons of cases it won't handle; GPT-4 is just astonishingly good at this task.)
Your project looks very cool too btw! I'll have to give it a shot.