
by lorey 1180 days ago

Personally, this feels like the direction scraping should move in: from defining how to extract to defining what to extract. But we're nowhere near that (yet).

A few other thoughts from someone who did his best to implement something similar:

1) I'm afraid this is not even close to cost-effective yet. One CSS rule vs. a whole LLM. A first step could be moving the LLM to the client side, reducing costs and latency.

2) As with every other LLM-based approach so far, this will just hallucinate results if it's not able to scrape the desired information.

3) I feel that providing the model with a few examples could be highly beneficial, e.g. /person1.html -> name: Peter, /person2.html -> name: Janet. When doing this, I tried my best at defining meaningful interfaces.

4) Scraping has more edge cases than one can imagine. One example is nested lists or dicts, or mixes thereof. See the test cases in my repo. This is where many libraries/services already fail.
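To make point 3 concrete, here is a minimal sketch of how example pages and their expected results could be folded into a few-shot prompt. The function name `build_few_shot_prompt`, the prompt wording, and the sample HTML are all hypothetical; this is not taken from either project mentioned here, and the call to an actual model is left out.

```python
import json


def build_few_shot_prompt(examples, target_html):
    """Build a prompt from (html, expected_result) pairs plus the page to scrape.

    Hypothetical helper: the wording and layout of the prompt are assumptions.
    """
    parts = [
        "Extract the same fields from the last page as in the examples. "
        "Answer with JSON only."
    ]
    for html, expected in examples:
        parts.append("PAGE:\n" + html)
        # sort_keys keeps the examples stable across runs
        parts.append("RESULT:\n" + json.dumps(expected, sort_keys=True))
    # The page we actually want scraped goes last, with an open RESULT slot.
    parts.append("PAGE:\n" + target_html)
    parts.append("RESULT:")
    return "\n\n".join(parts)


# Stand-ins for /person1.html and /person2.html from the comment above.
examples = [
    ("<h1>Peter</h1>", {"name": "Peter"}),
    ("<h1>Janet</h1>", {"name": "Janet"}),
]
prompt = build_few_shot_prompt(examples, "<h1>Maria</h1>")
print(prompt)
```

The expected results are plain JSON, so nested lists or dicts (point 4) could be passed the same way, though whether a model handles them reliably is a separate question.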

If anyone wants to check out my (statistical) attempt to automatically build a scraper by defining just the desired results: https://github.com/lorey/mlscraper

4 comments

tomberin 1180 days ago

I was most worried about #2 but was surprised how much the temperature setting seems to have gotten that under control in my cases. The author added a HallucinationChecker for this but said on Mastodon he hasn't found many real-world cases to test it with yet.
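For illustration, one simple way to catch the failure mode in #2 is to check that every extracted value actually appears in the page text. The sketch below is an assumption about how such a check could look, not the HallucinationChecker mentioned above, whose internals aren't described in this thread. The helper names are hypothetical, and it only handles flat dicts of scalar values.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects all text nodes of a page (script/style content included)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the page text with whitespace collapsed to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def ungrounded_fields(html, extracted):
    """Return the keys whose values do not occur verbatim in the page text."""
    text = visible_text(html).lower()
    return [
        key
        for key, value in extracted.items()
        if " ".join(str(value).split()).lower() not in text
    ]


page = "<div><h1>Peter  Smith</h1><p>Berlin</p></div>"
print(ungrounded_fields(page, {"name": "Peter Smith", "city": "Berlin"}))
print(ungrounded_fields(page, {"name": "Peter Smith", "email": "peter@example.com"}))
```

A substring check like this would flag values the model reformatted (dates, prices) as false positives, so it is a coarse filter at best.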

Regarding 3 & 4:

Definitely take a look at the existing examples in the docs; I was particularly surprised at how well it handled nested dicts etc. (Not to say that there aren't tons of cases it won't handle, GPT-4 is just astonishingly good at this task.)

Your project looks very cool too btw! I'll have to give it a shot.


polishdude20 1180 days ago

This seems like part of the problem we're always complaining about: hardware keeps getting better, but software gets more and more bloated, so performance actually goes down.


specproc 1180 days ago

Yeah, #1 just makes this seem pointless for the time being. The whole point of needing something like this is horizontal scaling.

Also not clear from my phone down the pub if inference is needed at each step. That would be slow, no? Even (especially?) if you owned the model.


tomberin 1180 days ago

No inference is needed. IME it can do a single page in ~10s, $0.01/page. Not practical for most use cases, great for a limited few right now.
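To put those numbers in perspective, here is a back-of-the-envelope calculation using the ~10s and $0.01 per page quoted above. The page count and worker count are made-up inputs, and the parallel case assumes perfect scaling with no rate limits, which is optimistic.

```python
SECONDS_PER_PAGE = 10    # "~10s" per page, from the comment above
DOLLARS_PER_PAGE = 0.01  # "$0.01/page", from the comment above


def crawl_estimate(pages, workers=1):
    """Return (total cost in dollars, wall-clock days) for a crawl.

    Assumes perfect parallelism across workers, which is an idealization.
    """
    cost = pages * DOLLARS_PER_PAGE
    days = pages * SECONDS_PER_PAGE / workers / 86_400  # seconds per day
    return cost, days


for workers in (1, 100):
    cost, days = crawl_estimate(1_000_000, workers=workers)
    print(f"{workers:>3} worker(s): ${cost:,.0f}, {days:.1f} days")
```

At a million pages that comes to roughly $10,000 and about 116 days sequentially, or a bit over a day with 100 parallel workers at the same total cost.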


sebzim4500 1180 days ago

Yeah, seems like it would make way more sense to have an LLM output the CSS rules. Or maybe output something slightly more powerful, but still cheap to compute.
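A rough sketch of that split: the model is asked once for selectors, and every page after that is handled by a cheap local matcher. Everything here is an assumption for illustration. The `rules` dict stands in for what an LLM might return, and the tiny matcher only understands `tag`, `.class`, and `tag.class`; a real setup would more likely use a proper CSS selector library.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SimpleSelectorExtractor(HTMLParser):
    """Collects the text of elements matching 'tag', '.class' or 'tag.class'."""

    def __init__(self, selector):
        super().__init__()
        tag, _, cls = selector.partition(".")
        self._want_tag = tag or None
        self._want_cls = cls or None
        self._depth = 0  # > 0 while inside a matching element
        self.results = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = self._want_tag in (None, tag)
        cls_ok = self._want_cls is None or self._want_cls in classes
        return tag_ok and cls_ok

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self._matches(tag, attrs):
            self._depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data


def extract(html, rules):
    """Apply each field's selector to the page; no model call involved."""
    out = {}
    for field, selector in rules.items():
        parser = SimpleSelectorExtractor(selector)
        parser.feed(html)
        parser.close()
        out[field] = [text.strip() for text in parser.results]
    return out


# Hypothetical stand-in for selectors an LLM produced from one sample page.
rules = {"name": "h1.name", "tags": ".tag"}
page = (
    '<div><h1 class="name">Peter</h1>'
    '<span class="tag">a</span><span class="tag big">b</span></div>'
)
print(extract(page, rules))
```

The per-page cost then no longer depends on the model at all, at the price of the selectors breaking whenever the page layout changes.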
