by samwillis
1180 days ago
Scraping using LLMs directly is going to be really quite slow and resource intensive, but obviously quicker to get set up and going. I can see it being useful for quick ad hoc scrapes, but as soon as you need to scrape tens or hundreds of thousands of pages it will certainly be better to go the traditional route. Using an LLM to write your scrapers, though, is a perfect use case for them.

To put it somewhat in context, the two types of scrapers currently are traditional HTTP client based and headless browser based. Headless browsers are for more advanced sites: SPAs where there isn't any server-side rendering. However, headless browser scraping is on the order of 10-100x more time consuming and resource intensive, even with careful blocking of unneeded resources (images, CSS). Wherever possible you want to avoid headless scraping. LLMs are going to be even slower than that.

Fortunately, most sites that were client-side rendered only are moving back towards having a server renderer, and they often even have a JSON blob of template context in the HTML for hydration. Makes your job much easier!
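To make the hydration point concrete, here is a rough sketch of pulling that JSON blob out of a page with nothing but the Python standard library. The sample page, the `__DATA__` id, and the helper names (`JsonScriptParser`, `extract_json_blobs`) are all made up for illustration; real sites use their own ids and payload shapes, so treat this as a starting point rather than a recipe.

```python
import json
from html.parser import HTMLParser


class JsonScriptParser(HTMLParser):
    """Collects the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script bodies can arrive in several pieces, so buffer them.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.blobs.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # not valid JSON after all; skip it


def extract_json_blobs(html):
    """Return every JSON payload embedded in the page (hypothetical helper)."""
    parser = JsonScriptParser()
    parser.feed(html)
    parser.close()
    return parser.blobs


# Hypothetical server-rendered page with an embedded hydration payload.
SAMPLE_HTML = """
<html><body>
<div id="app"><h1>Blue Widget</h1></div>
<script id="__DATA__" type="application/json">
{"product": {"name": "Blue Widget", "price": 19.99}}
</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_json_blobs(SAMPLE_HTML))
```

If the blob is there, you get structured data from a plain HTTP fetch and never need to touch the rendered markup or a headless browser.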
|
I'm fast with Python scraping, but for scraping one page ChatGPT was way, way faster. The biggest difference is that it was quickly able to get the right links by context. The suit wasn't part of the link but was in the header. In code I'd have to find that context and make it explicit.
It's a super simple HTML site, but I'm not exactly sure which direction that tips the balance.
[0] http://www.learntarot.com/cards.htm
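For what "making the context explicit" might look like in code, here is a minimal sketch: remember the most recent heading and attach it to each link that follows. I haven't checked the real markup of the page above, so the sample HTML, the file names, and the `ContextLinkParser` / `links_with_context` names are assumptions for illustration only.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ContextLinkParser(HTMLParser):
    """Pairs every link with the text of the heading that precedes it."""

    def __init__(self):
        super().__init__()
        self.current_heading = None
        self._in_heading = False
        self._heading_chunks = []
        self._href = None
        self._link_chunks = []
        self.links = []  # (heading, link text, href) tuples

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._heading_chunks = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._link_chunks = []

    def handle_data(self, data):
        if self._in_heading:
            self._heading_chunks.append(data)
        if self._href is not None:
            self._link_chunks.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._in_heading = False
            self.current_heading = "".join(self._heading_chunks).strip()
        elif tag == "a" and self._href is not None:
            text = "".join(self._link_chunks).strip()
            self.links.append((self.current_heading, text, self._href))
            self._href = None


def links_with_context(html):
    """Return (heading, link text, href) for each link (hypothetical helper)."""
    parser = ContextLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup: the suit appears only in the heading, not in the link.
SAMPLE_HTML = """
<h2>Wands</h2>
<a href="w01.htm">Ace</a> <a href="w02.htm">Two</a>
<h2>Cups</h2>
<a href="c01.htm">Ace</a>
"""

if __name__ == "__main__":
    for suit, card, href in links_with_context(SAMPLE_HTML):
        print(suit, card, href)
```

This only works when the context sits in a heading tag directly before the links. If the suit lived in a table cell or an image alt attribute instead, the parser would need different rules, which is exactly the kind of per-site guesswork the LLM skipped.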