by puglr
1175 days ago
As someone who has been doing the same thing recently, here's how I got around the limitation that the page content has to be in the initial HTML. The first thing I did was fall back to a headless browser: let the page sit for 5 seconds to render, then grab the innerText. But 5-10% of sites do a good job of showing you the door for being a robot. I wanted to handle those cases by taking a screenshot of the page and using GPT-4 visual inputs, but when I got access I realized that 1) visual inputs aren't available yet and 2) holy crap is GPT-4 expensive. So instead I give a screenshot service the URL, get back a full-page PNG, and hand that off to GCP Cloud Vision to OCR it. The OCRed text then gets fed into GPT-3.5 like normal.
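
A minimal sketch of the fallback chain described above, in Python. It is not the commenter's actual code: the three callables (`render_inner_text`, `screenshot_png`, `ocr_png`) are hypothetical placeholders for the headless browser, the screenshot service, and the OCR call, and the `looks_blocked` heuristic with its marker strings and length threshold is an assumption that would need tuning against real sites.

```python
from typing import Callable, Optional, Tuple

# Assumed marker strings for bot walls; not taken from the comment.
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


def looks_blocked(text: str, min_chars: int = 200) -> bool:
    """Crude guess at whether the rendered page is a bot wall.

    Flags pages that are nearly empty or that contain a known marker.
    A long article that merely mentions "captcha" would be a false positive.
    """
    lowered = text.lower()
    too_short = len(text.strip()) < min_chars
    return too_short or any(marker in lowered for marker in BLOCK_MARKERS)


def extract_page_text(
    url: str,
    render_inner_text: Callable[[str], Optional[str]],  # headless browser step
    screenshot_png: Callable[[str], bytes],             # screenshot service step
    ocr_png: Callable[[bytes], str],                    # OCR step
) -> Tuple[str, str]:
    """Return (text, source), where source is "browser" or "ocr"."""
    text = render_inner_text(url)
    if text and not looks_blocked(text):
        return text, "browser"
    # The browser was turned away (or returned nothing): screenshot and OCR.
    png = screenshot_png(url)
    return ocr_png(png), "ocr"
```

Passing the three steps in as functions keeps the sketch independent of any particular browser, screenshot, or OCR provider; the returned text would then go to the language model as usual.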
My intuition is that the structural information in the HTML would be useful for extracting structured data.