HN Mirror


by puglr 1221 days ago

As someone who has been doing the same thing recently, here's how I got around the requirement that the page content be in the initial HTML.

The first thing I did was fall back to a headless browser. Let it sit for 5 seconds to let the page render, then snatch the innerText.
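A minimal sketch of that fallback, assuming Playwright as the headless browser (the comment does not say which one was used). The `min_chars` threshold and the helper names `visible_text`, `render_with_browser`, and `page_text` are hypothetical, introduced only for illustration:

```python
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Rough stand-in for innerText on the initial (unrendered) HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def render_with_browser(url: str, wait_ms: int = 5000) -> str:
    """Load the page headlessly, wait for it to render, return body innerText."""
    # Imported lazily so the rest of the sketch works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            page.wait_for_timeout(wait_ms)  # the "let it sit for 5 seconds" step
            return page.inner_text("body")
        finally:
            browser.close()


def page_text(
    url: str,
    fetch_html: Callable[[str], str],
    render: Callable[[str], str] = render_with_browser,
    min_chars: int = 200,  # assumed cutoff for "the initial HTML has real content"
) -> str:
    """Use the initial HTML when it has enough text, else fall back to the browser."""
    text = visible_text(fetch_html(url))
    if len(text) >= min_chars:
        return text
    return render(url)
```

Passing `fetch_html` and `render` in as callables keeps the decision logic separate from the network calls, so the cheap path and the browser path can be swapped or tested independently.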

But 5-10% of sites do a good job of showing you the door for being a robot.

I wanted to try to solve those cases by taking a screenshot of the page and using GPT-4 visual inputs, but when I got access I realized that 1) visual inputs aren't available yet and 2) holy crap is GPT-4 expensive.

So instead what I do is give a screenshot service the URL, get back a full-page PNG, then hand that off to GCP Cloud Vision to OCR it. The OCRed text then gets fed into GPT-3.5 like normal.
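A rough sketch of the screenshot-then-OCR step, assuming the `google-cloud-vision` Python client. The screenshot service is not named in the comment, so `take_screenshot` is a hypothetical callable standing in for it, and `text_via_screenshot` is an illustrative name rather than the author's code:

```python
from typing import Callable

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def ocr_with_cloud_vision(png_bytes: bytes) -> str:
    """OCR an image with GCP Cloud Vision and return the detected text."""
    # Imported lazily so the sketch can be exercised without GCP credentials.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=png_bytes))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def text_via_screenshot(
    url: str,
    take_screenshot: Callable[[str], bytes],  # hypothetical screenshot-service client
    ocr: Callable[[bytes], str] = ocr_with_cloud_vision,
) -> str:
    """URL -> full-page PNG -> OCR text, ready to be placed in a GPT-3.5 prompt."""
    png = take_screenshot(url)
    # Guard against the service handing back an error page instead of an image.
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot service did not return a PNG")
    return ocr(png).strip()
```

The returned string would then go into the prompt the same way the `innerText` output does.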

2 comments

geysersam 1220 days ago

I haven't tried this myself yet. But I'm surprised you didn't find it beneficial to pass the raw HTML to the chatbot (potentially after some filtering). Did `innerText` give better results than `innerHTML`?

My intuition is that the structural information in the HTML would be useful for extracting structured data.


puglr 1219 days ago

Great question. The problem with the raw HTML was token count. :)

A rather high percentage of pages are far too large for a GPT prompt!
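To make the token problem concrete, here is a back-of-the-envelope sketch. It assumes roughly 4 characters per token (a common rule of thumb for English text, not an exact count) and a 4096-token budget; both constants, the `reserved_tokens` default, and the helper names are assumptions for illustration. A real tokenizer would give exact numbers.

```python
import math
import re

CHARS_PER_TOKEN = 4          # rough rule of thumb, not an exact tokenizer
CONTEXT_LIMIT_TOKENS = 4096  # assumed prompt budget; depends on the model


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_prompt(text: str, reserved_tokens: int = 500) -> bool:
    """Check the estimate against the budget, leaving room for instructions and the reply."""
    return estimate_tokens(text) + reserved_tokens <= CONTEXT_LIMIT_TOKENS


def strip_tags(html: str) -> str:
    """Crude innerText approximation: drop scripts, styles, and tags, collapse whitespace."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return re.sub(r"\s+", " ", no_tags).strip()


# A synthetic listing page: 500 small cards with typical attribute noise.
html = '<div class="card card--wide" data-id="42"><span class="price">$10</span></div>' * 500
text = strip_tags(html)

print(estimate_tokens(html), fits_in_prompt(html))
print(estimate_tokens(text), fits_in_prompt(text))
```

On markup-heavy pages most of the characters are tags and attributes, which is why the text-only version can fit where the raw HTML does not.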


elendee 1219 days ago

why oh why


puglr 1210 days ago

Heh, mostly as an experiment. I'd done a fair bit of scraping for some personal football apps over the past few years. I was curious about how GPT might be used when starting from first principles, as well as its ability to solve specific challenges encountered with the traditional approach.
