by its_down_again 592 days ago
Screenshots aren't as accurate or context-rich as HTML, but they let you bypass the hassle of building logic for permissions and authentication across different apps to pull in text content for the LLM.

Can't you just make a browser extension to have access to the HTML and CSS, and use LLMs from that?
Context length and API cost are currently the main bottlenecks for huge HTML + CSS files. The extraction here is already quite efficient, but with past messages, the system prompt, sometimes extracted text, and extracted interactive elements, you are quickly at around 2500 tokens (about $0.01 with gpt-4o).

If you extract the entire HTML and CSS, your cost and inference time quickly go up 10x.

Aren't screenshots far larger than this?
Nope: a 1280x1024 screenshot at low resolution with gpt-4o is 85 tokens, so approx. $0.0002 (so 100x cheaper). For high resolution it's approx. $0.002. https://openai.com/api/pricing/
Yeah. I noticed a very low cost when I ran it in a VM with a predefined resolution. Good tip.
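
For anyone who wants to check the screenshot numbers in this exchange, here is a rough sketch of the arithmetic. It assumes the tile-based image token accounting OpenAI has documented for gpt-4o (a flat 85 tokens at low detail; at high detail, 85 base tokens plus 170 per 512px tile after downscaling) and an input price of $2.50 per 1M tokens. Both the price constant and the function names are assumptions for illustration, not something stated in the thread, and pricing changes over time.

```python
import math

# Assumption: USD per 1M gpt-4o input tokens. Not stated in the thread;
# check the pricing page for the current value.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 2.50


def image_tokens(width: int, height: int, detail: str = "low") -> int:
    """Estimate gpt-4o input tokens for one image (hypothetical helper)."""
    if detail == "low":
        # Low detail is a flat cost regardless of image size.
        return 85

    # High detail: first shrink to fit inside a 2048x2048 square.
    # Assumption: images are only scaled down, never up.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Then shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Count the 512px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def cost_usd(tokens: int) -> float:
    """Convert a token count to USD at the assumed input price."""
    return tokens * ASSUMED_USD_PER_MILLION_INPUT_TOKENS / 1_000_000


low = image_tokens(1280, 1024, "low")    # 85 tokens
high = image_tokens(1280, 1024, "high")  # 765 tokens (4 tiles)
print(f"low:  {low} tokens, ${cost_usd(low):.4f}")
print(f"high: {high} tokens, ${cost_usd(high):.4f}")
```

Under those assumptions, a 1280x1024 screenshot comes out at 85 tokens in low detail and 765 tokens in high detail, which lines up with the approximate dollar figures quoted above.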
I do this for my extension [0], but the HTML is often too large for the context window. I end up scraping just the relevant pieces before sending them to the LLM.

[0] https://chromewebstore.google.com/detail/namebrand-check-for...
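
As an illustration of the "scrape the relevant pieces first" idea, here is a minimal sketch using only the Python standard library. It is not how the extension above works (that isn't described here); it just drops script, style, and similar non-visible content, collapses whitespace, and caps the result at a character budget. `VisibleTextExtractor`, `reduce_html`, and the `max_chars` default are hypothetical names and values chosen for the example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping content that is never shown to the user."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def reduce_html(html: str, max_chars: int = 4000) -> str:
    """Return visible text from `html`, truncated to `max_chars` characters.

    The character cap is a crude stand-in for a token budget (assumption).
    """
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Hello   world</p></body></html>"
)
print(reduce_html(page))
```

A real extension would more likely do this in the page with DOM APIs and pick elements by relevance, but the same principle applies: send the model a small, text-only slice rather than the full HTML and CSS.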