Context length and API cost are currently the main bottleneck for huge HTML + CSS files. The extraction here is already quite efficient, but still:
with past messages + system prompt + sometimes extracted text + extracted interactive elements, you are quickly at around 2,500 tokens (about $0.01 with gpt-4o).
If you extract the entire HTML and CSS, your cost and inference time quickly go up 10x.
Nope:
A 1280x1024 image at low resolution with gpt-4o is 85 tokens, so approx. $0.0002 (so 100x cheaper). For high resolution it's approx. $0.002.
https://openai.com/api/pricing/
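For anyone who wants to check the arithmetic, here is a rough sketch of the image token math. It assumes the tiling rule from OpenAI's vision docs as I understand it (flat 85 tokens for low detail; for high detail, fit within 2048x2048, scale the shortest side down to 768, then 170 tokens per 512px tile plus 85) and an assumed input price of $2.50 per 1M tokens. Both the rule and the price can change, so treat `PRICE_PER_M_INPUT`, `image_tokens` and `cost_usd` as illustrative names, not an official calculator.

```python
import math

# Assumption: USD per 1M input tokens. Check the pricing page, this changes.
PRICE_PER_M_INPUT = 2.50


def image_tokens(width, height, detail="high"):
    """Estimate input tokens for one image (assumed gpt-4o tiling rule)."""
    if detail == "low":
        return 85  # flat cost, independent of image size
    # Step 1: shrink to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles; 170 tokens per tile plus a base of 85.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def cost_usd(tokens, price_per_m=PRICE_PER_M_INPUT):
    """Convert a token count to dollars at the assumed input price."""
    return tokens * price_per_m / 1_000_000


if __name__ == "__main__":
    for detail in ("low", "high"):
        t = image_tokens(1280, 1024, detail)
        print(detail, t, "tokens", f"${cost_usd(t):.4f}")
```

Under these assumptions a 1280x1024 screenshot comes out at 85 tokens in low detail and 765 tokens in high detail, which lines up with the numbers above.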
I do this for my extension [0], but the HTML is often too large for the context window. I end up scraping the relevant pieces before sending them to the LLM.
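To give an idea of what "scraping the relevant pieces" can look like, here is a minimal sketch using only Python's standard `html.parser`. This is not the extension's actual code; the tag list, the attribute whitelist and the names `InteractiveExtractor` / `extract_interactive` are my own assumptions about what is worth keeping.

```python
from html.parser import HTMLParser

# Assumption: these tags and attributes are what the LLM needs to act on a page.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = ("id", "name", "type", "href", "placeholder", "aria-label")


class InteractiveExtractor(HTMLParser):
    """Collect interactive elements and drop everything else."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element whose inner text is being collected

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        kept = {k: v for k, v in attrs if k in KEEP_ATTRS and v}
        entry = {"tag": tag, "attrs": kept, "text": ""}
        self.elements.append(entry)
        if tag != "input":  # <input> is a void element, it has no inner text
            self._open = entry

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data


def extract_interactive(html):
    """Return a compact list of dicts describing the interactive elements."""
    parser = InteractiveExtractor()
    parser.feed(html)
    parser.close()
    for element in parser.elements:
        # Collapse whitespace so the prompt stays small.
        element["text"] = " ".join(element["text"].split())
    return parser.elements


if __name__ == "__main__":
    page = '<div><p>Lots of text</p><a href="/x">Go</a><input name="q"></div>'
    for element in extract_interactive(page):
        print(element)
```

Serializing that list (e.g. as JSON) is usually a small fraction of the size of the raw HTML, at the price of losing layout and styling information.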