| HN Mirror



	by jaggirs 589 days ago
	Why not just give the html to the llm?

3 comments

zexodus 589 days ago

Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. Tokenization for HTML isn't optimal because symbols like '<', '>', '/', etc. end up being separate tokens, whereas whole words can be one token if we're talking about plain text.

Possible approaches include converting the HTML to Markdown or minimizing the HTML (e.g., removing script tags, comments, etc.).
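As a rough illustration of the "minimizing the HTML" idea, here is a minimal sketch using only Python's standard-library `html.parser`. The names `HTMLMinifier` and `minify_html` are made up for this example, and the choice to also drop `<style>` blocks and collapse whitespace is an assumption, not something specified in the thread.

```python
import re
from html.parser import HTMLParser


class HTMLMinifier(HTMLParser):
    """Re-emit HTML while dropping comments and <script>/<style> blocks.

    Hypothetical helper for illustration only.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped.
        if tag not in self.SKIP and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs (a simplification: this also affects <pre>).
            self.out.append(re.sub(r"\s+", " ", data))

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    # handle_comment is intentionally not overridden, so comments are dropped.


def minify_html(html_text):
    parser = HTMLMinifier()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = (
    '<html><head><script>var a = "<b>";</script><style>p{}</style></head>'
    "<body><!-- note --><p>Hello   <b>world</b> &amp; more</p></body></html>"
)
result = minify_html(sample)
print(result)
```

How many tokens this saves depends on the page and the tokenizer, so it is worth measuring on the actual input rather than assuming a fixed reduction.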


dtjohnnyb 589 days ago

I was trying to do this recently for web page summarization. As mentioned elsewhere in the thread, the token counts would end up over the context length, so I trimmed the HTML to fit just to see what would happen. I found that the LLM was able to extract information, but it would very commonly start trying to continue the HTML blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks.

I'd love to figure out a way to do it, though; it seems to me that the HTML contains a lot of rich description of the website.
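One possible way to avoid handing the model dangling open tags, sketched here as an assumption rather than anything tested in this thread, is to truncate at element boundaries and then close whatever is still open. The names `BalancedTruncator` and `truncate_html` are hypothetical, and the budget is counted in characters as a crude stand-in for tokens.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BalancedTruncator(HTMLParser):
    """Copy HTML until a character budget is hit, tracking open tags."""

    def __init__(self, max_chars):
        super().__init__(convert_charrefs=True)
        self.max_chars = max_chars
        self.out = []
        self.size = 0
        self.stack = []    # currently open (non-void) tags
        self.full = False  # set once the budget is exhausted

    def _emit(self, text):
        if self.full:
            return False
        if self.size + len(text) > self.max_chars:
            self.full = True
            return False
        self.out.append(text)
        self.size += len(text)
        return True

    def handle_starttag(self, tag, attrs):
        if self._emit(self.get_starttag_text()) and tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.full or tag not in self.stack:
            return
        # Close implicitly open tags down to and including this one.
        while self.stack:
            open_tag = self.stack.pop()
            closer = f"</{open_tag}>"
            self.out.append(closer)
            self.size += len(closer)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped, so re-escape it; a chunk that does not
        # fit is dropped whole rather than cut mid-sentence.
        self._emit(html.escape(data, quote=False))


def truncate_html(html_text, max_chars):
    parser = BalancedTruncator(max_chars)
    parser.feed(html_text)
    parser.close()
    # The final closing tags are appended on top of the budget.
    closers = "".join(f"</{tag}>" for tag in reversed(parser.stack))
    return "".join(parser.out) + closers


doc = "<div><p>one</p><p>two two two</p><p>three</p></div>"
print(truncate_html(doc, 20))
```

This sketch assumes scripts and styles have already been stripped, since re-escaping would alter their contents. Whether balanced input actually stops the model from continuing the markup would need to be checked empirically.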


kgeist 589 days ago

I remember a paper that found LLMs understand HTML pretty well, so you don't need additional preprocessing. The downside is that HTML produces more tokens than Markdown.


simonw 589 days ago

Right: the token savings can be enormous here.

Use https://tools.simonwillison.net/jina-reader to fetch the https://news.ycombinator.com/ homepage as Markdown and paste it into https://tools.simonwillison.net/claude-token-counter - 1550 tokens.

Same thing as HTML: 13367 tokens.
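To put those two counts side by side, a quick calculation using only the figures quoted above (the variable names are illustrative):

```python
markdown_tokens = 1550   # Markdown version, as reported above
html_tokens = 13367      # HTML version, as reported above

ratio = html_tokens / markdown_tokens
reduction = 1 - markdown_tokens / html_tokens

print(f"HTML uses {ratio:.1f}x as many tokens as Markdown")
print(f"Converting to Markdown cuts the token count by {reduction:.1%}")
```

These numbers are for one page and one tokenizer; other pages will differ.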
