| HN Mirror


by cpursley 592 days ago

This is really nice, especially for feeding LLMs web page data (they generally understand markdown well).

I built something similar for the Elixir world but it’s much more limited (I might borrow some of your ideas):

https://github.com/agoodway/html2markdown

2 comments

JohannesKauf 592 days ago

> built something similar for the Elixir

We interact with the web so much that it’s worth having such a library in every language... Great that you took the time and wrote one for the Elixir community!

Feel free to contact me if you want to ping-pong some ideas!

> feeding LLMs web page data

Exactly, that's one use case that got quite popular. There is also a feature for keeping specific HTML tags (e.g. <article> and <footer>) to give the LLM a bit more context about the page.


jaggirs 592 days ago

Why not just give the HTML to the LLM?


zexodus 592 days ago

Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. Tokenization for HTML isn't optimal because symbols like '<', '>', '/', etc. end up being separate tokens, whereas whole words can be one token if we're talking about plain text.
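To make the tokenization point concrete, here is a rough sketch. It does not use a real LLM tokenizer: `crude_pieces` is a hypothetical stand-in that splits text into word runs and individual symbols, which only approximates how markup characters tend to become separate tokens. Real BPE tokenizers will give different counts.

```python
import re

def crude_pieces(text: str) -> list[str]:
    """Split text into word runs and single non-space symbols.

    This is an illustrative approximation, NOT a real LLM tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)

html_snippet = '<p class="x">Hello world</p>'
plain_snippet = "Hello world"

print(len(crude_pieces(html_snippet)))   # 14 pieces, mostly markup symbols
print(len(crude_pieces(plain_snippet)))  # 2 pieces
```

Under this crude split, the same two words cost seven times as many pieces once they are wrapped in a single tag with one attribute.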

Possible approaches include converting the HTML to Markdown or stripping the HTML down (e.g., removing script tags, comments, etc.).
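The second approach could look something like the sketch below, which uses only Python's standard-library `html.parser`. The names `HTMLStripper` and `strip_html` are made up for this example, and the choice to drop only `<script>`, `<style>`, and comments is an assumption; a real scraper would likely remove more (inline styles, tracking attributes, and so on).

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> blocks and comments."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        # Preserve declarations such as <!DOCTYPE html>.
        self.parts.append(f"<!{decl}>")

    # handle_comment keeps its default no-op, so comments are dropped.

def strip_html(html: str) -> str:
    stripper = HTMLStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)

page = '<div><!-- note --><script>var a = "<b>";</script><p>Hi &amp; bye</p></div>'
print(strip_html(page))  # <div><p>Hi &amp; bye</p></div>
```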


dtjohnnyb 592 days ago

I was trying to do this recently for web page summarization. As mentioned elsewhere in the thread, the token counts would end up over the context length, so I trimmed the HTML to fit just to see what would happen. I found that the LLM was able to extract information, but it would very commonly start trying to continue the HTML blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks.

I'd love to figure out a way to do it, though. It seems to me that there's a lot of rich description of the website in the HTML.
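One untested idea for the dangling-tag problem is to close whatever elements are still open after trimming, so the model is not handed an unfinished block. The sketch below is only an illustration: `trim_and_close` and `OpenTagTracker` are hypothetical names, it trims by characters rather than tokens, and it does not handle every edge case (for example, a cut that lands inside a `<script>` body). Whether this actually stops a model from continuing the markup is an assumption, not something measured here.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class OpenTagTracker(HTMLParser):
    """Track which elements are still open at the end of the input."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

def trim_and_close(html: str, max_chars: int) -> str:
    trimmed = html[:max_chars]
    # If the cut landed inside a tag, drop the partial tag.
    if trimmed.rfind("<") > trimmed.rfind(">"):
        trimmed = trimmed[:trimmed.rfind("<")]
    tracker = OpenTagTracker()
    tracker.feed(trimmed)
    tracker.close()
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.stack))
    return trimmed + closers

page = "<div><p>Hello <b>world</b></p><p>more text</p></div>"
print(trim_and_close(page, 20))  # <div><p>Hello <b>wor</b></p></div>
```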


kgeist 592 days ago

I remember there was a paper which found that LLMs understand HTML pretty well, so you don't need additional preprocessing. The downside is that HTML produces more tokens than Markdown.


simonw 592 days ago

Right: the token savings can be enormous here.

Use https://tools.simonwillison.net/jina-reader to fetch the https://news.ycombinator.com/ homepage as Markdown and paste it into https://tools.simonwillison.net/claude-token-counter - 1550 tokens.

Same thing as HTML: 13367 tokens.
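For a sense of scale, the arithmetic on those two counts works out as follows. The `token_savings` helper is just a name for this illustration; the numbers are the ones quoted above.

```python
def token_savings(html_tokens: int, markdown_tokens: int) -> tuple[float, float]:
    """Return (how many times larger the HTML is, fraction of tokens saved)."""
    ratio = html_tokens / markdown_tokens
    saved = 1 - markdown_tokens / html_tokens
    return ratio, saved

ratio, saved = token_savings(html_tokens=13367, markdown_tokens=1550)
print(f"HTML is {ratio:.1f}x larger; Markdown saves {saved:.0%} of the tokens")
# HTML is 8.6x larger; Markdown saves 88% of the tokens
```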
