by cpursley 592 days ago
This is really nice, especially for feeding LLMs web page data (they generally understand markdown well).

I built something similar for the Elixir world but it’s much more limited (I might borrow some of your ideas):

https://github.com/agoodway/html2markdown


> built something similar for the Elixir

We interact with the web so much that it’s worth having such a library in every language... Great that you took the time and wrote one for the Elixir community!

Feel free to contact me if you want to ping-pong some ideas!

> feeding LLMs web page data

Exactly, that's one use case that got quite popular. There is also a feature for keeping specific HTML tags (e.g. <article> and <footer>) to give the LLM a bit more context about the page.

Why not just give the HTML to the LLM?
Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. Tokenization for HTML isn't optimal because symbols like '<', '>', '/', etc. end up being separate tokens, whereas whole words can be one token if we're talking about plain text.

Possible approaches include converting the HTML to Markdown or minimizing the HTML (e.g., removing script tags, comments, etc.).
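To make the "minimizing the HTML" option concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The class name `HTMLMinimizer`, the helper `minimize_html`, and the choice of which tags to drop (`SKIP_TAGS`) are my own assumptions for illustration, not something from the linked libraries. It drops comments and the contents of script/style/noscript, and collapses whitespace runs.

```python
import re
from html.parser import HTMLParser

# Tags whose whole contents are dropped (an assumption; adjust to taste).
SKIP_TAGS = {"script", "style", "noscript"}


class HTMLMinimizer(HTMLParser):
    """Re-emits HTML without comments, skipped tags, or extra whitespace."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Re-emit the tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, with no end tag.
        if self.skip_depth == 0 and tag not in SKIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse each run of whitespace into a single space.
            self.parts.append(re.sub(r"\s+", " ", data))

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")

    # handle_comment is not overridden, so comments are silently dropped.


def minimize_html(html):
    parser = HTMLMinimizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Something like `minimize_html(page_source)` would then be run before the page goes into the prompt. Attributes are kept as-is here; stripping things like `class` and `style` would probably save more tokens, at the cost of losing some context.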

I was trying to do this recently for web page summarization. As others have said, the token counts would end up over the context length, so I trimmed the HTML to fit just to see what would happen. I found that the LLM was able to extract information, but it would very commonly start trying to continue the HTML blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks.

I'd love to figure out a way to do it, though; it seems to me that there's a lot of rich description of the website in the HTML.
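One thing that might help with the dangling-tag problem is closing whatever tags the trim left open before sending the fragment to the model. This is an untested idea as far as LLM behavior goes, and the names `OpenTagTracker` and `close_open_tags` are hypothetical. The sketch only uses Python's standard-library `html.parser`, and it assumes the input has already been cut to length by some other means.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never "left open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Tracks which tags are still open at the end of a fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tag: nothing stays open

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore strays.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def close_open_tags(fragment):
    # Heuristic: if the cut landed inside a tag (e.g. '<div cla'), drop it.
    last_lt = fragment.rfind("<")
    if last_lt > fragment.rfind(">"):
        fragment = fragment[:last_lt]
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    # Close the remaining tags innermost-first.
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.stack))
    return fragment + closers
```

For example, `close_open_tags('<div><p>Hello <b>wor')` returns `'<div><p>Hello <b>wor</b></p></div>'`. Whether a balanced fragment actually stops the model from continuing the markup is something you would have to check empirically.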

I remember there was a paper which found that LLMs understand HTML pretty well, so you don't need additional preprocessing. The downside is that HTML produces more tokens than Markdown.
Right: the token savings can be enormous here.

Use https://tools.simonwillison.net/jina-reader to fetch the https://news.ycombinator.com/ homepage as Markdown and paste it into https://tools.simonwillison.net/claude-token-counter - 1550 tokens.

Same thing as HTML: 13367 tokens.
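To put those two counts in proportion, here is the arithmetic as a quick Python snippet. The numbers are just the two token counts quoted above; the variable names are mine.

```python
markdown_tokens = 1550   # homepage fetched as Markdown
html_tokens = 13367      # same page as raw HTML

# How many times larger the HTML version is.
ratio = html_tokens / markdown_tokens

# Fraction of tokens saved by using Markdown instead of HTML.
savings = 1 - markdown_tokens / html_tokens

print(f"HTML uses {ratio:.1f}x the tokens of Markdown")
print(f"Markdown saves {savings:.1%} of the tokens")
```

That works out to roughly 8.6x, or about 88% fewer tokens, for this one page. Other pages will differ depending on how much markup they carry.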