Context size limits are usually the reason. Most websites I want to scrape end up being over 200K tokens. HTML also tokenizes inefficiently: symbols like '<', '>', '/', etc. often end up as separate tokens, whereas in plain text a whole word can be a single token.
Possible approaches include transforming the text to MD or minimizing the HTML (e.g., removing script tags, comments, etc.).
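For the "minimize the HTML" route, here is a minimal sketch using only Python's standard-library `html.parser`. The names `minimize_html`, `SKIP_TAGS`, and `KEEP_ATTRS` are made up for this example, and which tags and attributes to drop is an assumption you would tune per site. It removes script/style blocks and comments, strips most attributes, and collapses whitespace.

```python
import html
import re
from html.parser import HTMLParser

# Assumed choices for this sketch: tune both sets for your pages.
SKIP_TAGS = {"script", "style", "noscript", "template"}  # dropped with their contents
KEEP_ATTRS = {"href", "src", "alt", "title"}              # every other attribute is removed
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Minimizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth == 0:
            kept = "".join(
                f' {name}="{html.escape(value, quote=True)}"'
                for name, value in attrs
                if name in KEEP_ATTRS and value
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            # Collapse whitespace runs; whitespace-only text nodes are dropped.
            text = re.sub(r"\s+", " ", data)
            self.parts.append(html.escape(text, quote=False))

    # Comments are dropped: the inherited handle_comment does nothing.


def minimize_html(source: str) -> str:
    parser = _Minimizer()
    parser.feed(source)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red}</style></head>\n"
    '<body><!-- nav --><p class="intro" id="a">Hello   '
    '<a href="/x" onclick="f()">world</a></p></body></html>'
)
print(minimize_html(sample))
```

This won't help with pages that are mostly inline JSON or deeply nested layout divs, but it is a cheap first pass before deciding whether converting to Markdown is worth it.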
I was trying to do this recently for web page summarization. As mentioned below, the token count exceeded the context length, so I trimmed the HTML to fit, just to see what would happen.
I found that the LLM was able to extract information, but it would very commonly start trying to continue the HTML blocks that had been left open in the trimmed input. Presumably this is due to instruction tuning on coding tasks.
I'd love to figure out a way to do it, though. It seems to me that the HTML holds a lot of rich description of the website.
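One idea that might reduce the "continue the open block" behaviour is to close whatever tags are still open after trimming, so the model sees a well-formed fragment. This is an untested guess on my part, not something I have measured. The sketch below uses Python's standard `html.parser`; `trim_and_close` is a made-up name, and `max_chars` is a character budget standing in for a real token budget.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _OpenTagTracker(HTMLParser):
    """Keeps a stack of the elements that are still open."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tag such as <br/>: nothing stays open

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def trim_and_close(source: str, max_chars: int) -> str:
    cut = source[:max_chars]
    # If the cut landed in the middle of a tag, drop the partial tag.
    # (Heuristic: a bare "<" inside text or a script can fool it.)
    last_open = cut.rfind("<")
    if last_open > cut.rfind(">"):
        cut = cut[:last_open]
    tracker = _OpenTagTracker()
    tracker.feed(cut)
    tracker.close()
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.stack))
    return cut + closers


page = "<div><p>Hello <b>bold</b> text</p><ul><li>one</li><li>two</li></ul></div>"
print(trim_and_close(page, 45))
```

The closing tags cost a few extra characters beyond `max_chars`, so leave some headroom in the budget if you try this.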
I remember a paper that found LLMs understand HTML pretty well, so you don't need additional preprocessing.
The downside is that HTML produces more tokens than Markdown.