Hacker News
by yawnxyz 655 days ago
I found that reducing HTML down to Markdown using turndown or https://github.com/romansky/dom-to-semantic-markdown works well.

If you want the AI to be able to select stuff, give it cheerio or jQuery access to navigate through the HTML document.

If you need to give tags, classes, and ids to the LLM, I use an html-to-pug converter like https://www.npmjs.com/package/html2pug, which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on Pug content though, so take this with a grain of salt.
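
To make the Pug point concrete, here is a rough sketch of why that form ends up shorter. It is a hand-rolled toy, not the html2pug package: `htmlToPugLike` is a made-up helper that assumes well-formed HTML with double-quoted attributes, keeps only tags, ids, classes, and text, and drops everything else (including other attributes such as `href`).

```javascript
// Toy illustration only: NOT html2pug and NOT a real HTML parser.
// Assumes well-formed markup, double-quoted attributes, no comments or scripts.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]);

function htmlToPugLike(html) {
  // Split the input into tags and runs of text.
  const tokens = html.match(/<\/?[^>]+>|[^<]+/g) || [];
  const lines = [];
  let depth = 0;

  for (const token of tokens) {
    if (token.startsWith("</")) {
      // Closing tag: Pug has none, so just step back one indent level.
      depth = Math.max(0, depth - 1);
    } else if (token.startsWith("<")) {
      const nameMatch = /^<([a-zA-Z][\w-]*)/.exec(token);
      if (!nameMatch) continue; // skip things like <!DOCTYPE html>
      const tag = nameMatch[1].toLowerCase();

      // Keep only id and class, written as CSS-style shorthand.
      const idMatch = /\sid="([^"]*)"/.exec(token);
      const classMatch = /\sclass="([^"]*)"/.exec(token);
      const id = idMatch ? "#" + idMatch[1] : "";
      const classes = classMatch
        ? classMatch[1].split(/\s+/).filter(Boolean).map((c) => "." + c).join("")
        : "";

      lines.push("  ".repeat(depth) + tag + id + classes);

      // Void and self-closing tags have no children, so do not indent after them.
      if (!VOID_TAGS.has(tag) && !token.endsWith("/>")) depth += 1;
    } else {
      // Text node: collapse whitespace and emit as Pug piped text.
      const text = token.replace(/\s+/g, " ").trim();
      if (text) lines.push("  ".repeat(depth) + "| " + text);
    }
  }
  return lines.join("\n");
}

const sampleHtml =
  '<div id="main" class="card big"><p class="lead">Hello</p><a href="/x">More</a></div>';
const samplePug = htmlToPugLike(sampleHtml);
console.log(samplePug);
// div#main.card.big
//   p.lead
//     | Hello
//   a
//     | More
console.log(sampleHtml.length, "chars ->", samplePug.length, "chars");
```

The savings come from dropping closing tags and attribute boilerplate. Character count is only a stand-in for token count here, so treat the comparison as a hint rather than a measurement.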

2 comments

Hmmm. That's interesting. I wish there was a Node-RED node for the first library (I can always import the library directly and build my own subflow, but since I have cheerio for Node-RED and use it for paring down input to LLMs already...)
But OP did an (admittedly flawed) test. Have you got anything to back up your claim here? We've all got our own hunches, but this post was an attempt to test those hypotheses.
haha, I haven't tested whether it's efficient or anything. I just put it together as a pipeline on val.town and I've been using it for parsing.

Shouldn't take more than 5 minutes to put together w/ Claude tbh