HN Mirror


	by mehulashah 668 days ago
	Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?

1 comment

Treesrule14 668 days ago

Plenty of web crawlers advertise turning a website into Markdown as their chief feature. I don't quite understand what that does for me, since I can just call something like `markdownify(my_html)` or whatever. So I wouldn't find this useful myself, but clearly people consider it a useful part of an LLM pipeline.

loa_in_ 668 days ago

You don't want the footer or navigation in the output. Ideally you want the main content of the page, if it exists. How do you assign heading levels when headings are differentiated only by CSS left-margin, in a variety of units? How do you interpret documents that render properly but are hardly valid HTML?
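To make the first point concrete, here is a rough sketch using only Python's standard-library `html.parser`. It drops anything inside `nav`, `footer`, `aside`, `script`, or `style` before emitting Markdown-ish text. The tag list, the class name `MainContentToMarkdown`, and the helper `html_to_markdown` are all assumptions made up for illustration, not how any particular crawler works. It does nothing for the harder cases, such as headings that exist only as CSS margins.

```python
from html.parser import HTMLParser

# Assumed heuristic: these elements rarely hold the main content.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}

# Markdown prefix for each block-level tag this sketch understands.
BLOCK_PREFIX = {
    "h1": "# ", "h2": "## ", "h3": "### ",
    "h4": "#### ", "h5": "##### ", "h6": "###### ",
    "p": "", "li": "- ",
}


class MainContentToMarkdown(HTMLParser):
    """Collects headings, paragraphs, and list items outside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.blocks = []      # finished Markdown blocks
        self.current = None   # (prefix, text parts) of the open block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.flush()
            self.current = (BLOCK_PREFIX[tag], [])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_PREFIX:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current is not None:
            self.current[1].append(data)

    def flush(self):
        """Close the open block, collapsing runs of whitespace."""
        if self.current is not None:
            prefix, parts = self.current
            text = " ".join("".join(parts).split())
            if text:
                self.blocks.append(prefix + text)
            self.current = None


def html_to_markdown(html):
    parser = MainContentToMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

An unclosed `<nav>` would make this swallow the rest of the page, which is exactly the "renders properly but is hardly valid HTML" problem.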

Treesrule14 668 days ago

Thanks. I guess none of that stuff seemed super useful to cut systematically, but I'm gonna run some tests.
