Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?
There are a lot of webcrawlers where the chief feature is turning the website into markdown, I don't quite understand what they are doing for me thats useful since I can just do something like `markdownify(my_html)` or whatever, all this to say is that I wouldn't find this useful, but also clearly people think this is a useful feature as part of an LLM pipeline.
You don't want the footer or navigation in the output. Ideally you want the main content of the page, if it exists. How do you assign header level if they're only differentiated by CSS left-margin in a variety of units? How do you interpret documents that render properly but are hardly correct HTML?