Completely agree. A friend and I tried something like this as a fun hackathon project. Getting to 80% wasn't difficult, just a lot of parsing the DOM for articles. The real pain was dealing with adverts, photo captions, comments, and other text that shouldn't be part of the actual article, especially since we needed to detect paragraph and subheader breaks so we could feed the parsed articles to text-to-speech.
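To give a feel for the "easy 80%", here is a rough sketch using only Python's standard `html.parser`. It is not what we actually wrote. The tag list, the class/id hints, and the names `ArticleExtractor` and `extract_blocks` are made up for illustration, and it assumes reasonably well-formed HTML where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Hypothetical heuristics; real sites need per-site tuning.
SKIP_TAGS = {"script", "style", "aside", "figcaption", "nav", "footer", "form"}
SKIP_HINTS = ("advert", "caption", "comment", "promo", "sidebar")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
BLOCK_TAGS = {"p": "paragraph", "h1": "heading", "h2": "heading", "h3": "heading"}


class ArticleExtractor(HTMLParser):
    """Collects (kind, text) blocks, dropping subtrees that look like boilerplate."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # list of ("heading" | "paragraph", text)
        self._skip_depth = 0   # > 0 while inside a skipped subtree
        self._kind = None      # kind of the block currently being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:   # void tags have no end tag, so ignore them
            return
        if self._skip_depth:
            self._skip_depth += 1
            return
        attr_text = " ".join(v or "" for k, v in attrs if k in ("class", "id")).lower()
        if tag in SKIP_TAGS or any(hint in attr_text for hint in SKIP_HINTS):
            self._skip_depth = 1
            return
        if tag in BLOCK_TAGS:
            self._kind = BLOCK_TAGS[tag]
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
            return
        if tag in BLOCK_TAGS and self._kind:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append((self._kind, text))
            self._kind = None

    def handle_data(self, data):
        if not self._skip_depth and self._kind:
            self._buf.append(data)


def extract_blocks(html):
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


SAMPLE = """
<article>
<h2>Headline</h2>
<p>First <b>real</b> paragraph.</p>
<div class="advert"><p>Buy now!</p></div>
<figure><img src="x.png"><figcaption>A photo caption</figcaption></figure>
<p>Second paragraph.</p>
<section id="comments"><p>Nice post</p></section>
</article>
"""

for kind, text in extract_blocks(SAMPLE):
    print(kind, "|", text)
```

Keeping the block kind next to the text is what lets a text-to-speech step pause differently at headings and paragraphs. The remaining 20% is everything these hints miss: sites with meaningless class names, captions inside plain `<p>` tags, or unclosed tags that throw off the depth counter.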
Good point, the constant (constant!) maintenance aspect means there would need to be a sustainable plan. On the other hand, if lots of projects started depending on the library, you'd at least get a steady supply of notifications about breakage, and perhaps fixes as well.