At least with HTML5 we have both a spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/) and a library that parses it (https://github.com/google/gumbo-parser).