The web, and the markup on it, is highly heterogeneous. We use the Readability Content API (full disclosure: I am also in charge of Readability) for the initial cleanup, so that arbitrary pages can be turned into the structured, validated data that AMP requires.