
by danans 2390 days ago

> I'm curious now: would it be possible to use the content/markup intended for use by the amp cache to view a static/unscripted/readable version of the page's main content? If so, why hasn't anyone built a browser extension to do so?

If the goal is to get around the AMP CDN, you don't even need to read the main page content. The AMP URL contains the original source URL itself [1].

The extension you are describing would just need to capture all requests with the prefix https://www.google.com/amp (or whatever CDN you are trying to get around), parse out the original URL, fetch it, and do what you will with it.
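The URL-parsing step could look roughly like the sketch below. It is written in Python for brevity (a real extension would do the same thing in JavaScript), and it assumes the viewer URL layout described in [1]: the path after /amp/ holds the origin host and path, with an optional "s/" segment marking an HTTPS origin. The function name `original_url_from_amp` is made up for illustration, and keeping the query string is a guess.

```python
from typing import Optional
from urllib.parse import urlsplit

# Prefix named in the comment above; other AMP CDNs would need their own rules.
AMP_VIEWER_HOST = "www.google.com"
AMP_VIEWER_PATH = "/amp/"


def original_url_from_amp(amp_url: str) -> Optional[str]:
    """Return the URL embedded in a Google AMP viewer URL, or None if it is not one."""
    parts = urlsplit(amp_url)
    if parts.netloc != AMP_VIEWER_HOST or not parts.path.startswith(AMP_VIEWER_PATH):
        return None

    rest = parts.path[len(AMP_VIEWER_PATH):]
    # Assumption based on [1]: a leading "s/" means the origin is served over HTTPS.
    if rest.startswith("s/"):
        scheme, rest = "https", rest[2:]
    else:
        scheme = "http"
    if not rest:
        return None

    url = f"{scheme}://{rest}"
    # Assumption: any query string belongs to the origin URL. Some viewer URLs
    # may carry parameters of their own, which this sketch does not filter out.
    if parts.query:
        url += "?" + parts.query
    return url


print(original_url_from_amp("https://www.google.com/amp/s/www.example.com/amp/doc.html"))
```

Note that what comes out is the URL on the publisher's own origin, which may still be the AMP version of the page rather than the canonical one.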

If the goal is to disable scripting on content delivered by the AMP CDN, first note that AMP pages can't contain page-author-written JS [2], and any implicit JS has to run async.

But if that's insufficient, you can disable JS in the browser altogether, which would disable it in the loaded AMP content.

You could also try to parse out the main content of the AMP page from your extension, if you know from the URL that it's an AMP page. Because AMP forces relative terseness and simplicity of HTML content, it is probably easier to parse than the original page's content. Obviously that won't generalize easily given the large variety of possible content representations, but you stand a better chance of achieving this with AMP content than with the original content.
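As a very rough illustration of that kind of extraction, here is a sketch using only Python's standard-library HTML parser. The tag lists and the names `MainTextExtractor` and `extract_main_text` are my own assumptions, not anything AMP defines, and it assumes text elements have explicit closing tags. It just collects text from headings, paragraphs and list items and ignores scripts and styles.

```python
from html.parser import HTMLParser
from typing import List

# Illustrative choices, not part of any AMP specification.
SKIP_TAGS = {"script", "style", "noscript", "template"}
TEXT_TAGS = {"h1", "h2", "h3", "p", "li", "blockquote"}


class MainTextExtractor(HTMLParser):
    """Collect the text of heading/paragraph-like elements, one string per block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._skip_depth = 0   # > 0 while inside a tag whose text is ignored
        self._text_depth = 0   # > 0 while inside a text-bearing tag
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in TEXT_TAGS:
            self._text_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in TEXT_TAGS and self._text_depth:
            self._text_depth -= 1
            if self._text_depth == 0:
                # Collapse runs of whitespace and emit the finished block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._text_depth and not self._skip_depth:
            self._buffer.append(data)


def extract_main_text(html: str) -> List[str]:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    '<html amp><head><style amp-custom>p{color:red}</style></head>'
    '<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>'
)
print(extract_main_text(sample))
```

Anything more ambitious (bylines, captions, images, deciding which blocks are the "main" content) is where the generalization problem starts to bite.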

And if you generalize it enough, you will end up with one component of a web crawl / indexing system in an extension ;)

1. https://blog.amp.dev/2017/02/06/whats-in-an-amp-url

2. https://amp.dev/about/how-amp-works/