by 10165
3314 days ago
I know there was a discussion just yesterday about how awful AMP is, but it is still useful, e.g., for reading WSJ articles:

  curl -o 1.htm https://www.wsj.com/amp/articles/the-quants-run-wall-street-now-1495389108
  sed -n '/./{/<title/,/<\/title/p;/<p>/,/<\/p>/p;}' 1.htm > 2.htm

FWIW, 2.htm has no AMP elements, no JavaScript, no images, no ads and no externally sourced resources, and therefore no tracking.

To add links to the non-essential images (as opposed to having the browser auto-load them), with captions where available:

  sed -n '
  /./{/div class=.image/,/<\/div/!d;s/ *//;}
  /src=/{s///;s/\"//g;s/.*/<a Href=&>&<\/a><br>/;}
  /alt=/{s///;s/[\">]//g;/./s/.*/<P>above: &<\/p>/;}
  /Href=/p;/<P>/p' 1.htm >> 2.htm
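
For anyone who wants to try the two filters without fetching the article, here is a minimal sketch that runs them on a made-up page. The file names sample.htm and sample_out.htm, the example.org URLs and the markup itself are all hypothetical; in particular this assumes each tag and each src=/alt= attribute sits on its own line, which is my guess at what the filters expect, not a copy of the real WSJ markup. The assumption matters because sed only starts looking for the end of a range on the line after the one that opened it, so a one-line `<p>...</p>` would open a range that stays open until some later `</p>`.

```shell
# Hypothetical stand-in for the downloaded AMP page (not real WSJ markup).
cat > sample.htm <<'EOF'
<html>
<head>
<title>
Sample headline
</title>
<script async src="https://cdn.example.org/v0.js"></script>
</head>
<body>
<div class="image-container">
  <amp-img
    src="https://example.org/a.jpg"
    alt="Traders at desks">
  </amp-img>
</div>
<p>
First paragraph.
</p>

<p>
Second paragraph.
</p>
</body>
</html>
EOF

# Filter 1: on non-empty lines, print only the <title> range and each <p> range.
sed -n '/./{/<title/,/<\/title/p;/<p>/,/<\/p>/p;}' sample.htm > sample_out.htm

# Filter 2: inside each image <div>, turn a src= line into a plain link and an
# alt= line into a caption, then append both to the output.
# The empty s/// reuses the last regex that was matched (src= or alt=).
sed -n '
/./{/div class=.image/,/<\/div/!d;s/ *//;}
/src=/{s///;s/\"//g;s/.*/<a Href=&>&<\/a><br>/;}
/alt=/{s///;s/[\">]//g;/./s/.*/<P>above: &<\/p>/;}
/Href=/p;/<P>/p' sample.htm >> sample_out.htm

cat sample_out.htm
```

On this sample the output should be the title and the two paragraphs, followed by one link line and one "above:" caption line, with the script and amp-img tags gone. The capitalised Href and `<P>` look like deliberate markers: the final print step matches only lines the filter itself rewrote, not the page's own lowercase href and `<p>`.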
|
|
I added some bare-minimum CSS to make it a little nicer to read. Full command (with in-place sed):