by lelandfe 660 days ago
This only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, fetch article text from an API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazy-load images, lazy-load text, etc.

DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.


It's possible to capture the DOM by running a headless browser (e.g. with chromedriver/geckodriver), allowing the JS to execute, and then saving the HTML.
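
A rough sketch of that idea in Python, assuming Selenium with a local Chrome/chromedriver. The helper name `capture_rendered_html`, the timeout values, and the output filename are made up for illustration. The helper only needs an object with `get()` and `execute_script()`, so the Selenium import is deferred to `main()`.

```python
import time


def capture_rendered_html(driver, url, timeout=10.0, poll=0.25):
    """Load `url` in an already-started WebDriver and return the live DOM as HTML."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    # Poll document.readyState until the page reports it has finished loading
    # (or we run out of time and take whatever has rendered so far).
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            break
        time.sleep(poll)
    # Serialize the DOM as it exists after scripts have run,
    # not the original response body.
    return driver.execute_script("return document.documentElement.outerHTML")


def main():
    # Assumes `pip install selenium` and a local Chrome install.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        html = capture_rendered_html(driver, "https://example.com")
        with open("page.html", "w", encoding="utf-8") as f:
            f.write(html)
    finally:
        driver.quit()
```

Calling `main()` would write the rendered markup to `page.html`. Note this only waits for the document to report `complete`; anything loaded later (on scroll, on a timer) will still be missing.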

If these readers do not use already rendered HTML to parse the information on the screen, then...

Indeed, Safari's reader already uses the rendered page, but even it fails on more esoteric pages that use e.g. lazy-loaded content (i.e. content that hasn't loaded because you haven't scrolled to it yet) or (god forbid) virtualized scrolling, which unloads content that is out of view.

It's a big web out there, and there's even more heinous stuff. Even identifying what the main content is can be a challenge.

And reader mode has the benefit of being run by the user. Identifying when to run a page-simplifying action on some headlessly loaded URL can be tricky. I imagine it would need to be like: load URL, await load event, scroll to bottom of page, wait for the network to be idle (and possibly for long tasks/animations to finish, too).
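
For what it's worth, that sequence might look something like this with Playwright's Python sync API (my choice here, mainly because it has a built-in "networkidle" load state). `load_fully`, `max_scrolls`, and `settle_ms` are hypothetical names and values, and this does not attempt the long tasks/animations part.

```python
def load_fully(page, url, max_scrolls=20, settle_ms=500):
    """Run load -> scroll -> network idle on a Playwright-style page; return its HTML."""
    # 1. Load the URL and wait for the window "load" event.
    page.goto(url, wait_until="load")

    # 2. Scroll to the bottom repeatedly. Lazy loaders often grow the page
    #    as you scroll, so stop once the document height stops changing.
    last_height = -1
    for _ in range(max_scrolls):
        height = page.evaluate("document.documentElement.scrollHeight")
        if height == last_height:
            break
        last_height = height
        page.evaluate("window.scrollTo(0, document.documentElement.scrollHeight)")
        page.wait_for_timeout(settle_ms)  # give lazy content a moment to start loading

    # 3. Wait for the network to go quiet before reading the DOM.
    page.wait_for_load_state("networkidle")
    return page.content()


def main():
    # Assumes `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        html = load_fully(page, "https://example.com")
        browser.close()
    print(len(html))
```

Pages that poll or stream forever may never reach "networkidle", in which case the wait times out and raises, so a real version would want to catch that. Virtualized scrolling still defeats this, since earlier content is dropped from the DOM as you scroll past it.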