by civilian 1575 days ago
I was hoping this tool also solved a problem that comes from saving & reproducing JS-framework-heavy websites.

Here's the bug: According to the HTML spec, elements like <h2> and <div> cannot be inside <a> tags. But using JS you _can_ push <div>s inside <a>s. (It happens through document.insert-type functions, and frameworks like Angular/React allow this.)

Look at nasa.gov, which has this HTML:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"><div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <!---->    <h2 class="headline"> ...
    </h2>
  </div>
  </a>
After running this through SingleFile you can see the change visually, and the HTML becomes:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"></a>
  <div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <h2 class="headline"> ...</h2>
Sites like the Wayback Machine handle this with the web-replay library Wombat (https://github.com/webrecorder/wombat), which also uses JS to insert those elements.

But what the hell! I was working on a similar HTML-downloading/reproducing tool and this bug really bothers me. I'd like either the HTML parsing standard to be updated to accept <div> inside of <a>, or that nesting to be made impossible via JS as well.
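
For a saving tool, one cheap safeguard is to scan the captured markup for the nesting described above before trusting a parse-and-reserialize round trip. Here is a minimal sketch using Python's standard `html.parser`, which reports tags in source order without rebuilding a tree. The names `BLOCK_TAGS`, `AnchorNestingFinder` and `find_blocks_in_anchors` are hypothetical, and the tag set is an illustrative assumption rather than anything taken from the spec or from SingleFile.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of block-level tags, not an exhaustive list.
BLOCK_TAGS = {"div", "h1", "h2", "h3", "h4", "h5", "h6", "p", "section", "ul", "ol", "table"}


class AnchorNestingFinder(HTMLParser):
    """Records block-level start tags that appear while an <a> is open."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # number of <a> tags currently open
        self.findings = []     # (tag, (line, column)) for each hit

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in BLOCK_TAGS and self.anchor_depth > 0:
            self.findings.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1


def find_blocks_in_anchors(markup):
    """Return the block-level tags found inside <a> elements, in source order."""
    finder = AnchorNestingFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings
```

Run against the markup captured from the live DOM and again against the saved file, a difference in the findings would point at the links whose contents were moved out during saving.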

1 comment

I think this issue could be circumvented by manipulating the page (replacing images, frames, CSS, etc.) in the tab itself (SingleFile does it in the background with a DOMParser instance). The trick is to avoid HTML parsing.