
by rodarima 36 days ago

> Browsers absolutely decode as much as they can, and if the file is corrupted halfway through you generally get garbling, not the entire image being replaced by "fuck off". The only case where that is so is if the browser can't parse anything at all, or can't retrieve the file.

What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say. Like this one:

https://html.spec.whatwg.org/multipage/parsing.html#adoption...

Perhaps a better example is a C program being compiled into an executable. You don't expect the compiler to guess what you meant while parsing.

The current expectation is that a web browser must load any broken HTML and still display what it can, and it is this expectation that I would like to change.

I don't propose humans to write this format directly (although it should be human readable), but compile it from something that is easy to write, like Markdown or a similar language. The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Having a context-free grammar allows simple and fast parsing tools that can process your document, in a similar way that you can query or manipulate a JSON file with tools like jq because the grammar is simple and strict.
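To make the jq comparison concrete, here is a minimal sketch in Python of what a strict grammar buys you: the parser either accepts the whole document or rejects it, and a query is then just a walk over the parsed tree. The `query` helper and the sample document are made up for illustration; only `json.loads` and `json.JSONDecodeError` come from the standard library.

```python
import json


def query(document: str, *path):
    """Parse strictly, then walk a key/index path, roughly like jq's `.a.b[0]`.

    `query` is a hypothetical helper for this sketch, not part of any tool.
    """
    node = json.loads(document)  # raises json.JSONDecodeError on any syntax error
    for key in path:
        node = node[key]
    return node


valid = '{"page": {"title": "Hello", "tags": ["a", "b"]}}'
title = query(valid, "page", "title")
first_tag = query(valid, "page", "tags", 0)

# A document with a missing closing brace is rejected outright, not repaired.
try:
    query('{"page": {"title": "Hello"}', "page", "title")
    rejected = False
except json.JSONDecodeError:
    rejected = True
```

There is no recovery path to specify or to keep compatible across implementations, which is what keeps this kind of tool small.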

2 comments

pkasting 36 days ago

On the contrary, image decoders all run complex processes that try and guess what to do in erroneous cases. I used to maintain Chrome's image decoders, and every single image format has "what the spec says" and then "what people actually do in practice"; you must handle the latter, and it is often very difficult to figure out how to do so. For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".
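The BMP heuristic described above can be sketched roughly as follows. This is not Chrome's actual code, only an illustration under simplifying assumptions: the pixel data is already extracted as tightly packed 4-byte pixels with the alpha (or padding) byte last, and the function name `alpha_is_meaningful` is invented for the example. A real decoder also has to deal with headers, row padding, and bitfield masks.

```python
def alpha_is_meaningful(pixels: bytes) -> bool:
    """Guess whether the 4th byte of each pixel is real alpha.

    Assumes tightly packed 4-byte pixels with the alpha/padding byte last.
    If every alpha byte is identical (all 0x00 or all 0xFF, say), the channel
    carries no information and the image is treated as having no alpha.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    alphas = pixels[3::4]  # every 4th byte, starting at the alpha position
    return len(set(alphas)) > 1


all_ff = bytes([10, 20, 30, 0xFF] * 4)                 # opaque, alpha unused
all_00 = bytes([10, 20, 30, 0x00] * 4)                 # padding left as zero
mixed = bytes([10, 20, 30, 0x00, 10, 20, 30, 0x80])    # alpha actually varies

results = (
    alpha_is_meaningful(all_ff),
    alpha_is_meaningful(all_00),
    alpha_is_meaningful(mixed),
)
```

Note that the answer is only known after looking at every pixel, which is why the full image sometimes has to be decoded before the format can be decided.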

I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

Hyrum's Law is a real jerk sometimes.


rodarima 35 days ago

> For BMPs, for example, determining whether the author intended 24-bit RGB or 32-bit RGBA sometimes requires decoding the full image and scanning to see whether any pixels' alpha bytes differ from the others, since "all 00" and "all FF" might both be "no alpha".

This causes a situation in which a page that renders the image in Chrome doesn't work in Firefox or other browsers that don't implement the same non-standard correcting algorithm. Worse, the user doesn't have a way to know from Chrome that the image is broken and it will likely continue to be broken forever.

Generalizing this approach, you end up having to test your site in every major browser to check whether you made a mistake that is only revealed in a browser that lacks that recovery mechanism.

> I also used to work on a production C compiler. Compilers can and do "guess what you meant" in various cases, notably for producing actual human-readable errors or proceeding past various warnings, but if I recall correctly even in more obscure non-error cases.

I don't think this addresses my point. When you write a C program, you expect the compiler to either recognize the program according to the C grammar, or reject it because it is not correct (hence the concept of "error"). It can then run whatever guessing algorithm it likes to report to you what may be wrong.

The programmer's expectation is that the program must strictly conform to the C grammar, and that errors get fixed by the programmer. The compiler does not silently produce a half-reconstructed program based on a guess at what you meant to say.


masklinn 36 days ago

> What I meant is that you don't expect PNG or JPEG images to be created in a way that the parser needs to run a complex process to reconstruct the bits that are broken and interpret what you meant to say.

So what you meant is neither what you wrote, nor what you advocate for?

Because in case you have forgotten, here is what you advocate for:

> Pages that don't conform with the specification won't be rendered.

That is not how image rendering in browsers works. That is how XHTML does not work.

> Perhaps a better example is a C program being compiled into an executable.

It's not a better example, because it's a completely different and unrelated use case: C programs are usually not dynamically generated, and even when they are, the person who compiles the code is usually either the person who wrote it or a person who has ways to fix it or report errors.

Not so when trying to read a random page on the web.

> I don't propose humans to write this format directly (although it should be human readable)

Approximately nobody wrote XHTML by hand; that didn't save it.

This is also a nonsensical constraint set on its face: there is no point to a human-readable format which is not human-writable.

> The objective is to enforce tools that make the transformation to produce a strictly conformant document.

Ah, an open and non-monopolizable format which can only be written via an official toolchain.

> Having a context-free grammar allows simple and fast parsing tools that can process your document in a similar way that you can query or manipulate a JSON^H^H^H^HXML file with tools like jq^H^Hxmlstarlet because the grammar is simple and strict.

None of which seems of any use to a format which pretends to human production and consumption. JSON is an interchange format between machines.


rodarima 36 days ago

> Pages that don't conform with the specification won't be rendered.

I stand by what I wrote here. They will fail with an error indicating where the mistake is, so that you can correct it (or, more likely, correct the tool that produced it).
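As a rough illustration of "fail with an error indicating where the mistake is", here is a sketch using an existing strict parser. XML stands in for the proposed format, which does not exist; the `check` helper is invented for the example, while `ET.fromstring` and `ET.ParseError.position` are standard-library Python.

```python
import xml.etree.ElementTree as ET


def check(document: str):
    """Return None if the document is well-formed, else (line, column) of the first error.

    `check` is a hypothetical helper; XML is only a stand-in for the strict format.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return err.position  # (line, column) reported by the underlying parser
    return None


ok = check("<doc><p>hello</p></doc>")
bad = check("<doc><p>hello</doc>")  # <p> is never closed

if bad is not None:
    line, column = bad
    print(f"rejected: error at line {line}, column {column}")
```

A markdown-to-foo tool could run a check like this on its own output before publishing, so the error reaches the tool's author rather than the reader.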

>> The objective is to enforce tools that make the transformation to produce a strictly conformant document.

> Ah, an open and non-monopolizable format which can only be written via an official toolchain.

The objective is that when you make a tool like markdown-to-foo, the output follows the spec. There is no mention of any "official toolchain".

> xmlstarlet

XML is strict. Try to find the same tool for HTML5, especially for transformations.

> JSON is an interchange format between machines.

That is pretty much what the specification would try to cover.


anthk 36 days ago

Instead of C, maybe Lisp, or Forth without messing up the stack.
