| HN Mirror



	by adityaathalye 95 days ago
	XML, JSON, YAML, RDF, EDN, LaTeX, OrgMode, Markdown... There are plenty of plaintext-but-structured information formats that are "yes, and": yes, I can process them as lines of plain text, and I can do structured data transformations on them too, and there are clients (or readers) that know how to render them WYSIWYG-style.

2 comments

dwb 95 days ago

If that’s our definition of “plain text”, sure. I would still rather our tools were more advanced, such that printable and non-printable formats were on a more equal footing, though. I always process structured formats through something that understands the structure, if I can, so I feel that the only benefit I regularly get out of formats being printable is that I have to use tools that only cope with printable formats. The argument starts getting a bit circular for me.

adityaathalye 94 days ago

Hm, you made me think about non-printing characters as metadata, which is of course immediately lost on printing and therefore does not round trip between digital and printed versions.

Many non-printing characters imply some directive: line break (hard-wrap the text here, but this is not a paragraph), page break (let the rest of the page be blank, start the next paragraph overleaf), EOF (file over, bye bye), non-breaking space (keep these two words together, always, till death do them part).

This is out-of-band information spliced in-band (with the text corpus), which a computer program can "see", but a person can't.
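
To make that concrete, here is a minimal Python sketch that lists the non-printing characters hiding in a string. The function name `reveal`, the `CONTROL_NAMES` table, and the sample text are made up for illustration; only the standard `unicodedata` module is assumed.

```python
import unicodedata

# Unicode gives C0 control characters no formal name, so label a few by hand.
CONTROL_NAMES = {"\n": "LINE FEED", "\f": "FORM FEED", "\r": "CARRIAGE RETURN"}

def reveal(text: str) -> list[str]:
    """List the non-printing or invisible characters found in text."""
    found = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Cc = control, Cf = format, Zl/Zp = line/paragraph separator,
        # Zs = space separator (the ordinary space is skipped).
        if cat in ("Cc", "Cf", "Zl", "Zp") or (cat == "Zs" and ch != " "):
            name = CONTROL_NAMES.get(ch) or unicodedata.name(ch, "UNNAMED")
            found.append(f"U+{ord(ch):04X} {name}")
    return found

# A non-breaking space, a page break (form feed), and a line break.
sample = "till\u00a0death\fnext page\n"
print(reveal(sample))
```

A program can list all three directives; a printout of `sample` shows none of them.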

zzo38computer 94 days ago

Yes, I thought of what you mentioned too, and in my opinion, DER is a better format, and it is a binary format rather than text.

(In my ideas of an operating system design, there is a structured binary format (similar to DER but different) used for most files and data, so that the tools (and the command shell) would be usable consistently with most of them; and if some need special handling, you can use other programs and functions to convert them and/or handle them in a way that can be interoperable.)

layer8 95 days ago

XML arguably isn’t plain text, but a binary format: If you add/change the encoding declaration on the first line, the remaining bytes will be interpreted differently. Unless you process it as a function of its declared (or auto-detected, see below) encoding, you have to treat it as a binary file.
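
A small sketch of that effect, using Python's standard `xml.etree.ElementTree`. The element name, the two-byte payload, and the helper `parse_with` are invented for illustration; the point is only that the same body bytes yield different text depending on the declared encoding.

```python
import xml.etree.ElementTree as ET

# The body bytes are identical in both documents; only the declaration differs.
BODY = b"<a>\xc3\xa9</a>"

def parse_with(declared: str) -> str:
    """Parse BODY behind an XML declaration naming the given encoding."""
    prolog = f'<?xml version="1.0" encoding="{declared}"?>'.encode("ascii")
    return ET.fromstring(prolog + BODY).text

as_utf8 = parse_with("UTF-8")         # C3 A9 read as one character
as_latin1 = parse_with("ISO-8859-1")  # C3 and A9 read as two characters
print(repr(as_utf8), repr(as_latin1))
```

Under UTF-8 the two bytes decode to a single "é"; under ISO-8859-1 they decode to "Ã©".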

In the absence of an encoding declaration, the encoding is in some cases detected automatically based on the first four bytes: https://www.w3.org/TR/xml/#sec-guessing-no-ext-info Again, that means that XML is a binary format.
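
Roughly how that sniffing could look, as a simplified Python sketch of the byte patterns listed in the linked appendix. The function name `sniff_xml_encoding` and the returned labels are my own, and the unusual UCS-4 byte orders are left out, so treat it as an approximation rather than a conforming detector.

```python
def sniff_xml_encoding(head: bytes) -> str:
    """Guess an encoding family from the first four bytes of an XML entity."""
    b = head[:4]
    # With a byte order mark. UCS-4 is checked before UTF-16 because
    # FF FE 00 00 also begins with the UTF-16 little-endian mark FF FE.
    boms = [
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian"),
        (b"\xfe\xff", "UTF-16 big-endian"),
        (b"\xff\xfe", "UTF-16 little-endian"),
        (b"\xef\xbb\xbf", "UTF-8"),
    ]
    for prefix, name in boms:
        if b.startswith(prefix):
            return name
    # Without a byte order mark: how "<?xm" looks in each encoding family.
    no_bom = {
        b"\x00\x00\x00<": "UCS-4 big-endian",
        b"<\x00\x00\x00": "UCS-4 little-endian",
        b"\x00<\x00?": "UTF-16 big-endian",
        b"<\x00?\x00": "UTF-16 little-endian",
        b"<?xm": "ASCII-compatible; read the encoding declaration",
        b"\x4c\x6f\xa7\x94": "EBCDIC; read the encoding declaration",
    }
    return no_bom.get(b, "UTF-8 (default, no declaration)")

print(sniff_xml_encoding("<?xml".encode("utf-16-le")))
```

Either way, the parser has to inspect raw bytes before it can treat anything as text.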

zzo38computer 94 days ago

Another way that the character encoding could be declared is ISO 2022. When using ISO 2022, the declaration of UTF-8 is <1B 25 47>, rather than the <EF BB BF> that XML and some other formats use.

However, whichever way you do it, I think the encoding declaration should never be omitted unless the text is purely ASCII, in which case it should be omitted.
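
A rough Python sketch of both markers and of the ASCII rule. The byte sequences are the ones quoted above; the names `split_marker` and `needs_declaration` are hypothetical helpers, not part of any standard or library.

```python
ISO2022_UTF8 = bytes([0x1B, 0x25, 0x47])  # ESC % G
UTF8_BOM = bytes([0xEF, 0xBB, 0xBF])      # UTF-8 byte order mark

def split_marker(data: bytes) -> tuple[str, bytes]:
    """Return which UTF-8 marker (if any) prefixes data, plus the remaining bytes."""
    for name, marker in (("iso2022", ISO2022_UTF8), ("bom", UTF8_BOM)):
        if data.startswith(marker):
            return name, data[len(marker):]
    return "none", data

def needs_declaration(payload: bytes) -> bool:
    """The rule above: declare the encoding unless the payload is pure ASCII."""
    return not payload.isascii()

text = "naïve".encode("utf-8")
print(split_marker(ISO2022_UTF8 + text))
print(needs_declaration(text), needs_declaration(b"plain"))
```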
