by cryptonector 1620 days ago
I maintain Heimdal[0]'s ASN.1 compiler[1], though I didn't create it. It's a pleasure. It, and the IETF, have taught me a few things:

- there's nothing really wrong with ASN.1 as a syntax except maybe it's ugly

- there's nothing wrong at all with ASN.1's semantics

- there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme

- you can create ASN.1 encoding rules for anything you like, which really means "use ASN.1 as the schema language for whatever encoding I prefer"

- indeed, there's XER (XML encoding rules), JER (JSON encoding rules), GSER (generic string encoding rules) -- all text-based -- and a bunch of binary encodings with at least two that are not tag-length-value (and so resemble NDR and XDR), like PER and OER

- people love to hate ASN.1, mainly because BER/DER/CER deserve the hatred, and for less legitimate reasons too, so they go off and invent new wheels that often have the same problems -- oh well!

  [0] https://github.com/heimdal/heimdal
  [1] https://github.com/heimdal/heimdal/tree/master/lib/asn1

In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seemed to explain what's wrong with it?

At least in file formats, it seems to me that they would be instrumental in having an extensible and flexible format, where you can skip unknown or uninteresting chunks (as with, say, PNG chunks, or IFF-based formats like OBJ, etc.).

Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

Sorry about the wall of questions, but I'm just so confused.

> In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seemed to explain what's wrong with it?

Not OP, but one of the challenges is that definite-length encodings like DER have to be encoded in a non-intuitive way. Each value must be encoded before its length can be written (the length is unknown until then), and values can be nested. Therefore you essentially have to encode a message backwards when using definite-length encodings. This can require a great deal of memory and can increase latency, because streaming the data is hard.

Indefinite lengths (BER has this option, CER requires it) can help avoid this problem, but then you lose the benefit of skipping elements (which you allude to in your next paragraph).
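
To make the definite vs. indefinite tradeoff concrete, here is a rough Python sketch. It assumes single-byte tags only, and the helper names (`der_length`, `tlv`, `indefinite`) are made up for illustration rather than taken from any ASN.1 library.

```python
def der_length(n: int) -> bytes:
    """Encode a definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag: int, content: bytes) -> bytes:
    """Definite-length TLV: the content must already exist to write its length."""
    return bytes([tag]) + der_length(len(content)) + content


def indefinite(tag: int, chunks) -> bytes:
    """BER indefinite-length form: header, content as it arrives, then 00 00."""
    out = bytearray([tag, 0x80])
    for chunk in chunks:              # each chunk is a complete inner TLV
        out += chunk
    out += b"\x00\x00"                # end-of-contents marker
    return bytes(out)


INTEGER, OCTET_STRING, SEQUENCE = 0x02, 0x04, 0x30

# The inner SEQUENCE has to be finished before the outer length is known.
inner = tlv(SEQUENCE, tlv(INTEGER, b"\x05") + tlv(OCTET_STRING, b"hi"))
outer = tlv(SEQUENCE, tlv(INTEGER, b"\x01") + inner)

# The indefinite form can emit its header first, at the cost of a terminator.
streamed = indefinite(SEQUENCE, [tlv(INTEGER, b"\x01"), inner])
```

In `outer`, the second byte (`0x0c`) cannot be written until everything inside it has been encoded. In `streamed`, the header goes out immediately, but a reader has to parse every inner element to find the terminating `00 00`, so it can no longer jump over the value.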

> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

You've hit the tradeoffs pretty well in the question, I think. The nice thing about TLV is that you can decode without a schema and potentially work with the contents: it's a relatively simple format to decode and validate even if it's not necessarily great for the encoder.
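
As a sketch of what decoding without a schema looks like, the following Python walks a definite-length TLV stream knowing nothing about the types involved. It assumes single-byte tags and definite lengths only; `read_tlv` and `dump` are illustrative names, not an existing API.

```python
def read_tlv(data: bytes, pos: int = 0):
    """Return (tag, value, next_pos) for one definite-length TLV at pos."""
    tag = data[pos]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags are not handled in this sketch")
    first = data[pos + 1]
    pos += 2
    if first < 0x80:                        # short form: length in one byte
        length = first
    elif first == 0x80:
        raise ValueError("indefinite length is not handled in this sketch")
    else:                                   # long form: next n bytes hold it
        n = first & 0x7F
        length = int.from_bytes(data[pos:pos + n], "big")
        pos += n
    end = pos + length
    if end > len(data):
        raise ValueError("truncated value")
    return tag, data[pos:end], end


def dump(data: bytes, depth: int = 0):
    """Walk nested TLVs, returning (depth, tag, value-or-None) tuples."""
    rows, pos = [], 0
    while pos < len(data):
        tag, value, pos = read_tlv(data, pos)
        constructed = bool(tag & 0x20)      # bit 6 marks a constructed value
        rows.append((depth, tag, None if constructed else value))
        if constructed:
            rows.extend(dump(value, depth + 1))
    return rows


message = bytes.fromhex("300c020101300702010504026869")
rows = dump(message)
```

Because `read_tlv` returns `next_pos`, a reader that does not care about an element can step over it without looking inside, which is the skipping property the question asks about.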

ASN.1 supports schema-informed packed encodings that place greater demands on both the encoder and decoder. The main advantage is that they greatly reduce message overhead, but they require a lot of bit-twiddling for presence/absence, default values, and, in unaligned variants, everything else, too. It's generally impossible to decode anything without the schema. PER has rules that disambiguate the values (e.g., they have to be ordered in a particular way, so you know what's coming next), and this mitigates some of the problems of TLV-style encodings.
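
To give a feel for the presence/absence bit-twiddling, here is a loose Python sketch in the spirit of an unaligned packed encoding. It is NOT conforming PER: the schema, field names, and bit widths are invented, and it only handles small non-negative integers.

```python
# Hypothetical schema: (field name, bit width, optional?)
SCHEMA = [("version", 4, False), ("priority", 3, True), ("ttl", 8, True)]


def encode(record: dict) -> bytes:
    bits = ""
    # Preamble: one presence bit per optional field, in schema order.
    for name, _, optional in SCHEMA:
        if optional:
            bits += "1" if name in record else "0"
    # Then the values that are present, with no tags and no lengths.
    for name, width, optional in SCHEMA:
        if not optional or name in record:
            bits += format(record[name], f"0{width}b")
    bits += "0" * (-len(bits) % 8)           # pad to a whole number of octets
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def decode(data: bytes) -> dict:
    bits = "".join(format(b, "08b") for b in data)
    pos = 0
    present = {}
    for name, _, optional in SCHEMA:         # read the presence preamble
        if optional:
            present[name] = bits[pos] == "1"
            pos += 1
    record = {}
    for name, width, optional in SCHEMA:     # the schema says what comes next
        if not optional or present[name]:
            record[name] = int(bits[pos:pos + width], 2)
            pos += width
    return record


packed = encode({"version": 2, "ttl": 64})
```

Here `packed` is two octets, and nothing in those octets says which fields they hold. Without `SCHEMA` the decoder cannot even find the field boundaries, which is the tradeoff described above.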

The tradeoffs are worth it when your pipes are small. 3GPP and LTE messages are largely encoded in PER. The people playing in that world usually have plenty of money to spend on commercial solutions and have bandwidth to roll their own, too. That's a bit different than smaller shops who are looking for convenient automated serialization formats.

I see lots of questions about TLV scheme problems. I should have listed them last night, indeed.

First, some generic problems with TLV encodings:

  - they necessarily produce redundant
    encodings -- this is wasteful bloat
  
  - that redundancy is of zero help to a compiler
  
  - that redundancy is a psychological crutch to
    any programmer writing hand-coded codecs, but
    this often has led to serious bugs
  
  - tag allocation has to be managed, and here
    again you really want a compiler to do it for
    you -- ASN.1 eventually added AUTOMATIC tags,
    but the damage of not having had those was
    done

Next, some problems specific to DER-like definite-length TLV encoding rules:

  - streaming encoding is infeasible -- you have
    to know the definite lengths before you
    start encoding, so you lose
  
  - you either have to compute the length of the
    encoding of any value before you begin
    encoding it, or you have to encode "back to
    front" (and then possibly realloc as needed)
    or both

There's more, but I'm not too familiar with the problems of CER-like indefinite-length encodings.
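
For the curious, "back to front" encoding can be sketched like this in Python. The `BackwardBuffer` class is hypothetical and only meant to show the shape of the technique: content is written first, toward the end of the buffer, and each header is prepended once the content length is known.

```python
class BackwardBuffer:
    """Fill a buffer from its end so each length is written after its content."""

    def __init__(self, size: int = 16):
        self.buf = bytearray(size)
        self.pos = size                      # index of the first used byte

    def prepend(self, data: bytes) -> None:
        while self.pos < len(data):          # out of room: grow at the front
            grow = max(len(self.buf), len(data))
            self.buf = bytearray(grow) + self.buf
            self.pos += grow
        self.pos -= len(data)
        self.buf[self.pos:self.pos + len(data)] = data

    def used(self) -> int:
        return len(self.buf) - self.pos

    def header(self, tag: int, content_len: int) -> None:
        if content_len < 0x80:
            length = bytes([content_len])
        else:
            body = content_len.to_bytes((content_len.bit_length() + 7) // 8, "big")
            length = bytes([0x80 | len(body)]) + body
        self.prepend(bytes([tag]) + length)

    def getvalue(self) -> bytes:
        return bytes(self.buf[self.pos:])


# SEQUENCE { INTEGER 1, SEQUENCE { INTEGER 5, OCTET STRING "hi" } },
# emitted last field first.
out = BackwardBuffer(size=4)                 # deliberately small: forces growth
mark = out.used()
out.prepend(b"hi")
out.header(0x04, 2)
out.prepend(b"\x05")
out.header(0x02, 1)
out.header(0x30, out.used() - mark)          # inner SEQUENCE header
out.prepend(b"\x01")
out.header(0x02, 1)
out.header(0x30, out.used() - mark)          # outer SEQUENCE header
encoded = out.getvalue()
```

Nothing can be sent until the outermost header has been written, and the growth step inside `prepend` is the "realloc as needed" part.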

Bottom line: TLV is an unnecessary crutch. Compilers simply don't need it. For proof by existence, consider that Sun's rpcgen(1) existed in 1986, a mere two years after ASN.1's 1984 standard, and rpcgen(1) uses XDR syntax and encoding -- XDR is NOT a TLV encoding at all. But ASN.1 tooling -- proprietary and open source -- took much longer to catch up with XDR and IDL/NDR and other things. It's almost as if TLV encodings made it harder to get to compilation because they were a crutch for hand-coding codecs. But even XDR is easy to hand-write codecs for!
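
As an illustration of how little a hand-written XDR codec needs, here is a Python sketch for a made-up `struct user { unsigned int uid; string name<>; }`. The struct and function names are hypothetical; the wire rules used (4-byte big-endian integers, length-prefixed strings padded to a multiple of four) are the standard XDR ones.

```python
import struct


def xdr_uint(n: int) -> bytes:
    """Unsigned int: four bytes, big-endian."""
    return struct.pack(">I", n)


def xdr_string(s: bytes) -> bytes:
    """String: length, bytes, then zero padding up to a multiple of four."""
    return xdr_uint(len(s)) + s + b"\x00" * (-len(s) % 4)


def encode_user(uid: int, name: bytes) -> bytes:
    # Fields go out in declaration order; there are no tags at all.
    return xdr_uint(uid) + xdr_string(name)


def decode_user(data: bytes):
    uid, n = struct.unpack_from(">II", data, 0)
    return uid, data[8:8 + n]


wire = encode_user(1000, b"alice")
```

The encoder never has to look ahead or go back: each field is written in order, and the only length on the wire belongs to the variable-length leaf value.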

BTW, XDR and NDR were basically the first flatbuffers-like encodings. Lustre RPC has an even more flatbuffers-like encoding, but it's hand-coded. There's just nothing new in this space, and there hasn't really been anything new in this space in many years.

> At least in file formats, it seems to me that they would be instrumental in having an extensible and flexible format, where you can skip unknown or uninteresting chunks (as with, say, PNG chunks, or IFF-based formats like OBJ, etc.).

TLV is NOT necessary for this sort of extensibility. You naturally end up with something like TLV when using non-TLV encodings with support for extensibility, though it's often more like LTV. Let's say you have a struct you want to make extensible in some non-TLV encoding you're designing... What would you do? Well, knowing ASN.1's PER/OER and knowing how we've dealt with this in XDR I would do this: add an octet string field to the end of every extensible struct! What would that octet string contain? The encoding of the extensions. What if you want to support different kinds of extensions in a mix-and-match way? Well, that's easy too: add a discriminated union or "typed hole" to the end of every extensible struct, with every choice taken having a Length prepended to it so you can skip it.
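
A small Python sketch of the "octet string at the end of every extensible struct" idea, using an XDR-style layout. The `point` struct, its versions, and all function names are invented for illustration.

```python
import struct


def opaque(data: bytes) -> bytes:
    """Length-prefixed octet string, padded to a multiple of four (XDR style)."""
    return struct.pack(">I", len(data)) + data + b"\x00" * (-len(data) % 4)


def read_opaque(buf: bytes, pos: int):
    """Return (octets, next_pos) for a length-prefixed octet string at pos."""
    (n,) = struct.unpack_from(">I", buf, pos)
    start = pos + 4
    return buf[start:start + n], start + n + (-n % 4)


def encode_point_v1(x: int, y: int) -> bytes:
    return struct.pack(">II", x, y) + opaque(b"")     # empty extension slot


def encode_point_v2(x: int, y: int, z: int) -> bytes:
    ext = struct.pack(">I", z)                        # v2 adds one field
    return struct.pack(">II", x, y) + opaque(ext)


def decode_point_v1(buf: bytes, pos: int = 0):
    """A v1 reader: knows x and y, skips whatever the extension slot holds."""
    x, y = struct.unpack_from(">II", buf, pos)
    _unknown, pos = read_opaque(buf, pos + 8)
    return (x, y), pos


def decode_point_v2(buf: bytes, pos: int = 0):
    x, y = struct.unpack_from(">II", buf, pos)
    ext, pos = read_opaque(buf, pos + 8)
    z = struct.unpack_from(">I", ext)[0] if len(ext) >= 4 else None
    return (x, y, z), pos


# Two v2 points back to back, read by a v1 decoder that has never heard of z.
wire = encode_point_v2(1, 2, 3) + encode_point_v2(4, 5, 6)
first, pos = decode_point_v1(wire)
second, pos = decode_point_v1(wire, pos)
```

The only length on the wire is the one in front of the extension octets, so the known fields stay tag-free while an old reader can still step over what it does not understand.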

Extensibility is something that has been beaten to death in the ASN.1 space, and it has all of these options:

- extensibility markers in SEQUENCE / SET types (i.e., "struct" types)

- extensibility markers in CHOICE types (i.e., discriminated union types)

- extensibility markers in INTEGER and BIT STRING constraints (i.e., enum types)

- rules for handling known and unknown extensions in each ER (encoding rules)

- typed holes.

A typed hole is just a glorified discriminated union with an "external" sort of discriminant and specification of the union arms' types. Basically, a typed hole is just a struct with two fields: a) a type identifier of some sort (an integer, a string, an OID, a relative OID, whatever), b) an octet string containing an encoding of the value of a type identified by (a).

ASN.1 has syntax and semantics for expressing what type IDs go with what types, and so you can actually have compilers that recursively and automatically decode/encode through typed holes.
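
Here is a minimal Python sketch of a typed hole as described above: a type identifier followed by a length-prefixed octet string, plus a table saying which identifier goes with which type. The integer identifiers, the `REGISTRY` table, and the wire layout are all assumptions made for the example; in ASN.1 that table would come from the schema rather than be written by hand.

```python
import struct

# Hypothetical registry: type identifier -> (encoder, decoder) for that type.
REGISTRY = {
    1: (lambda v: struct.pack(">i", v), lambda b: struct.unpack(">i", b)[0]),
    2: (lambda v: v.encode("utf-8"), lambda b: b.decode("utf-8")),
}


def encode_hole(type_id: int, value) -> bytes:
    payload = REGISTRY[type_id][0](value)
    return struct.pack(">II", type_id, len(payload)) + payload


def decode_hole(buf: bytes, pos: int = 0):
    """Return (type_id, value, next_pos); unknown types keep their raw octets."""
    type_id, n = struct.unpack_from(">II", buf, pos)
    payload = buf[pos + 8:pos + 8 + n]
    codec = REGISTRY.get(type_id)
    # The length lets us skip a payload even when the identifier is unknown.
    value = codec[1](payload) if codec else payload
    return type_id, value, pos + 8 + n


unknown = struct.pack(">II", 99, 3) + b"abc"          # identifier not in REGISTRY
wire = encode_hole(1, -7) + unknown + encode_hole(2, "hi")

decoded, pos = [], 0
while pos < len(wire):
    type_id, value, pos = decode_hole(wire, pos)
    decoded.append((type_id, value))
```

A compiler that knows the identifier-to-type mapping can generate the equivalent of `REGISTRY` and decode through the hole automatically, which is the point made above.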

> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

I address this above. This is all addressed in ASN.1 (and also in XML, because of XMLNS). Many very smart people who came before you and me saw to it that ASN.1 addressed all these issues definitively long ago.

Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?

[1]: https://www-sop.inria.fr/rodeo/mavros/intro-mav.html Search for "Rose"

> Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?

My guess is that his complaint is that MACRO semantics are not well defined and are challenging to parse with conventional compilers. I've always wondered if they were inspired in some part by LISP, since you could in principle translate them fairly readily. ROSE and SNMP are still relatively commonly-used specifications that embed macro definitions, and most of the work I've seen done with them involves actually hard-coding the output (rather than actually parsing the MACROs).

Thanks for the link. Marshall Rose apparently has complaints about the ASN.1 MACRO facility.

You don't need ASN.1 MACROs for anything in Internet protocols, and you can do without them more generally anyway.

I take it that means you have no idea what Marshall was talking about.

Rose was talking about a feature of ASN.1 (MACROs) that was removed and replaced with the Information Object System (X.681, X.682, X.683).

I'm not that familiar with the ASN.1 MACRO facility, no, because, after all, it's gone and replaced. My understanding is that the problems Rose identified led to the MACRO system being replaced -- good!

So yeah, I don't really know what Rose was talking about, but I do know plenty about the X.681/682/683 specs since I've implemented a subset of them.

> - there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme

I would like to hear more about what's wrong with tag-length-value schemes. Can these be corrected, or would you advocate for alternatives? Which alternatives?