Hacker News
The specs behind the specs – a deep-dive on ASN.1 (engineering.wgtwo.com)
82 points by torotime 1620 days ago
9 comments

I maintain Heimdal[0]'s ASN.1 compiler[1], though I didn't create it. It's a pleasure. It, and the IETF, have taught me a few things:

- there's nothing really wrong with ASN.1 as a syntax except maybe it's ugly

- there's nothing wrong at all with ASN.1's semantics

- there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme

- you can create ASN.1 encoding rules for anything you like, which really means "use ASN.1 as the schema language for whatever encoding I prefer"

- indeed, there's XER (XML encoding rules), JER (JSON encoding rules), GSER (generic string encoding rules) -- all text-based -- and a bunch of binary encodings with at least two that are not tag-length-value (and so resemble NDR and XDR), like PER and OER

- people love to hate ASN.1, mainly because BER/DER/CER deserve the hatred, and for less legitimate reasons too, so they go off and invent new wheels that often have the same problems -- oh well!

  [0] https://github.com/heimdal/heimdal
  [1] https://github.com/heimdal/heimdal/tree/master/lib/asn1

In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seem to explain what's wrong with it?

At least in file formats, it seems to me they would be instrumental to having an extensible and flexible format, where you can skip unknown or uninteresting chunks (in say, PNG chunks, or IFF-based formats like OBJ, etc.).

Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

Sorry about the wall of questions, but I'm just so confused.

> In the asn1 readme, and in some comments in these threads, you mention the perils of the tag-length-value scheme, but you never seem to explain what's wrong with it?

Not OP, but one of the challenges is that definite-length encodings like DER have to be encoded in a non-intuitive way. Values must be encoded prior to lengths (because the length is unknown), and the values can be nested. Therefore you have to encode a message essentially backwards when using definite-length encodings. This can potentially require a great deal of memory and can increase latency because streaming the data is hard.
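
To make the "backwards" point concrete, here is a minimal Python sketch of a definite-length encoder. It is my own illustration, not code from any ASN.1 library: the helper names are made up, and it only handles single-octet tags and non-negative integers. The thing to notice is that `der_sequence` cannot emit its header until every child has been fully encoded.

```python
def der_length(n: int) -> bytes:
    """Definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_tlv(tag: int, content: bytes) -> bytes:
    """Prefix already-encoded content octets with a tag and their length."""
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(n: int) -> bytes:
    """Non-negative INTEGER (tag 0x02) in minimal two's complement."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return der_tlv(0x02, n.to_bytes(n.bit_length() // 8 + 1, "big"))


def der_sequence(*children: bytes) -> bytes:
    """SEQUENCE (tag 0x30). Every child must be fully encoded first,
    because the outer length octets precede the children on the wire."""
    return der_tlv(0x30, b"".join(children))


# Innermost values are encoded first, the outermost header last.
inner = der_sequence(der_integer(1), der_integer(300))
outer = der_sequence(inner, der_integer(5))
print(outer.hex())
```

With deep nesting or large values, all of the inner encodings have to sit in memory before the first octet of the outer element can be sent.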

Indefinite lengths (BER has this option, CER requires it) can help avoid this problem, but then you lose the benefit of skipping elements (which you allude to in your next paragraph).

> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

You've hit the tradeoffs pretty well in the question, I think. The nice thing about TLV is that you can decode without a schema and potentially work with the contents: it's a relatively simple format to decode and validate even if it's not necessarily great for the encoder.
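
As a rough illustration of schema-less decoding, the Python sketch below walks a definite-length TLV byte string and prints its structure without knowing what any field means. It is an assumption-laden toy: the function names are mine, and it rejects multi-octet tags and indefinite lengths rather than handling them.

```python
def read_tlv(buf: bytes, pos: int):
    """Return (tag, content, next_pos) for the element starting at pos."""
    if pos + 2 > len(buf):
        raise ValueError("truncated header")
    tag, first = buf[pos], buf[pos + 1]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-octet tags are not handled in this sketch")
    pos += 2
    if first < 0x80:                      # short form: the octet is the length
        length = first
    elif first == 0x80:
        raise ValueError("indefinite length is not handled in this sketch")
    else:                                 # long form: low 7 bits count length octets
        n = first & 0x7F
        if pos + n > len(buf):
            raise ValueError("truncated length")
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated content")
    return tag, buf[pos:end], end


def walk(buf: bytes, depth: int = 0):
    """Yield (depth, tag, content) for each element; content is None
    for constructed elements, whose children are yielded next."""
    pos = 0
    while pos < len(buf):
        tag, content, pos = read_tlv(buf, pos)
        constructed = bool(tag & 0x20)    # bit 6 marks a constructed encoding
        yield depth, tag, None if constructed else content
        if constructed:
            yield from walk(content, depth + 1)


message = bytes.fromhex("300c30070201010202012c020105")
for depth, tag, content in walk(message):
    print("  " * depth + f"tag=0x{tag:02x}", "" if content is None else content.hex())
```

Skipping an element you do not care about is just `pos = end`, which is the property the question was asking about.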

ASN.1 supports schema-informed packed encodings that place greater demands on both the encoder and decoder. The main advantage is that they greatly reduce message overhead, but it requires a lot of bit-twiddling for presence/absence, default values, and, in unaligned variants, everything else, too. It's impossible, generally, to decode everything without the schema. PER has rules that disambiguate the values (e.g., they have to be ordered in a particular way, so you know what's coming next), and this mitigates some of the problems of TLV-style encodings.
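
To give a feel for the presence/absence bookkeeping, here is a deliberately simplified Python sketch. It is NOT PER: the schema, the field names, and the one-octet-per-field layout are all hypothetical. It only shows the general idea that the schema fixes the field order, a leading bitmap says which optional fields are present, and nothing on the wire identifies the fields themselves.

```python
# Hypothetical schema: (field name, is_optional). Every value is one octet.
SCHEMA = [("version", False), ("ttl", True), ("priority", True)]
OPTIONAL = [name for name, optional in SCHEMA if optional]


def encode(record: dict) -> bytes:
    """One bitmap octet (first optional field in the highest used bit),
    then the present values in schema order, with no tags or lengths."""
    bitmap = 0
    for name in OPTIONAL:
        bitmap = (bitmap << 1) | int(name in record)
    out = bytearray([bitmap])
    for name, optional in SCHEMA:
        if not optional or name in record:
            out.append(record[name])
    return bytes(out)


def decode(data: bytes) -> dict:
    """Decoding is only possible with the same schema in hand."""
    bitmap = data[0]
    present = {
        name: bool((bitmap >> (len(OPTIONAL) - 1 - i)) & 1)
        for i, name in enumerate(OPTIONAL)
    }
    record, pos = {}, 1
    for name, optional in SCHEMA:
        if not optional or present[name]:
            record[name] = data[pos]
            pos += 1
    return record


wire = encode({"version": 2, "priority": 7})
print(wire.hex(), decode(wire))
```

Without `SCHEMA`, those three octets are meaningless, which is the flip side of the reduced overhead.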

The tradeoffs are worth it when your pipes are small. 3GPP and LTE messages are largely encoded in PER. The people playing in that world usually have plenty of money to spend on commercial solutions and have bandwidth to roll their own, too. That's a bit different than smaller shops who are looking for convenient automated serialization formats.

I see lots of questions about TLV scheme problems. I should have listed them last night, indeed.

First, some generic problems with TLV encodings:

  - they necessarily result in unnecessarily
    redundant encodings -- this is wasteful, bloat
  
  - that redundancy is of zero help to a compiler
  
  - that redundancy is a psychological crutch to
    any programmer writing hand-coded codecs, but
    this often has led to serious bugs
  
  - tag allocation has to be managed, and here
    again you really want a compiler to do it for
    you -- ASN.1 eventually added AUTOMATIC tags,
    but the damage of not having had those was
    done

Next, some problems specific to DER-like definite-length TLV encoding rules:

  - streaming encoding is infeasible -- you have
    to know the definite lengths before you
    start encoding, so you lose
  
  - you either have to compute the length of the
    encoding of any value before you begin
    encoding it, or you have to encode "back to
    front" (and then possibly realloc as needed)
    or both

There's more, but I'm not too familiar with the issues around CER-like indefinite-length encodings.

Bottom-line: TLV is an unnecessary crutch. Compilers simply don't need it. For proof by existence consider that Sun's rpcgen(1) existed in 1986, a mere two years after ASN.1's 1984 standard, and rpcgen(1) uses XDR syntax and encoding -- XDR is NOT a TLV encoding at all. But ASN.1 tooling -proprietary and open source- took much longer to catch up with XDR and IDL/NDR and other things. It's almost like TLV encodings made it harder to get to compilation because they were a crutch for hand-coding codecs. But even XDR is easy to hand-write codecs for!

BTW, XDR and NDR were basically the first flatbuffers-like encodings. Lustre RPC has an even more flatbuffers-like encoding, but it's hand-coded. There's just nothing new in this space, and there hasn't really been anything new in this space in many years.

> At least in file formats, it seems to me they would be instrumental to having an extensible and flexible format, where you can skip unknown or uninteresting chunks (in say, PNG chunks, or IFF-based formats like OBJ, etc.).

TLV is NOT necessary for this sort of extensibility. You naturally end up with something like TLV when using non-TLV encodings with support for extensibility, though it's often more like LTV. Let's say you have a struct you want to make extensible in some non-TLV encoding you're designing... What would you do? Well, knowing ASN.1's PER/OER and knowing how we've dealt with this in XDR I would do this: add an octet string field to the end of every extensible struct! What would that octet string contain? The encoding of the extensions. What if you want to support different kinds of extensions in a mix-and-match way? Well, that's easy too: add a discriminated union or "typed hole" to the end of every extensible struct, with every choice taken having a Length prepended to it so you can skip it.
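
A minimal Python sketch of the "octet string at the end of every extensible struct" idea follows. The record layout is hypothetical (two big-endian 32-bit fields plus a 32-bit extension length, loosely XDR-flavoured but without XDR's padding rules). The point is only that an old decoder can skip extensions it does not understand because they are length-prefixed.

```python
import struct

# x, y, then the length in octets of the trailing extension blob.
HEADER = struct.Struct(">III")


def encode_v1(x: int, y: int) -> bytes:
    """Original version of the record: no extensions."""
    return HEADER.pack(x, y, 0)


def encode_v2(x: int, y: int, z: int) -> bytes:
    """Newer version: the added field z travels inside the extension blob."""
    ext = struct.pack(">I", z)
    return HEADER.pack(x, y, len(ext)) + ext


def decode_v1(buf: bytes, pos: int = 0):
    """Old decoder: return ((x, y), next_pos), skipping extensions unread."""
    x, y, ext_len = HEADER.unpack_from(buf, pos)
    end = pos + HEADER.size + ext_len
    if end > len(buf):
        raise ValueError("truncated extension blob")
    return (x, y), end


# A v2 record followed by a v1 record, read back by the old decoder.
stream = encode_v2(1, 2, 99) + encode_v1(3, 4)
first, pos = decode_v1(stream)
second, pos = decode_v1(stream, pos)
print(first, second, pos)
```

No per-field tags are needed; only the extensible tail carries a length.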

Extensibility is something that has been beat to death in the ASN.1 space, and it has all of these options:

- extensibility markers in SEQUENCE / SET types (i.e., "struct" types)

- extensibility markers in CHOICE types (i.e., discriminated union types)

- extensibility markers in INTEGER and BIT STRING constraints (i.e., enum types)

- rules for handling known and unknown extensions in each ER (encoding rules)

- typed holes.

A typed hole is just a glorified discriminated union with an "external" sort of discriminant and specification of the union arms' types. Basically, a typed hole is just a struct with two fields: a) a type identifier of some sort (an integer, a string, an OID, a relative OID, whatever), b) an octet string containing an encoding of the value of a type identified by (a).

ASN.1 has syntax and semantics for expressing what type IDs go with what types, and so you can actually have compilers that recursively and automatically decode/encode through typed holes.
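
Here is a small Python sketch of a typed hole and of decoding through it automatically. Everything in it is illustrative: the type identifiers are made-up dotted strings rather than registered OIDs, and the registry stands in for what an ASN.1 compiler would derive from an object set.

```python
from dataclasses import dataclass


@dataclass
class TypedHole:
    type_id: str   # (a) identifies the type of the contained value
    value: bytes   # (b) octet string holding an encoding of that value


# Hypothetical registry mapping type identifiers to decoders.
DECODERS = {
    "1.2.3.1": lambda octets: octets.decode("utf-8"),
    "1.2.3.2": lambda octets: int.from_bytes(octets, "big"),
}


def open_hole(hole: TypedHole):
    """Decode through the hole if the type is known; otherwise hand back
    the hole untouched so the caller can skip or forward it."""
    decoder = DECODERS.get(hole.type_id)
    return hole if decoder is None else decoder(hole.value)


print(open_hole(TypedHole("1.2.3.1", b"hello")))
print(open_hole(TypedHole("1.2.3.2", b"\x01\x2c")))
print(open_hole(TypedHole("9.9.9", b"\x00")))
```

An unknown type ID is not an error here; the octet string is simply left opaque.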

> Do you feel that the same doesn't apply to serialisation formats? How are the non-TLV binaries encoded then? Just implied offsets according to the schema? Can you then evolve the schema at all, or do you feel that both producer and consumer should always have access to the full schema, and flexibility here is a non-feature?

I address this above. This is all addressed in ASN.1 (and also XML because of XMLNS). Many very smart people who came before you and me saw to it that ASN.1 addressed all these issues definitively long ago.

Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?

[1]: https://www-sop.inria.fr/rodeo/mavros/intro-mav.html Search for "Rose"

> Maybe you can answer a question I've had about ASN.1. Long time ago, Marshall Rose had harsh things to say about the ASN.1 macro facility like "buried semantics"[1]. Do you know what he meant?

My guess is that his complaint is that MACRO semantics are not well defined and are challenging to parse with conventional compilers. I've always wondered if they were inspired in some part by LISP, since you could in principle translate them fairly readily. ROSE and SNMP are still relatively commonly-used specifications that embed macro definitions, and most of the work I've seen done with them involves actually hard-coding the output (rather than actually parsing the MACROs).

Thanks for the link. Marshall Rose apparently has complaints about the ASN.1 MACRO facility.

You don't need ASN.1 MACROs for anything in Internet protocols, and you can do without more generally anyways.

I take it that means you have no idea what Marshall was talking about.

Rose was talking about a feature of ASN.1 (MACROs) that was removed and replaced with the Information Object System (X.681, X.682, X.683).

I'm not that familiar with the ASN.1 MACRO facility, no, because, after all, it's gone and replaced. My understanding is that the problems Rose identified led to the MACRO system being replaced -- good!

So yeah, I don't really know what Rose was talking about, but I do know plenty about the X.681/682/683 specs since I've implemented a subset of them.

> - there's a TON wrong with the BER family of encoding rules (BER, DER, and CER), and with every tag-length-value scheme

I would like to hear more about what's wrong with tag-length-value schemes. Can these be corrected, or would you advocate for alternatives? Which alternatives?

Can the veterans of the 90s SSL Wars explain the issues with ASN1/DER/BER? Looking it up today, it seems like a pretty smart and extensive serialization system, and I have to wonder why new systems like Google Protobufs chose to reinvent the wheel.

Conversely, how have modern systems avoided the pitfalls (if any) of ASN1/DER/BER?

I know of at least one problem with ASN.1. The string encodings other than UTF-8 are terrible. Most of the string encodings are very limited and weird subsets of ASCII that nobody actually uses anymore. ASN.1 itself doesn't define the encodings and just refers to other standards.

The problem with this is probably most notable with the T.61 encoding which changed over the years and since ASN.1 references other standards nobody is quite sure exactly what you have to support to have T.61 actually work right.

Within X.509 certificates though nobody bothers to actually implement T.61 and just uses the T.61 flag for ISO-8859-1.

There are a bunch of gory details around this mess in this (now quite old) write-up here: https://www.cs.auckland.ac.nz/~pgut001/pubs/x509guide.txt

Since that write-up, I believe UTF-8 has become pretty much the expectation for character encoding in X.509.

I documented some of the quirks around 6 years ago when I took an existing X.509 parser and improved it for use in certificate trust management in Subversion: http://svn.apache.org/viewvc/subversion/trunk/subversion/lib...

Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.

It's also notoriously difficult to parse well. It's very easy to have bugs in your parser, even if you're implementing only the subset of it that's needed for X.509, especially if you're doing so in a memory-unsafe language.

I can't speak for why Google invented Protobufs, but I can't imagine anyone sane picking up ASN.1 for anything modern and deciding that this is what they want to use.

For the string encoding thing, however, it does have UTF-8 and you should not use anything else to express arbitrary human text anyway.

PKIX actually leverages the weird encoding restriction to our benefit. It defines two kinds of names which things might have on the Internet (you can and should stop trying to name things which are actually on the Internet some other way), DnsNames and IpAddresses. IpAddresses, since they're either 32-bit or 128-bit arbitrary bit values, are just represented as either 32-bit or 128-bit arbitrary bit values. So you cannot express the erroneous IPv4 address 100.200.300.400 as an IpAddress, which means you can't trip up somebody's parser with that nonsense address. DnsNames use a deliberately sub-ASCII encoding from ASN.1 which can express all the legal DNS names (all A-labels and the ASCII dot . are permissible) but can't express lots of other goofy things including most Unicode. So a certificate issuer, even if they're completely incompetent, cannot write a valid DnsName that expresses some garbage IDN as Unicode. Hopefully they read the documentation and find out they need to use A-labels (Punycode) but if not they're prevented from emitting some ambiguous gibberish.
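
The IpAddress point is easy to check with Python's standard `ipaddress` module. The helper name below is my own; the sketch just shows that a 4-octet or 16-octet value always denotes some address, while the nonsense dotted-quad cannot even be packed into octets.

```python
import ipaddress


def parse_ip_octets(raw: bytes):
    """Interpret a 4- or 16-octet value as an IPv4 or IPv6 address."""
    if len(raw) == 4:
        return ipaddress.IPv4Address(raw)
    if len(raw) == 16:
        return ipaddress.IPv6Address(raw)
    raise ValueError("an IP address must be exactly 4 or 16 octets")


print(parse_ip_octets(bytes([192, 0, 2, 1])))
print(parse_ip_octets(bytes(15) + b"\x01"))

# The textual form can carry garbage; the binary form has no room for it.
try:
    bytes([100, 200, 300, 400])
except ValueError as exc:
    print("cannot pack:", exc)
```

Every possible 4-octet string is a syntactically valid IPv4 address, so a parser for this form has no malformed-text cases to worry about.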

Even in forums where you'd once have expected pushback, "Just use UTF-8" is becoming more widespread. At Microsoft, for example, you'd once have met at least some token resistance; today they're likely to agree: "Just use UTF-8". So ASN.1 ends up no worse off for having half a dozen bad ways to write text that you shouldn't use, compared to, say, XML, HTML, and so on.

Agree, although the right thing to do helps in specific applications but not so much in the general case. You're very often stuck with other people's MIBs / specs and encoders, trying to make sense of what a) they're allowed to put on the wire and b) what they actually do and under what circumstances.

A couple of years ago I ran into the same confusion of the "TeletexString"/"T61String" data type in ASN.1. After going down the rabbit hole of what is T.61 and trying to map it to Unicode, I reread the ASN.1 (X.690) spec and realized that the authors never actually referenced T.61. Ever since the first edition of ASN.1 in 1988, those strings have not used T.61. They use a character set that is easily mapped to Unicode - https://www.itscj-ipsj.jp/ir/102.pdf, a subset of US ASCII.

Not to say the rest of the spec is notably better. If fully implemented, it requires supporting escape codes in strings to change character sets. I've never seen valid escape codes in real-world data, but they probably exist.

As the original article shows, ASN.1 has lots of other challenges and complexity. Trying to write a code generator that supports all the complexity is no trivial task and the only open source one I've seen only generates C code. Protobuf has the advantage of having modern language support (including multiple type safe and memory safe languages).

Eh... It does have a transitive normative reference to T.61, but only by way of special restrictions on the use of three characters.

T61String is defined in terms of ISO 2022, with the default G0 character set set to ISO-IR-102 (as you linked). ISO-IR-102 defines the set of graphical characters, but also places a condition on the use of 3 of them by reference to T.61. It also requires that the control character set C0 be set to ISO-IR-106 by default, and ISO-IR-107 for C1.

The net effect is that the default character set of T61String is almost the T.61 character set, except that to get the T.61 character set, you need to include the escape sequence to set G1 to ISO-IR-103. ESC 2/9 7/6
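
For anyone not used to the column/row notation: each `c/r` pair names the octet `16*c + r`, and ESC itself is 1/11. A tiny Python sketch (the helper name is mine) that turns the sequence above into actual octets:

```python
def col_row(col: int, row: int) -> int:
    """Convert ISO 2022 column/row notation to an octet value."""
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return 16 * col + row


ESC = col_row(1, 11)                                   # 0x1B
designate_g1 = bytes([ESC, col_row(2, 9), col_row(7, 6)])
print(designate_g1.hex(" "))
```

So ESC 2/9 7/6 is the three octets 1B 29 76, i.e. ESC followed by ASCII `)` and `v`.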

A conforming T61String implementation does need to support the escape sequences and resulting encodings from ISO-IR-6, ISO-IR-87, ISO-IR-102, ISO-IR-103, ISO-IR-106, ISO-IR-107, ISO-IR-126, ISO-IR-144, ISO-IR-150, ISO-IR-153, ISO-IR-156, ISO-IR-164, ISO-IR-165, ISO-IR-168.

Since the control character sets include shift prefixes etc, properly parsing T61Strings into Unicode is non-trivial.

This is actually a pretty good reflection of the complexity in ASN.1. Technically the ASN.1 spec proper only requires that a T61String support exactly the set of characters specified in the above registrations. It does not mandate any particular format for them. It is the BER encoding that requires that ISO 2022 be used to encode these. A different encoding could specify that all strings are encoded as UTF-8, and the different types are just various subsets of allowed characters.

Heimdal's ASN.1 compiler generates C code. It also generates bytecode with C bindings. Two options.

Also, I've made it generate JSON dumps of the ASN.1 modules. My goal is to eventually replace the C-coded backends that generate C / bytecode with jq-coded backends that can generate C, Java, Rust, etc.

> Basically ASN.1 wasn't well defined and it only works well when people agreed to only use certain features or to interpret things in a particular way when ambiguous.

ASN.1 has always been as-well- or better-defined than its competition. The ITU-T specs for it are a thing of beauty not often equaled outside the ITU-T.

That said, for a long time the ASN.1 specs were non-free, and that hurt a lot. Also, the BER family of encoding rules stunted development of open source tooling for ASN.1.

> I can't imagine anyone sane picking up ASN.1 for anything modern and deciding that this is what they want to use.

Part of my curiosity stems from Apple using it as part of their bootable file-format: https://www.theiphonewiki.com/wiki/IMG4_File_Format

But as you say, I have to assume they're using it in a very constrained way.

> Part of my curiosity stems from Apple using it as part of their bootable file-format: https://www.theiphonewiki.com/wiki/IMG4_File_Format

I could only speculate, but I wonder if part of the reason is that DER is completely unambiguous and therefore suitable for cryptographic services. It's also very easy to decode without a specification (TLV format). Apple are almost certainly using ASN.1 compilers for their mobile devices and security layers (even if they ship FOSS implementations, I'd be surprised if they aren't checking their work with commercial compilers), so there's overlap there. Rolling your own format in that case could be unnecessary and another failure point that could be rolled into a single unit.

One should not design cryptographic protocols so that they require canonical encodings.

Instead one should write tooling that produces decoders that preserve the original encoding of signed data.
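
A rough Python sketch of what "preserve the original encoding" can look like. All of it is illustrative: the message layout (a SEQUENCE of a body followed by an OCTET STRING) is invented, and a bare SHA-256 digest stands in for a real signature. The decoder hands back the exact octets of the signed part, so verification never depends on re-encoding canonically.

```python
import hashlib


def tlv_bounds(buf: bytes, pos: int):
    """Return (start, content_start, end) offsets of the TLV at pos.
    Single-octet tags and definite lengths only."""
    first = buf[pos + 1]
    content_start = pos + 2
    if first == 0x80:
        raise ValueError("indefinite length is not handled in this sketch")
    if first < 0x80:
        length = first
    else:
        n = first & 0x7F
        length = int.from_bytes(buf[content_start:content_start + n], "big")
        content_start += n
    return pos, content_start, content_start + length


def split_signed(buf: bytes):
    """Split SEQUENCE { body, digest } and keep the body's octets verbatim."""
    _, pos, _ = tlv_bounds(buf, 0)
    body_start, _, body_end = tlv_bounds(buf, pos)
    _, digest_start, digest_end = tlv_bounds(buf, body_end)
    return buf[body_start:body_end], buf[digest_start:digest_end]


def verify(buf: bytes) -> bool:
    raw_body, digest = split_signed(buf)
    return hashlib.sha256(raw_body).digest() == digest


# The body uses a non-minimal long-form length: fine in BER, not DER.
body = b"\x04\x81\x03abc"
digest = hashlib.sha256(body).digest()
message = bytes([0x30, len(body) + 2 + len(digest)]) + body + bytes([0x04, len(digest)]) + digest
print(verify(message))
```

Had the verifier decoded the body and re-encoded it in DER (`04 03 61 62 63`), the digest would no longer match, which is exactly the dependency on canonical encoding that this approach avoids.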

> Instead one should write tooling that produces decoders that preserve the original encoding of signed data.

That's an interesting idea. How do you evaluate the tradeoffs in this design? I.e., what does it buy you compared to saying that you need to sort in tag order, for example? (Assume that you have something like an automatic tagging environment for sake of argument.)

> The string encodings other than UTF-8 are terrible.

Well, yes, because ASN.1 predates Unicode.

Oh where to begin?

ASN.1 really demands code generation. Unfortunately lots of nonconforming stuff has to be dealt with. The concept of encoding rules and the module tagging scheme make for a pretty big number of possible representations.

The language semantics of ASN.1 don't really map to anything well, particularly around default fields and structures that can vary.

Newer systems don't have encoding rules and pick a semantics that matches a target language much more closely.

> ASN.1 really demands code generation.

Nope, nyet, bzzt. Proofs by counter-example:

- OpenLDAP has a printf/scanf-like approach to BER encoding

- Heimdal has an ASN.1 compiler that generates code, yes, but also alternatively generates bytecode that gets interpreted at run-time.

> The language semantics of ASN.1 don't really map to anything well, particularly around default fields and structures that can vary.

You are ill-informed. Proof by counter-example:

- there are ASN.1 encoding rules that produce natural XML (XER) and JSON (JER)

- "default fields" are supported (the relevant keyword is `DEFAULT`, naturally)

- "structures that can vary" -- if you mean unions, it's got that (the relevant keyword is `CHOICE`), and if you mean "extensions", it's got extensibility markers (which are effectively like a CHOICE of an octet string of unknown stuff or else the extensions known at module compile time).

I have worked on code that took the OpenLDAP approach. It sucked, leading to partial parsing and processing. The rest of your comment misunderstands the nature of the semantics I'm talking about. It's not that we can't make XML or JSON; it's that programming languages often don't have types that map naturally onto all of ASN.1. A default that is not nil doesn't work in Go, for example.

Oh, I agree. I don't like the printf/scanf-like approach to BER encoding. In fact, it's awful.

The point I was making is that code generation is not the only option for ASN.1 or any encoding.

Also, ASN.1 types map very well onto C (surprise):

- OCTET STRING -> struct with pointer and length in bytes

- BIT STRING -> struct with pointer and length in bits

- INTEGER (constrained) -> some stdint.h integer type

- INTEGER (unconstrained) -> struct with pointer to array of uint64_t, array element count, and boolean to indicate if signed or unsigned

- REAL -> double or some arbitrary precision real library's type

- most string types -> pointer to array of char, or counted byte string type

- SEQUENCE OF and SET OF -> struct with pointer to array and count of elements

- SEQUENCE and SET -> struct

- CHOICE -> struct with discriminant enum and union of alternatives

- tags -> ignore

- OPTIONAL -> pointer

- DEFAULT -> nothing special

- NULL -> int (whatever)

- BOOLEAN -> unsigned int, bool, maybe a bitfield of unsigned integer type so that all booleans can be compressed, etc.

- OBJECT IDENTIFIER and RELATIVE OBJECT IDENTIFIER -> struct with pointer to DER encoding, and length in bytes

- extensibility markers -> [hard to make this pithy, but it can be handled just fine]

That covers like 99% of it. Suffice it to say that there's a very natural mapping of most of ASN.1 onto C.

Things like classes and object sets aren't types but can guide the tooling to provide automatic encoding and decoding through open types (typed holes).

BTW, `SET` is silly. `SET OF` is only of interest if you have arrays where order doesn't matter and you want a canonical encoding, but since one should not depend on canonical encodings, `SET OF` is also silly. IMO both should be deprecated (they can't be removed, but hey).

> ASN.1 really demands code generation.

On this specific point: isn't this also the case for other high-performance serialisers? Google ProtoBufs, Apache Thrift, any protocol through Rust's Serde...

Not really. You can trivially encode or decode protobuf or thrift at runtime, given a message specification, and this isn't uncommon in the wild. It's just that you usually expect messages which are well-defined at build time, so why not generate code?
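
As a sketch of what runtime decoding looks like, the Python below reads protobuf's wire format using a message description supplied as plain data instead of generated code. The schema dictionary and field names are hypothetical, and only wire types 0 (varint) and 2 (length-delimited) are handled.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        octet = buf[pos]
        pos += 1
        result |= (octet & 0x7F) << shift
        if not octet & 0x80:          # high bit clear marks the last octet
            return result, pos
        shift += 7


def decode_message(buf: bytes, schema: dict) -> dict:
    """Decode using a runtime schema {field_number: (name, kind)}.
    Fields missing from the schema are skipped."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if number in schema:
            name, kind = schema[number]
            fields[name] = value.decode("utf-8") if kind == "string" else value
    return fields


# Hypothetical message description, loaded at runtime rather than compiled in.
SCHEMA = {1: ("id", "varint"), 2: ("name", "string")}
wire = bytes.fromhex("089601" "120774657374696e67" "1801")
print(decode_message(wire, SCHEMA))
```

The trailing field 3 is not in the schema and is skipped, which works because each field carries its own number and wire type.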

No, it's not. There is no reasonable syntax/IDL/schema/whatever you want to call it for which you wouldn't have a choice of implementing by code generation or by bytecode generation.

How is that not obvious? It would be like saying that "the problem with LISP is that it has to be interpreted", or that "the problem with C is that it can only be compiled to object code", when both such statements are clearly incorrect because of real-life counter-examples.

But there is something special to ASN.1. Instead of seeing that there's nothing new under the Sun when it comes to data encoding and schemata, and that there hasn't been anything new in that field really since S-expressions, ASN.1 has engendered a special hatred that blinds everyone to things that they would grant as obvious in other cases.

For most of them, there isn't nonconformant data in the wild that you also need to live with. The combination is unholy.

Also expect to pay to read the spec.

ASN.1 standards are free: https://www.itu.int/rec/t-rec-x.680/en

Many, though not all, specifications that use ASN.1 are also freely available. I've been out of telecom for a while, so I don't know the status of the newer standards, but when I was working in the business GSM MAP and MMS were the only proprietary ones that were an issue.

GSM standards are also freely available --- look at 3gpp.org or etsi.org --- the biggest problem is finding which ones actually contain what you're looking for.

The ITU-T ASN.1 specs have been free for a very long time now. They used to be non-free, and that was a big problem with ASN.1, but that was decades ago.

There is NO problem with ASN.1 itself except a bit of ugliness. There are SERIOUS problems with DER/BER/CER and with all tag-length-value schemes -- this includes protobufs!

ASN.1 is just syntax and semantics. There are encoding rules that produce textual representations (GSER), XML (XER), JSON (JER), there's XDR-style encoding rules (PER and OER, but with 1-octet units instead of 4-octet units, plus efficient representation of optional fields).

In fact, you can make ASN.1 encoding rules that are based on NDR and XDR and which work for all of IDL and XDR and that subset of ASN.1 that is covered by the semantics of IDL and XDR, and you can extend that to cover all of ASN.1 if you want.

I should know these things, as I maintain an ASN.1 compiler and I intend to eventually teach it to do XDR and NDR.

Really, there's nothing about data schemas that you can express in JSON, CBOR, IDL, XDR, S-expressions, or any schema language you want, that you can't express in ASN.1, or, if there is, it's got to be a pretty niche feature and easily added to ASN.1 anyways. Even functions (RPCs) can be expressed in ASN.1 with some conventions, and routinely are, because it's really just a request/response protocol.

But every year someone invents a new thing because of how stupid, tired, and old ASN.1 is (or, rather, they perceive it to be). Or because of how complex ASN.1 is and how there's a paucity of tools, so then they: reinvent the wheel (often badly), a wheel for which instantly there is a paucity of tools.

Personally, I think that people just like to reinvent things. I don't want to sound shitty (or have kentonv show up again to scold me for it) but I get the feeling that, a lot of the time, it's just that simple.

https://news.ycombinator.com/item?id=20725550

To me that is a specious argument. It's like asking why Python was invented when Cobol could suffice.

The dozens of ASN.1 specs are absolutely hideous and entrenched in obsolete telecom jargon. If the sole goal of Protobuf was to avoid having Google engineers be required to refer to the dozens of ASN.1 specs when disagreements or confusions arose, then it would have been 100% worth it for just that reason.

First, let me confess that I don't have enough experience with ASN.1 or Protobufs to have an informed opinion.

The supporting argument for the "because it's there" hypothesis for why people reinvent things (in IT) is that they do it so often.

Even if all the newer message/serialization systems are better than ASN.1, they're not all better than each other, eh? Why so many? Same goes for chat systems, programming languages, etc.

There has been a lot more new stuff in the world of programming languages, even recently, than there has been in the world of data schemata and encoding rules.

That said, most of the innovation in programming language theory has been around Haskell and related languages, and it has not justified languages like Golang or Python. DSLs in general are justified regardless of whether they are innovative in terms of programming language theory.

The ASN.1 specs are beautiful. They are beautifully written, better than anything the IETF produces because the ITU-T is an expensive standards development organization that can afford to have people who only do this sort of thing.

The ASN.1 specs are very readable. Much easier to read than many important RFCs.

ASN.1 was too broad. There is immense value in a more constrained specification that does not include so many hazardous serialization types and antiquated string formats.

Now, should Protobufs or Thrift simply have been constrained versions of ASN.1? I think there is a view of software engineering where this would have been an ideal outcome, but almost universally when we see too-big standards, they are declared "dangerous" and avoided like the plague before they are downscoped.

ASN.1 in 1984 was not too broad. It was too simple, and it was too targeted to tag-length-value encoding rules (which are stupid -- TLV is a crutch that is only maybe useful when you lack a compiler, which early on was the case).

ASN.1 today is as broad as it needed to evolve to be because its users needed it.

There is value in throwing away cruft, especially cruft that comes from the IT Middle Ages (before we decided to drop any non-8-bit word sizes, before UTF-8 became the almost universal string encoding, etc.).

ASN.1 is extremely complicated and hard to implement correctly. All ASN.1 implementations I've seen are either specialized (know how to work only with a very specific message), or slow, buggy and expose equally complicated APIs. Modern systems like protobufs tend to use much simpler encodings & specs which are easier to understand and implement correctly.

I spent a few years during the late 90s/early 2000s in an industry running on ASN.1, coming from the web. I was initially surprised by how enamoured most of my coworkers were with ASN.1 and its tools, but it grew on me too: the pleasure of interacting only with a protocol specification regardless of the implementation language/intricacies of the remote party, the guarantee that there could be no invalid messages received or emitted, and the automatic generation of tests and tools eventually balanced out the inconvenience of not being able to readily read data on the wire (this was before every human-readable protocol got encrypted) and the inconvenience of not being able to start coding upfront.

It was like going from runtime type checking to static type checking: initially inconvenient, but paying dividends after a short while.

So why did this tech disappear if it was ultimately better than the later alternatives (textual protocols, schema-less serializers, and eventually protobuf, which reinstated some form of efficient encoding and type checking)?

As is uncannily often the case with technological evolution, the reason is probably not to be found in its technical issues (which basically all boil down to: designed by committee).

ASN.1 was just a bit too inconvenient, the free tools to generate code were just not quite good and robust enough, and the approach of starting with designing your types and protocols and putting in place your code production tool-chain before being able to ship anything was at odds with the mood of the day, which was to let the cheap junior dev fire up his code editor during the coffee break of the first design planning meeting to build the first half-baked prototype, which would already be sold to the customer by the time he hit :wq. For moving fast and breaking things, ASN.1 got in the way.

So did formal specifications in general, code analysis tools, even basic type checking, all of them thrown out the window during the same period for the extra weight, extra time-to-market and extra cost of hiring. Text protocols outcompeting saner alternatives because they are initially simpler (SIP vs H.323 anyone?), schema-less data formats predominating almost entirely because you can start hacking quicker, etc. are all attributable to that cultural rather than technical trend, I believe.

Now it seems the industry is slowly recovering from these excesses. Maybe because of the damage that was done, but more likely because of the end of cheap hardware progress, encryption everywhere and massive data volumes (that's what made Google come up with better protocols than HTTP and better formats than human-readable text, after all).

I owned the Microsoft ASN1 library for a while around 2005. It was a maintenance nightmare and I spent a lot of time fixing static analysis derived issues.

That said, I always found the standard quite interesting with different encodings based on the degree of prior shared info or format. My assumption is that not-invented-here is part of why it’s not used.

I own Heimdal's ASN.1 compiler. It's a pleasure.
I used the Netscape/Mozilla NSS library quite a bit, and one problem I found with it is that all of the DER encoding/decoding was written by hand. They should have generated all that boilerplate from the ASN.1 modules written in the specs (later, RFC 2459, but at the time, a hodge-podge of scattered specs).

Hand-coding works okay when the data is what you expect. But when you throw malformed certificates at it, you have to catch all the edge cases. Having generated code would have enabled many more edge cases to be covered.

Those libraries were originally written in the early/mid 90s. Don’t recall much in the way of code generation tools that would take those specs and generate the code at the time.

Spent a bunch of time working with and adding to those libraries.

10 different string encodings is one problem.
Is it? You pick the one that fits your use, normally UTF8String these days.
Can one use UTF8?

The 90s were rough on text encoding, but it seems pretty settled now.

> Can one use UTF8?

For new standards, yes. But ASN.1 was first specified in the '80s, and backwards compatibility is a thing. So really it depends on what you're doing: if you can start with a subset of ASN.1, which I think is done in MDER[0] and OER[1], you have a bit more freedom. But if you're working in legacy formats and standards that operate internationally, you could run into problems.

[0]: https://www.iso.org/standard/66717.html

[1]: see among others https://www.ntcip.org/document-numbers-and-status/

Kerberos implementations generally just-send-whatever in IA5String fields. That means Windows sends UTF-8, and MIT Kerberos and Heimdal send whatever the user's locale uses. Windows doesn't normalize or anything. It works in that a) it interops when using ASCII names, b) it interops when using non-ASCII names in UTF-8 locales on Unix. It violates the spec, but it works.
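
A small illustration of why this only interoperates in the ASCII and UTF-8 cases. This is my own sketch with a hypothetical helper name, assuming IA5String is treated as a 7-bit repertoire: an ASCII name is valid IA5 and identical in every locale, while a non-ASCII name is not valid IA5 at all and comes out as different bytes depending on the sender's locale.

```python
def is_valid_ia5(data: bytes) -> bool:
    """True if every octet fits in 7 bits, as IA5String nominally requires."""
    return all(b < 0x80 for b in data)

ascii_name = "alice".encode("utf-8")
utf8_name = "jos\u00e9".encode("utf-8")      # what a UTF-8 sender would emit
latin1_name = "jos\u00e9".encode("latin-1")  # what a Latin-1 locale would emit

print(is_valid_ia5(ascii_name))   # ASCII is fine in any locale
print(is_valid_ia5(utf8_name))    # high-bit octets: outside the spec
print(utf8_name == latin1_name)   # same name, different bytes on the wire
```

Two peers that both happen to use UTF-8 agree on the bytes, which is why it works in practice despite violating the spec.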
Stick to UTF8String. ASN.1 predates Unicode.
Or IA5String if you know, ahead of time, that you only need ASCII.
Or do what many implementors do: just send whatever you have as whatever string type the protocol spec requires.
No veteran of the 90s SSL wars, but I once upon a time was tasked with fixing security bugs in a custom protocol backend server which used ASN.1 for purposes for which one would probably use protobuf nowadays.

The quality of existing open source libraries to parse ASN.1 leaves a lot to be desired.

When I first saw protobufs, I wondered exactly the same thing.

There’s an “XER” if you want a human-readable XML encoding, too.

I have worked for a time with credit card terminal applications.

We used BER-TLV throughout the system extensively, where it was needed as well as where it wasn't.

I have implemented complete parsers/serializers, data structures using TLV, and a transactional database where data was stored as TLV documents. EMV is built on top of BER-TLV, SSL used it, and ISO-8583 messages transmitted data encoded with BER-TLV. Communication with the PIN Pad was built on it. We kept configuration as BER-TLV documents.

I got to the point where I could parse the hex representation in my head.

I really liked the standard. It is nice, flexible and very efficient. Easy to parse, can be parsed reliably and safely in statically allocated memory.

To those who think this is ancient history and it should be dropped -- do you think that might just be because you don't actually know it or maybe you just think it is old and so it must be bad?

Where EMV uses tags more like classes than types, I’m not really sure it actually counts as “abstract” syntax notation any more?

Because all tags are these custom things, some don’t strictly parse out to unique type codes too. So a non-EMV parser will have a few tags that map to the same integer code and cause some fun bugs.

That project was when I really understood deep-down why JSON won in the end!
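
One way the tag collision described above can arise, sketched in Python. This is an illustration under assumptions, not a claim about any particular parser: it follows the BER identifier-octet layout (two class bits, one constructed bit, five tag-number bits, with 0x1F signalling that base-128 digits follow), and the two EMV tags are the ones I believe are Application Interchange Profile (82) and Amount, Authorised (9F02).

```python
def parse_identifier(buf: bytes, pos: int = 0):
    """Parse BER identifier octets; return (tag_class, constructed, number, next_pos)."""
    first = buf[pos]
    tag_class = first >> 6            # 0=universal, 1=application, 2=context, 3=private
    constructed = bool(first & 0x20)
    number = first & 0x1F
    pos += 1
    if number == 0x1F:                # high-tag-number form: base-128 digits follow
        number = 0
        while True:
            octet = buf[pos]
            pos += 1
            number = (number << 7) | (octet & 0x7F)
            if not octet & 0x80:      # top bit clear marks the last digit
                break
    return tag_class, constructed, number, pos

short_form = parse_identifier(bytes.fromhex("82"))
long_form = parse_identifier(bytes.fromhex("9f02"))
print(short_form[:3], long_form[:3])
```

Both come out as context-specific, primitive, tag number 2. A parser that keys on the decoded (class, number) pair conflates them, whereas EMV code keys on the raw tag bytes. Strict BER would not use the multi-byte form for a number below 31, which is part of why generic parsers are surprised.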

Why are we even talking about ASN.1/DER/BER? We should, like the ancient Egyptian priests who opposed Akhenaten, chisel its name from every public edifice. Referring to it not as "ASN.1, the platform-independent abstract type system," but "the great heresy, which shall not be named."
Hand in your X.509 certificate at the door. (There was a proposal to do x509 as s-expressions. The road not taken...)

  ANY DEFINED BY ANY -- all following comments
ASN.1 is glorified S-expressions.

BER/DER/CER is binary S-expressions.

Anything you can represent in a parse tree is an s-expression. The point was that during the discussions of SPKI we talked about using canonical s-expression notation rather than ASN.1 to represent the TBS and other structural forms. SPKI used them. It stood in contradistinction to X.509 in ASN.1.

If my memory serves me right Kent and others argued for continuance of ASN.1 to keep alignment with the CCITT/ITU on standards docs.

This was a long time ago. Around rfc3270 days so early 2000s. My memory is hazy and my email archive off-line.

All objects in RPKI are in ASN.1, as is SNMP. I only have to deal with the former these days.

So, my take is that depending on canonical encodings in security protocols is a mistake. What one ends up doing is something like:

- "hmm, I've a decoded struct here, and a signature of its original encoding, and I have to validate that signature somehow... what do I do??"

- and then "ah, I know! I'll re-encode that struct and then I can validate the signature!!1!",

- but now you need a canonical encoding ruleset, otherwise if the signer had any liberties at all in the encoding, you will have interoperability problems!

And it turns out that specifying and -worse- implementing canonical encodings can be hard. Think of a canonical JSON... Let's say you have a JSON encoder lying around, and now you need to make it emit canonical JSON. You start by eliminating interstitial whitespace and you are ready to declare victory when you notice that you still need a canonical encoding of numbers, and also strings! Ok, now you have less-obvious design choices to make. Worse, adjusting your floating point number printer to emit canonical numbers turns out to be really hard, and there are a lot of traps in doing that. So maybe you decide you're going to limit yourself to integers. And it's all like this.
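
A minimal Python sketch of that trap, using only the standard library. The helper names are mine, and sorted keys plus compact separators is just one plausible first attempt at "canonical", not any standard's definition. Re-encoding a decoded value changes the bytes, so a digest over the original no longer matches, and the first-attempt canonicalizer still disagrees with itself on numbers.

```python
import hashlib
import json

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# What the signer actually put on the wire (compact, keys in its own order).
signed_bytes = b'{"b":1.0,"a":"x"}'
decoded = json.loads(signed_bytes)

# Naive re-encoding: the encoder picks its own whitespace.
reencoded = json.dumps(decoded).encode()
print(digest(reencoded) == digest(signed_bytes))

def canonical_attempt(value) -> bytes:
    """First attempt: sorted keys, no interstitial whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode()

# Whitespace and key order are now fixed, but two spellings of the same
# number still produce different "canonical" bytes.
as_int = canonical_attempt(json.loads(b'{"a":"x","b":1}'))
as_float = canonical_attempt(json.loads(b'{"a":"x","b":1.0}'))
print(as_int, as_float)
```

Fixing that last part means picking one printed form for every number, which is exactly where the floating point traps start.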

There is a better answer. The Heimdal ASN.1 compiler has a --preserve-binary=TYPE option where you can say that you want the decoder to preserve the original encoding of the given TYPE(s) so that you can validate signatures later. The way this works is that for each such TYPE, the compiler adds a `_save` field that has a copy of the encoding of that type as it was seen by the decoder.
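
The idea can be sketched in a few lines of Python. To be clear, this is not Heimdal's generated code (which is C); the type, the toy one-INTEGER "TBS" structure and the function names are all hypothetical, and SHA-256 stands in for the signature check. The decoder keeps the exact bytes it saw, so validation never depends on re-encoding.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class DecodedTBS:
    serial: int
    _save: bytes  # exact bytes seen by the decoder, mirroring the `_save` idea

def read_tlv(buf: bytes, pos: int):
    """Read one TLV with a short-form length; return (tag, value, end)."""
    tag, length = buf[pos], buf[pos + 1]
    if length & 0x80:
        raise ValueError("long-form lengths are out of scope for this sketch")
    start = pos + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated element")
    return tag, buf[start:end], end

def decode_tbs(buf: bytes) -> DecodedTBS:
    tag, body, end = read_tlv(buf, 0)
    if tag != 0x30:
        raise ValueError("expected SEQUENCE")
    int_tag, int_value, _ = read_tlv(body, 0)
    if int_tag != 0x02:
        raise ValueError("expected INTEGER")
    serial = int.from_bytes(int_value, "big", signed=True)
    return DecodedTBS(serial=serial, _save=buf[:end])

# Valid BER but not DER: INTEGER 5 with a redundant leading zero octet.
wire = bytes.fromhex("300402020005")
signed_digest = hashlib.sha256(wire).digest()

tbs = decode_tbs(wire)
reencoded_der = bytes.fromhex("3003020105")  # what a strict DER re-encoder would emit

print(hashlib.sha256(tbs._save).digest() == signed_digest)
print(hashlib.sha256(reencoded_der).digest() == signed_digest)
```

The saved-bytes check succeeds even though the signer took liberties with the encoding; the re-encode-and-compare check does not.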

I'm with Stephen Kent on this. I don't like the OpenSSH certificate format, for example -- it's missing important things and it's not that much simpler than the PKIX certificate format. The OpenSSH certificate format is much less bloaty than the PKIX one because PKIX uses DER and OpenSSH doesn't -- but so what, one could simply use an OER encoding of PKIX certificates and get the same de-bloating benefit with much less churn to existing codebases.

> You might have heard of similar such abstract syntax notations used for interface definitions such as Google Protocol Buffers, or Facebook’s Apache Thrift, but those languages have not been managed by a standardization organization, so the owning corporations could (in theory) make breaking changes or change the license or even remove the language definitions overnight.

Is this really the main difference between ASN.1 and Google protobufs, that one is managed by a private corporation and the other by a standardization organization? Can they otherwise be used "interchangably" in designing interfaces, a la two different programming languages (with different syntax of course)?

ASN.1 struggles because the word "ASN.1" can name a lot of different implementations with different nuances, and a "complete" ASN.1 implementation is a massive and hazardous undertaking which has left many with a sour taste. Meanwhile, ProtoBufs and Thrift work off of more constrained and well-versioned interfaces.

Honestly, ASN.1 with semantic versioning at the protocol level would probably have been as robust and useful as Protobufs. If ASN.1 had been forked into "ASN.1 3.0 without 10 hazardous and awful 1980s text encodings," it could even be fairly palatable today. Whether the overly expansive nature of ASN.1 is a product of the committee / standards organization design or the timeframe in which it originated is certainly an interesting philosophical question.

I used a subset of ASN.1 for a project, and it worked quite well.

ASN.1 versioning in particular is a work of art.

> Meanwhile, ProtoBufs and Thrift work off of more constrained and well-versioned interfaces.

Not so. Protocol buffers is just a TLV encoding, which is bad (see elsewhere in this thread) -- it's just a cut-down ASN.1 and variation on BER, so what.

ASN.1 can "well-version" everything just as well as anything else.

If I have two "proto3" implementations using the same definitions, I trust they work together, generally speaking.

If I have two ASN.1 BER implementations, I sadly can't really trust they work together, because I don't know what parts of "ASN.1" each one implemented.

ASN.1 is a specification. DER/BER were methods to encode that specification.
In terms of tooling, there’s excellent tooling for ASN.1 for C and C++ and maybe some other languages. There’s excellent tooling for protobufs for a handful of languages too, but they’re different sets, so in practice what languages you want to use would likely come into play.
How excellent the ASN.1 tooling is depends on which subset of ASN.1 you're using. Some of the tooling supports one iteration of ASN.1 or the other. To the degree that the IETF had to write a document on how to deal with this since some of the standards use the older ASN.1 and some use the newer ASN.1: https://tools.ietf.org/id/draft-ietf-pkix-asn1-translation-0...

Interoperability with ASN.1 is very fragile at best.

BTW, that I-D is now RFC 6025 [0].

There's also RFC 5912 [1], which adds x.681/x.682/x.683 constraints to PKIX modules. I use this to great effect in Heimdal[2]. One function call can decode everything in a certificate, and a second can pretty print it in JSON; one command can pretty-print a certificate in all its glory in JSON.

  [0] https://datatracker.ietf.org/doc/html/rfc6025
  [1] https://datatracker.ietf.org/doc/html/rfc5912
  [2] https://github.com/heimdal/heimdal
      https://github.com/heimdal/heimdal/tree/master/lib/asn1
We have tons of interoperable PKIX implementations (OpenSSL and derivatives, NSS, OpenJDK's, GnuTLS, wolfSSL, Heimdal, and many many more), and a bunch of interoperable Kerberos implementations (MIT Kerberos, Heimdal, Windows / AD, OpenJDK's, the IBM Java's, GNU Shishi, there's a python implementation).
> In terms of tooling, there’s excellent tooling for ASN.1 for C and C++ and maybe some other languages. There’s excellent tooling for protobufs for a handful of languages too, but they’re different sets, so in practice what languages you want to use would likely come into play.

In my experience, tooling is actually very good for most commonly-used languages, including C/C++, C#, Java, Python, and maybe even Go. And, of course, erlang. The real challenge is, I think, that you cannot find good free tooling, and the barrier to entry for Joe Developer is fairly high (in the thousands of dollars).

I concur. Furthermore, the pricing is often opaque: you know the tools are expensive when they always want you to contact sales for a quote.
> Is this really the main difference between ASN.1 and Google protobufs, that one is managed by a private corporation and the other by a standardization organization? Can they otherwise be used "interchangably" in designing interfaces, a la two different programming languages (with different syntax of course)?

No, the two are not interoperable and probably won't be made that way. Protobuf has undergone changes that challenge its backwards-compatibility (e.g., with item presence). ASN.1 supports multiple encoding rules, and while it's possible that someone could map ASN.1 syntax to protobuf encodings, it would only support a subset of ASN.1 because protobuf doesn't support length or value constraints (among other ASN.1 features).

ASN.1 does have a little-used standard called Encoding Control Notation[0] that in principle supports the construction of novel encodings. But I have never seen a compiler, commercial or otherwise, that supports it. It requires a certain expressiveness in your parser that's hard to do right, although I've wondered if LISP or Racket could take it on.

[0]: https://www.itu.int/rec/T-REC-X.692-202102-I

Protocol buffers is a tag-length-value encoding. It's got all the problems that DER and CER have. It's what happens when people decide to reinvent a wheel they don't understand.
What are the issues with TLV? I guess one could be that it's difficult to modify messages. On the other hand skipping parts of a message is efficient.
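
For what it's worth, the "skipping is efficient" half is easy to show. A rough Python sketch, assuming single-byte tags and BER/DER definite lengths only (function names are mine): the reader jumps over each value using its length and never interprets the bytes it doesn't care about.

```python
def read_length(buf: bytes, pos: int):
    """Parse a definite length starting at buf[pos]; return (length, next_pos)."""
    first = buf[pos]
    if first < 0x80:                      # short form: the octet is the length
        return first, pos + 1
    count = first & 0x7F                  # long form: `count` length octets follow
    if count == 0:
        raise ValueError("indefinite length not handled in this sketch")
    if pos + 1 + count > len(buf):
        raise ValueError("truncated length")
    return int.from_bytes(buf[pos + 1:pos + 1 + count], "big"), pos + 1 + count

def find_tag(buf: bytes, wanted: int):
    """Return the value of the first top-level element tagged `wanted`, else None."""
    pos = 0
    while pos < len(buf):
        tag = buf[pos]
        length, start = read_length(buf, pos + 1)
        end = start + length
        if end > len(buf):
            raise ValueError("truncated element")
        if tag == wanted:
            return buf[start:end]
        pos = end                         # skip the value without looking inside
    return None

# UTF8String "hello", INTEGER 5, BOOLEAN TRUE, back to back.
message = bytes.fromhex("0c0568656c6c6f" "020105" "0101ff")
print(find_tag(message, 0x01))
```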
You should write a blogpost at this point.

You can write more about these problems and it would have higher visibility.

Good point. I might, yes.
What’s so great about ASN.1 and its encoding rules is that anyone writing type-length-value serialization for networking purposes, for example[1], is basically independently reinventing ASN.1 because it’s so fundamentally optimal.

It truly will make you wonder why Protobufs and others exist.

[1]: https://github.com/Planimeter/grid-sdk/blob/master/engine/sh...

> What’s so great about ASN.1 and its encoding rules is that anyone writing type-length-value serialization for networking purposes, for example[1], is basically independently reinventing ASN.1 because it’s so fundamentally optimal.

The challenge arises if you have very large values: by nature, TLVs require that the V be encoded before you can plug in the L. If you use definite-length encodings (as required by DER), you may end up having to hold and encode a pretty large piece of data in memory. You can work around this, of course, but it can be a challenge.
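
To make the "V before L" point concrete, here is a small Python sketch of definite-length DER encoding (helper names are mine, single-byte tags assumed). The parent SEQUENCE header cannot be written until every child has been fully encoded, because the header's size and contents depend on the total child length.

```python
def der_length(n: int) -> bytes:
    """Encode a definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def der_tlv(tag: int, value: bytes) -> bytes:
    return bytes([tag]) + der_length(len(value)) + value

def der_sequence(children) -> bytes:
    body = b"".join(children)   # every child is fully materialized here
    return der_tlv(0x30, body)  # only now is the outer length known

payload = b"\x00" * 300                    # stand-in for a "very large" value
octets = der_tlv(0x04, payload)            # OCTET STRING
seq = der_sequence([der_tlv(0x02, b"\x05"), octets])

print(octets[:4].hex(), seq[:4].hex(), len(seq))
```

With 300 bytes that is harmless; with gigabytes you either buffer everything, make two passes (size first, then emit), or move to an encoding that allows indefinite lengths or chunking.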

Tags in ASN.1 as noted in another comment can also be pretty complicated: there are four tagging classes, and tags can be applied implicitly, explicitly, or automatically depending on the specification. This can make life a bit uncomfortable at times.

On the balance, I can understand why people find ASN.1 such a pain, especially if you're not inclined to fork over money to have someone else deal with the encodings. For medium- to large-sized companies, though, it's probably not a bad deal: get a support contract from one of the commercial vendors, get training, and save yourself six man-months on writing pretty bullet-proof serialization code without the headache of worrying about standards incompatibilities. If you happen to work in telecommunications or security, you're going to deal with ASN.1 at some point anyway, so having something that can talk to multiple parts of your stack can be helpful, too.

That there are four tag classes is not really a complexity. That there's IMPLICIT and EXPLICIT tagging is.

Using IMPLICIT tagging yields encodings that dumpasn1(1)-like tools can't really give you much insight into.

Using EXPLICIT tagging yields bloat.

The answer is to use non-TLV encodings where possible and to use tools that can refer to the schema ("modules") to decode and pretty-print arbitrary things. dumpasn1(1) is just too simple.
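
A byte-level illustration of the IMPLICIT/EXPLICIT trade-off, as a Python sketch with a hypothetical helper and short-form lengths only. It encodes INTEGER 5 three ways: untagged, as `[0] IMPLICIT INTEGER`, and as `[0] EXPLICIT INTEGER`.

```python
def tlv(tag: int, value: bytes) -> bytes:
    """Single-byte tag, short-form length; enough for this illustration."""
    if len(value) >= 0x80:
        raise ValueError("long-form lengths are out of scope for this sketch")
    return bytes([tag, len(value)]) + value

UNIVERSAL_INTEGER = 0x02
CONTEXT_0_PRIMITIVE = 0x80    # context-specific class, primitive, tag number 0
CONTEXT_0_CONSTRUCTED = 0xA0  # same class and number, constructed bit set

value = b"\x05"  # content octets of INTEGER 5

plain = tlv(UNIVERSAL_INTEGER, value)         # the universal tag says "INTEGER"
implicit = tlv(CONTEXT_0_PRIMITIVE, value)    # [0] replaces the universal tag
explicit = tlv(CONTEXT_0_CONSTRUCTED, plain)  # [0] wraps the whole inner TLV

print(plain.hex(), implicit.hex(), explicit.hex())
```

The IMPLICIT form carries no hint that the content is an INTEGER rather than, say, an OCTET STRING or BOOLEAN, so a schema-less dumper can only show raw bytes. The EXPLICIT form keeps the inner universal tag at the cost of two extra octets per tagged field here.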

[2]: https://github.com/openssl/openssl/issues/4320 I recently had to deal with that situation too.
I don't agree with the sentiment that TLV is optimal, but the assertion that people are constantly reinventing ASN.1 is most definitely true!
Back when I was in school in 2004, I had a teacher who had worked on the ASN.1 spec.

In 2004, XML was all the rage. People would create "XML startups", and Microsoft did SOAP and some other guys XHTML, and XML schemas, semantic web and so on.

I remember that teacher being so upset that XML got big and ASN.1 disappeared. It was very awkward. Poor guy...

Two very funny things happened then:

a) ASN.1 got XML Encoding Rules (XER), so you can use XML w/ ASN.1 as the schema language, which really, mostly is about supporting existing ASN.1-based protocols but with XML because well, you know, XML was all the rage,

and

b), FastInfoSet happened, which is an ASN.1 PER-based "compression" of XML because well, you know, XML is too verbose and unwieldy.

I [bleep] you not, that happened.

Evidence that there's nothing wrong with ASN.1 the syntax (and that's all it is, syntax and semantics, with a side of pluggable encoding rules where you can make them all up the way you want). Everything that's wrong with ASN.1 is either that which is wrong with BER/DER/CER (plenty), or that which is wrong with people's perception of ASN.1 (also plenty).

Our (computing) history is littered with better technology that was overtaken by "worse".

I wonder if your teacher eventually understood why XML was preferred over ASN1. Seems to me like it was easier to pick up, and harder to mess up.

Proto proto buf.
> Can the veterans of the 90s SSL Wars explain the issues with ASN1/DER/BER? Looking it up today, it seems like a pretty smart and extensive serialization system, and I have to wonder why new systems like Google Protobufs chose to reinvent the wheel.

Not an SSL or 90s veteran etc., but:

- ASN.1 is how OSI was trying to jump on the object-orientation bandwagon: inheritance via text file declarations, plus a requirement to register type OIDs with a single entity somewhere on Earth...

- ASN.1 is part of OSI, a scientifically-correct attempt at networking standardisation

- ASN.1 is part of OSI, which was miraculously dropped exactly when the Cold War ended and replaced by the much simpler TCP/IP and friends, modulo the security parts, which still need to use INSANE formats for passing numbers and strings...

- ASN.1 implementations are guaranteed to be buggy for decades, in my opinion and from my observations