It's possible that the internal version of protoc is very different from the open-source version. (I know there are numerous differences, but I'm not sure how pervasive they are in the parser.) The open-source version has a hand-written tokenizer and recursive descent parser that is not too difficult to translate to EBNF. You'll notice that the section on numeric literals is a little wonky, because the tokenizer does a check that is hard to describe in EBNF. But it isn't too bad.

Also, some of the constraints of the language are in prose in this spec because they are easier to enforce using a semantic validation pass than to model purely with a CFG. (Optionality of the colon in the text format, used in message literals, comes to mind.) There are some things that technically _could_ be handled in the grammar, but they would make the grammar much more cumbersome to read and understand. So those things are also extracted into prose.

> Definitely interesting for this company to create an EBNF definition for protobuf.

For what it's worth, Google has also published EBNF definitions (the subject blog post contains links to those specs). But they are incomplete and not entirely accurate, which is a non-trivial part of what led us to writing and publishing this spec.
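To make the numeric-literal point concrete, here is a minimal Python sketch of the kind of tokenizer check that is awkward to state in EBNF. It assumes the check in question is the one where a number may not run directly into an identifier character (open-source protoc reports "Need space between number and identifier" for input like `123abc`); the comment above does not say which check it means. The `NUMBER` pattern and the `scan_number` helper are hypothetical and heavily simplified (decimal integers and floats only), not the spec's or protoc's actual rules.

```python
import re
import string

# Hypothetical, simplified numeric-literal pattern: decimal ints and floats only.
# A real tokenizer also has to deal with hex, octal, and other forms.
NUMBER = re.compile(r"\d+(\.\d*)?([eE][+-]?\d+)?|\.\d+([eE][+-]?\d+)?")

# Characters that could continue an identifier.
IDENT_CHARS = set(string.ascii_letters + string.digits + "_")


def scan_number(src, pos=0):
    """Scan one numeric literal starting at pos; return (text, end_position)."""
    m = NUMBER.match(src, pos)
    if m is None:
        raise ValueError(f"no numeric literal at position {pos}")
    end = m.end()
    # The lookahead check: the grammar alone would happily split "123abc"
    # into the number "123" and the identifier "abc". The tokenizer instead
    # rejects a literal that is immediately followed by an identifier character.
    if end < len(src) and src[end] in IDENT_CHARS:
        raise ValueError("need space between number and identifier")
    return m.group(), end
```

With this sketch, `scan_number("123 abc")` returns the literal and its end position, while `scan_number("123abc")` raises. A plain production for numeric literals cannot express "and the next character is not a letter" without a lookahead or a deliberately over-broad token that is validated afterwards, which is one way a spec section on numeric literals ends up looking wonky.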
protoc allows comments basically anywhere, but how those comments end up in the Descriptor object for a proto is not well defined: there are places where you can put comments that never show up in the Descriptor. It provides leading and trailing comments, but many other cases are missed today (like comments embedded within a list of items in an array). Maybe this is a mismatch between what protoc allows and what Descriptor presents, but it's definitely annoying.