by jhumphries131 1377 days ago
It's possible that the internal version of protoc is very different from the open-source version. (I know there are numerous differences, but not sure how pervasive they are in the parser.)

The open-source version has a hand-written tokenizer and recursive descent parser that is not too difficult to translate to EBNF. You'll notice that the section on numeric literals is a little wonky, because the tokenizer does a check that is hard to describe in EBNF. But it isn't too bad.
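To illustrate the kind of tokenizer check that is awkward to express in EBNF, here is a rough Python sketch. It assumes the check in question is that a numeric literal may not run directly into an identifier character (so `123abc` is an error rather than two tokens); the comment above does not say which check it means, so treat that as an assumption. The function name, the regex, and the error text are hypothetical, not taken from protoc or the spec.

```python
import re

# Simplified decimal literal (int or float). Real protobuf literals also
# include hex, octal, and other forms that this sketch ignores.
_NUMBER = re.compile(r"\d+(\.\d+)?([eE][+-]?\d+)?")


def scan_number(src: str, pos: int = 0) -> str:
    """Return the numeric literal starting at pos, or raise ValueError."""
    m = _NUMBER.match(src, pos)
    if m is None:
        raise ValueError(f"no numeric literal at offset {pos}")
    end = m.end()
    # The extra check (assumed): the literal must not be immediately
    # followed by a letter, digit, or underscore.
    if end < len(src) and (src[end].isalnum() or src[end] == "_"):
        raise ValueError(f"need space between number and identifier at offset {end}")
    return m.group(0)
```

A grammar can describe what a number looks like, but "and the next character is not part of an identifier" is a lookahead condition, which is why it tends to end up as a procedural check or as prose.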

Also, some of the constraints of the language are in prose in this spec because they are easier to enforce in a semantic validation pass than to model purely with a CFG. (Optionality of the colon in the text format, used in message literals, comes to mind.)
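To make the colon example concrete, here is a hedged Python sketch of such a validation pass. It assumes the rule is that the colon after a field name may be omitted when the value is a message literal (`foo { ... }`) but is required before a scalar value, and it ignores list values for brevity. `FieldNode` and `check_colons` are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldNode:
    name: str
    has_colon: bool         # did the source contain ':' after the field name?
    value_is_message: bool  # is the value a { ... } message literal?


def check_colons(fields: List[FieldNode]) -> List[str]:
    """Return one error string per field that omits a required colon."""
    errors = []
    for f in fields:
        # Assumed rule: only message-valued fields may drop the colon.
        if not f.has_colon and not f.value_is_message:
            errors.append(f"field {f.name!r}: ':' is required before a scalar value")
    return errors
```

With this split, the grammar can simply treat the colon as optional everywhere, and the pass above rejects the cases where it was actually required.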

There are some things that technically _could_ be handled in the grammar, but they would make the grammar much more cumbersome to read and understand. So those things are also extracted into prose.

> Definitely interesting for this company to create an EBNF definition for protobuf.

For what it's worth, Google has also published an EBNF definition (the subject blog post contains links to those specs). But they are incomplete and not entirely accurate, which is a non-trivial part of what led us to write and publish this spec.


One place protoc doesn't align well is the descriptor object. https://developers.google.com/protocol-buffers/docs/referenc...

Protoc allows comments basically anywhere, but how to get those comments out of a Descriptor object for a proto is not well defined: there are places where you can put comments that are simply not available within Descriptor. It provides leading/trailing comments, but many other cases are missed today (like comments embedded within a list of items in an array). Maybe this is a mismatch between what protoc allows and what Descriptor presents, but it's definitely annoying.

I agree! Protoc allows comments anywhere, but it doesn't bother preserving them all in the descriptor. At one point I thought this was a bug, since comments _could_ be preserved in far more contexts (though definitely not all). But then I realized that the descriptor.proto comments do actually state the places where comments are retained:

> If this SourceCodeInfo represents a complete declaration, these are any comments appearing before and after the declaration which appear to be attached to the declaration.

https://github.com/protocolbuffers/protobuf/blob/v21.5/src/g...

After re-reading with this caveat in mind -- that comments are only preserved before and after complete declarations -- I realized that it does retain the comments that are expected.

Long story short: the descriptor is bad as an AST if you care about recovering the original source. The descriptor is lossy. I'd recommend using a real AST for use cases that want something non-lossy: https://pkg.go.dev/github.com/jhump/protoreflect/desc/protop...
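For reference, a rough sketch of how comments hang off a descriptor. In descriptor.proto, each `SourceCodeInfo.Location` has a `path` (field numbers and indexes leading to a declaration) plus `leading_comments`, `trailing_comments`, and `leading_detached_comments`. The Python below models that with a plain dataclass instead of the real generated classes; the stand-in class, the `find_location` helper, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Location:
    """Stand-in for SourceCodeInfo.Location (only the comment-related parts)."""
    path: List[int]
    leading_comments: str = ""
    trailing_comments: str = ""
    leading_detached_comments: List[str] = field(default_factory=list)


# Field numbers as defined in descriptor.proto.
FILE_MESSAGE_TYPE = 4  # FileDescriptorProto.message_type
MESSAGE_FIELD = 2      # DescriptorProto.field


def find_location(locations: Sequence[Location], path: Sequence[int]) -> Optional[Location]:
    """Return the location whose path matches exactly, or None."""
    for loc in locations:
        if list(loc.path) == list(path):
            return loc
    return None


# Made-up sample: first message in the file, and its first field.
locations = [
    Location([FILE_MESSAGE_TYPE, 0], leading_comments=" A user.\n"),
    Location([FILE_MESSAGE_TYPE, 0, MESSAGE_FIELD, 0], trailing_comments=" Display name.\n"),
]
```

Every slot here is tied to a complete declaration, so a comment sitting in the middle of one (for example between items of an array literal) has nowhere to go. That is the lossiness in question.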

I totally forgot about .proto comments. I spent a while re-implementing that [1] for the IntelliJ editor and even longer testing it [2]. Apparently I opened a bug while doing so; is b/33539835 still open?

1: https://github.com/jvolkman/intellij-protobuf-editor/blob/ma...
2: https://github.com/jvolkman/intellij-protobuf-editor/blob/ma...