Hacker News
by haberman 5508 days ago
From my quick scan of the thesis, the basic design seems to be a programming language in which you write both the parser and any transformations you want to perform. It's not clear whether there is an easily accessible parse tree serialization that you can use to load the output into another language, or whether you'd have to invent that yourself.

I think it's generally a hard sell if you try to convince people that they need to write their algorithms in your special language. Parsing tools deliver value because grammars are easier to write than the imperative code that implements those grammars. That value offsets the cost of having to learn a new special-purpose language. But imperative programming languages are already pretty good at tree traversal and transformation, so there's little benefit to using a special-purpose language for this.

I think that the next big thing in parsing will be a runtime that easily integrates into other languages so that the parsing framework can handle only the parsing and all of the tree traversal and transformation can be performed using whatever language the programmer was already using. This requires much less buy-in to a special-purpose language.

3 comments

Colm has built-in serialization. There is still some work to do in this area though. Colm will preserve whitespace for minimal disruption of untransformed text, but figuring out what to do at the boundaries between modified and unmodified trees can be tricky.

You are right, people want to use general-purpose languages for the more complex algorithms. I agree a means of embedding is necessary and I have kept this in mind, though I have not yet achieved it. I would very much like to be able to parse, transform, then have the option to import the data into another environment and carry on there.

Thanks for the info. What is the built-in serialization format?
Just plain old text as it came in. I see now that is not what you were referring to. I now think you're talking about JSON, XML, etc.

There is also a print_xml function, which puts the tree into XML, but it's mostly used for debugging at this point, not export to other systems. I'm hoping that with time these kinds of features will crop up.
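To make the export idea concrete, here is a minimal Python sketch of dumping a parse tree to XML so another environment can load it. This is not Colm's print_xml output format; the `(name, children)` tree shape, the `token` element name, and the `tree_to_xml` helper are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

def tree_to_xml(node):
    """Convert a hypothetical (name, children) parse-tree node to an XML element.

    String children are treated as token text; tuple children are subtrees.
    """
    name, children = node
    elem = ET.Element(name)
    for child in children:
        if isinstance(child, str):
            tok = ET.SubElement(elem, "token")  # "token" is an assumed element name
            tok.text = child
        else:
            elem.append(tree_to_xml(child))
    return elem

# A tiny hand-built tree standing in for real parser output.
tree = ("sum", [("num", ["1"]), "+", ("num", ["2"])])
xml_text = ET.tostring(tree_to_xml(tree), encoding="unicode")
print(xml_text)
```

Any language with an XML reader could then pick up `xml_text` and continue the traversal or transformation there.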

ANTLR can do this, although it does not work that well. I used the C backend, which is pretty directly ported from the Java backend. C written in Java style is pretty awkward.
ANTLR is not the same as what I am describing. ANTLR generates code in each target language; I am talking about a common runtime that all languages call into. Think of it as a "parsing VM." Using this scheme, there would be no need to have separate backends for C and Java; the only thing you'd need to port is the bindings.
Why would that be a good idea? What advantage would a just-in-time parser generator have that a static parser generator does not?
Fast parsing from any language (even slow languages like Ruby) without having to compile and link generated C code for every grammar into the interpreter.

With this approach, you could have a C extension that could load any grammar at runtime and parse it extremely fast.

Parsing in C or assembly is orders of magnitude faster than parsing with generated code in a language like Ruby.
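A rough Python sketch of the "grammar as data" idea: one generic engine interprets a grammar loaded at runtime, so nothing is generated or compiled per grammar. Everything here is an assumption for illustration: the JSON grammar format, the `parse` and `parse_all` names, and the naive backtracking strategy. A real runtime of the kind described would be written in C and exposed through bindings, and would use a far more efficient algorithm.

```python
import json

# Hypothetical grammar format: each rule maps to a list of alternatives,
# and each alternative is a list of symbols. A symbol that names a rule is
# a nonterminal; anything else is matched as a literal string.
GRAMMAR_JSON = """
{
  "sum": [["num", "+", "sum"], ["num"]],
  "num": [["1"], ["2"], ["3"]]
}
"""

def parse(grammar, rule, text, pos=0):
    """Try each alternative of `rule` in order (ordered choice).

    Returns (tree, next_pos) on success, or None on failure.
    """
    for alternative in grammar[rule]:
        children, cur = [], pos
        for symbol in alternative:
            if symbol in grammar:
                result = parse(grammar, symbol, text, cur)
                if result is None:
                    break
                subtree, cur = result
                children.append(subtree)
            elif text.startswith(symbol, cur):
                children.append(symbol)
                cur += len(symbol)
            else:
                break
        else:
            # Every symbol in this alternative matched.
            return [rule, children], cur
    return None

def parse_all(grammar, start, text):
    """Parse `text` from rule `start`, requiring the whole input to match."""
    result = parse(grammar, start, text)
    if result is None or result[1] != len(text):
        raise ValueError("parse error")
    return result[0]

# The grammar is loaded at runtime; the engine itself never changes.
grammar = json.loads(GRAMMAR_JSON)
tree = parse_all(grammar, "sum", "1+2")
print(json.dumps(tree))
```

Because the result is plain nested lists, the host language gets an ordinary data structure to traverse, and swapping in a different grammar only means loading different data.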

What you're describing sounds like the Gold parser: http://www.devincook.com/goldparser/
Yes, or my own project, Gazelle: http://www.gazelle-parser.org/