by Scaevolus 2536 days ago
These ideological decisions don't sound very pragmatic. There's a lot of open-source prior art in this space (OpenGrok, Kythe, Sourcegraph) that supports most major languages and has annotation output formats broadly similar to this JSON file, and you could still let users run indexers for smaller languages as part of CI.

> There does not exist any widely available standalone C parsing library to provide C programs with access to an AST. There’s LLVM, but I have a deeply held belief that programming language compiler and introspection tooling should be implemented in the language itself. So, I set about to write a C parser from scratch.

Even if you prefer to write your C indexer in C, you could use LLVM's C [1] or Python [2] APIs. Plus, you can handle C++ without having to implement your own C++ parser from scratch, which is a much larger undertaking than C99 plus a few GNU extensions.

[1]: https://github.com/llvm-mirror/clang/blob/fb2a26cc2e40e007f1...
[2]: https://github.com/llvm-mirror/clang/blob/master/bindings/py...
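
For a rough idea of what the Python route could look like, here is a minimal sketch that lists function definitions in a C file through the `clang.cindex` bindings. It assumes the bindings and a matching libclang are installed; the helper names `function_definitions` and `index_file` and the `-std=c99` flag are my own choices for illustration, not anything from the article.

```python
def function_definitions(root):
    """Yield (name, line, column) for each function definition under `root`.

    `root` is expected to be a clang.cindex.Cursor, but only walk_preorder(),
    kind.name, is_definition(), spelling and location are used here.
    """
    for cursor in root.walk_preorder():
        if cursor.kind.name == "FUNCTION_DECL" and cursor.is_definition():
            yield cursor.spelling, cursor.location.line, cursor.location.column


def index_file(path, args=("-std=c99",)):
    """Parse one C file with libclang and list its function definitions."""
    # Imported lazily so the helper above can be exercised without libclang.
    from clang.cindex import Index

    translation_unit = Index.create().parse(path, args=list(args))
    return list(function_definitions(translation_unit.cursor))
```

Calling `index_file("parse.c")` would return a list of `(name, line, column)` tuples. Note that the walk also visits declarations pulled in from headers, so a real indexer would probably filter on the cursor's source file.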


One problem with OpenGrok et al is scale. I already have a service which is designed to run arbitrary user tasks in an environment configured for their project's needs, so I wanted something that could take advantage of that.

As for parsing C++, since LLVM is written in C++, using it to write a C++ annotator would be a natural fit :) But C and C++ are different languages, and I don't wish to require LLVM to deal with C. LLVM is one of the largest open source projects on the net, and it requires a lot more complexity and compile time to utilize under these circumstances. On the other hand, I came up with a solution which is <1,300 lines of code and won't grow much more as it expands to support a broader set of C extensions.

There does exist prior art, but I deliberately chose to go with the lowest common denominator to provide support for a lot of use-cases we can't predict, in an environment which gives users more control over its behavior. I think over time it will be pretty easy to plug the prior art into this system, but harder to plug their systems into novel use-cases. The existing solutions are not always the best, but I did put in a lot of research time to validate that assumption.

GitHub also recently open-sourced their Haskell-based Semantic, which annotates and cross-references a whole bunch of languages (all the languages any of our clients use), and is built on tree-sitter, so there's, like, several levels of prior art available here.

Another issue with Semantic that makes me less thrilled about using it here: say it doesn't support programming language $x, and you don't know or want to use Haskell, but you do know and want to use language $x. To add it to GitHub, you have to learn Haskell, which is no small mountain to climb. To add it to SourceHut, you can just leverage $x's existing tools.

But, plugging Semantic into SourceHut should be totally possible with some mild massaging of the output JSON.

Semantic is in Haskell, but since it doesn't use GHC, it cannot handle Haskell well (can't e.g. resolve type classes).

If I wanted good code review for Haskell, I figure it would be best to translate HIE files (https://www.haskell.org/ghc/blog/20190626-HIEFiles.html) to LSIF (https://github.com/mpickering/hie-lsif), which is supported in VSCode. Because Semantic only parses, GitHub on its own will not be as powerful. If I then just make an LSIF to SourceHut converter, SourceHut will have better annotations than GitHub...
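
As a rough illustration of the first step of such a converter, the sketch below reads an LSIF dump (one JSON object per line) and groups the single-line ranges by document. The `document`, `range` and `contains` labels are standard LSIF. The output shape (`lineno`, `colno`, `len`, keyed by document URI) is purely hypothetical and is not SourceHut's actual annotation schema. Resolving definitions and references through LSIF's result-set edges is left out.

```python
import json
from collections import defaultdict


def lsif_to_annotations(lines):
    """Group single-line LSIF ranges by document URI.

    `lines` is an iterable of LSIF JSON lines. The returned dict maps each
    document URI to a list of {"lineno", "colno", "len"} entries; these field
    names are an assumption made for this sketch.
    """
    documents = {}                # document vertex id -> uri
    ranges = {}                   # range vertex id -> range vertex
    contains = defaultdict(list)  # outV id -> ids listed in inVs

    for line in lines:
        line = line.strip()
        if not line:
            continue
        element = json.loads(line)
        kind, label = element.get("type"), element.get("label")
        if kind == "vertex" and label == "document":
            documents[element["id"]] = element["uri"]
        elif kind == "vertex" and label == "range":
            ranges[element["id"]] = element
        elif kind == "edge" and label == "contains":
            contains[element["outV"]].extend(element["inVs"])

    annotations = {}
    for doc_id, uri in documents.items():
        entries = []
        for range_id in contains.get(doc_id, []):
            r = ranges.get(range_id)
            if r is None or r["start"]["line"] != r["end"]["line"]:
                continue  # skip unknown ids and multi-line ranges
            entries.append({
                "lineno": r["start"]["line"] + 1,    # LSIF lines are 0-based
                "colno": r["start"]["character"] + 1,
                "len": r["end"]["character"] - r["start"]["character"],
            })
        annotations[uri] = sorted(entries, key=lambda e: (e["lineno"], e["colno"]))
    return annotations
```

With a dump on disk, this could be run as `lsif_to_annotations(open("dump.lsif"))`, where the file name is again only an example.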