Hacker News new | ask | show | jobs
by ynik 2522 days ago
> Why doesn’t every language give me a batteries-included digraph ADT?

I think such a type would be less useful than you'd think, for precisely the same reason why a linked list in the stdlib is pretty much useless: almost always, you don't want the stdlib allocating nodes; you want an intrusive data structure instead. Really, the problem is that the compiler needs to store extra data with the nodes that the stdlib can't know about, but you don't want two separate types `stdlib::GraphNode` and `compiler::ControlFlowNode` because you usually need to be able to convert between these types in both directions. (one direction can be handled by embedding one of the types in the other; but the reverse direction will require overhead for an extra pointer, or horribly unsafe pointer arithmetic)

Of course in Rust, there could still be a digraph trait and the stdlib could still provide generic algorithms.

Though it's also not so rare in compilers that nodes are members of multiple graphs simultaneously (with different edges in each, e.g. control flow nodes are typically not just part of the control flow graph, but also belong to a dominator tree). It's non-trivial to create a graph abstraction that can handle all these cases while remaining efficient (you don't want to put nodes in a HashSet just to check whether a graph algorithm already visited them), so it's not surprising that compiler developers don't bother and just write the algorithm directly for their particular data structures. In the end, most graph algorithms are only about a dozen lines, much simpler than the abstractions that would be required to re-use them across completely different graphs.

2 comments

> In the end, most graph algorithms are only about a dozen lines, much simpler than the abstractions that would be required to re-use them across completely different graphs

I guess that depends on the language; in dynamic, pure-functional languages (like Erlang) there’s nowhere you can go for further efficiency beyond “digraph expressed as labelled V and E sets held in dictionaries”, so the data structures themselves actually are quite generic, and you would never really need a separate specialization of them.

You’re probably right that in systems-level languages where you’re effectively programming a pointer / RAM word machine, you can get better specialized data structures per use-case, so writing the self-hosting compiler isn’t going to just “throw off” a handy reusable stdlib digraph library as a side-effect.

But in the cases where you’re not using a systems-level language—i.e. the cases where the graphwise algorithms you’re writing are being written against the same stdlib ADT anyway, even if you reimplemented it compiler-side—it doesn’t make sense to me to hide these algorithms away inside the compiler application package. They’re generic algorithms! (And since the ADT is a formal one from a paper, they’re probably very stable ones, too!)

As a restatement of my original premise: if your digraphs are generic, and topsort is a generic function on digraphs with only one obvious implementation given the digraph ADT you’re operating against; and given that you’ve implemented topsort for these stdlib digraphs; why aren’t you putting that algorithm in the stdlib, as a function of stdlib digraphs?

(And it’s not just a question of only putting the most common stuff into the stdlib. Erlang has lots of very arcane digraph operations only useful in proof-verification sitting around in the stdlib in the digraph_utils module. There’s no reason that operations only useful in compilers wouldn’t belong in there equally well. But instead, they’re in the compiler. Kinda weird, if you ask me.)

Yet Boos.Graph (a c++ library) is a generic library of graph algorithms that, in the spirit of the STL, abstracts away the graph representation. Whether the graph is represented via direct pointers or adjacency matrices (however represed) they can be adapted to work very efficiently with the library.
> you don't want the stdlib allocating nodes; you want an intrusive data structure instead.

Don't C++'s std::list and Rust's std::collections::LinkedList effectively implement the moral equivalent (insofar as memory layout is concerned) of an intrusive list, though?

Yes, but they hide the memory layout as an implementation detail, thus negating most advantages of an intrusive list.

You can't call `std::list<Node>::erase()` with a Node* , and you can't write a function that converts from Node* to `std::list<Node>::iterator` in standard C++, even though it's really just a matter of subtracting a fixed number of bytes from the pointer value. So you instead need to store the `std::list<Node>::iterator` in the `Node`, wasting memory.

You can kind of cheat by always using `std::list<Node>::iterator` instead of Node* , but that means you can't use normal methods, because the `this` pointer will lack access to the iterator. And you always need to pass around a pair of `std::list&` and iterator, because the iterator alone isn't sufficient to delete the node or insert elements before/after it. Usually the code ends up being a lot simpler if class Node just reimplements a linked list.

And that's still the simple case where each node is only contained in a single data structure at a time, which (at least in compilers) is rarely sufficient.

There's boost::intrusive::list which IME actually works quite well. There is a node type which elements derive from or contain, so you can get an iterator from an element, as well as the other way round. To include an element in multiple lists, just aggregate the node type multiple times.
Boost.Intrusive is beyond awesome. You can add nodes not only to multiple lists, but also to maps, set, hash_maps to build multi-indexed data structures (and with delete hooks all indices can be kept consistent).