Hacker News new | ask | show | jobs
by cxie 467 days ago
Just spent my morning diving into succinct data structures after seeing this. The memory efficiency is incredible - especially the balanced parentheses tree representing a full node tree in just 2 bits per node! I've been working on a project parsing large (10GB+) XML files for scientific data analysis, and our current approach burns through RAM like crazy. Has anyone here successfully implemented these structures in production systems?

The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.

This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.

4 comments

When I’ve dealt with huge files in .NET, the usual approach is to stream the file such that only some of it is in memory at once. This way you can process files hundreds of GBs. Of course, if you truly need them all in memory at once for some reason I genuinely can’t think of, then you’d need something else.

Does your language have the concept of streaming files?

If you're streaming something row-based like a CSV, or a zipped CSV, then that's usually easy.

But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.

Nothing prevents streaming in theory, it's just far more complicated to write that library.

protobuf sure, but streaming libraries for json (and xml, as in the parent) are extremely common. not harder (maybe even easier) than non-streaming to write, tho more cumbersome to use, so something you'd reach for only if you specifically need it ('cuz of memory constraints)

e.g. standard go json library https://pkg.go.dev/encoding/json#example-Decoder.Decode-Stre...

Yup. I don't remember streaming JSON being common in the early days but now it is. But the absence of streaming protobuf is what has killed me, when dealing with gigantic protobuf files from government agencies (ugh).
Heh, yeah. The protobuf people's expectation was if you had a really large dataset you'd wrap it up in your own mini-protocol of "sequence of protobuf messages". But of course that's way more friction, so in practice it will end up not getting done when it should be (plus also, it requires a certain amount of ability to predict the future).

Lesson for technologists: if you want to make the world a better place arrange your tech such that the lowest-friction path is also the correct path.

(Another example: disasterous multi-byte UTF encodings [correct solution was more friction] vs basically successful UTF8 [correct solution was less friction].)

I don't know if you're still dealing w/ this particular problem for protobufs, but based on my experience with thrift, a very similar library, there are probably some not too terrible ways you can kinda hack up the client-side parsing to be more streaming-ish...

nanopb is designed around streaming. It's limited in a few ways[1] but is designed for use on low-memory systems (microcontrollers) where the whole protobuf message won't necessarily fit into memory at once. Might not help for your use cases though, since it's a C library without a stable ABI.

[1]https://jpa.kapsi.fi/nanopb/docs/#features-and-limitations

In programming languages suitable for enterprise software development there are blessed streaming parsers for XML, because it's a rather common task.

It's very common that other programming languages have basic SAX parsers.

What are these languages that don't which you've encountered?

The low hanging fruit in this area is to do partial unmarshalling or using a SAX parser on a stream. It's likely you'll have to do this to retrieve data and put it in whatever succinct or otherwise efficient data structure.

In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.

I'd also like to mention that XSLT is an often underappreciated approach.

There's a create blog from the creator of RhymeBrain that talks about Succinct Tries: https://stevehanov.ca/blog/index.php?id=120

I'm pretty sure these were used to store the built in dictionaries on early mobile phones, especially for the implementation of T9 word and similar programs.

This might be a little over my head, but i'm not understanding how the balanced parenthesis is conveying anything other than the topology of the tree structure. Are we not accounting for the bits required for a pointer in memory to an object? Or simply the bits required to guarantee the uniqueness of a node in the tree?
You store the structural information separately from the data. The data can be stored sequentially in some traversal order.
He touches on indexes but doesn't really mention the implementation. This is about the primitives.