Hacker News new | ask | show | jobs
by jlokier 1073 days ago
> Any examples you're thinking of?

The three I know about are all Ethereum account/state history stores.

Besu does something interesting they call Bonsai Trees. It reportedly uses a lot less storage (about 1/10, 1.1TB instead of 12TB) than what came before in certain modes that store all the account history ("archive node"), and a bit less storage in other modes. I haven't studied how Bonsai trees work, so I don't know if the method is useful for non-blockchain styles of account history.

Erigon is well known as using perhaps the smallest storage when storing all the account history, and this is what it's optimised for. Erigon uses a fairly classic B-tree, for performance reasons (lower IOPS than LSM-trees for relevant operations). Their tricks are in how the data is organised in the B-tree, being better than the classic Ethereum methods. Erigon 2/3/4 (not sure, they merged plans) uses a new storage format that is supposed to do the same thing better. I don't think Erigon's tricks would be applicable to TigerBeetle, because TB doesn't have the storage problem of keeping everythng in Merkle trees in the first place. But it's a place to look as Ethereum's leader for archive node storage.

Last but not least, I have a work in progress storage format that is more efficient than either of the above, in that it uses much less storage for all the account history (due to efficient queryable and updatable adaptive compression), and fewer IOPS to look up individual account states (because random access is important). It uses a novel storage tree structure that is neither a B-tree nor an LSM-tree, but has some aspects of both, in order to reach for lower IOPS in various scenarios that each of those trees ar better or worse at, and it is also compresses the data in a novel way, which is important to storing histories and patterned data efficiently. It was originally motivated by Ethereum state history storage (since there's a clear benchmark), but I'm thinking it could be useful for time series data.