Hacker News new | ask | show | jobs
by willvarfar 2300 days ago
Hmm, a lot of effort to make documents small, but then storing it in mongodb?!?!

If size and performance are a focus, just store them in a normal sorted table with compression (e.g. leveldb, or mysql using rocksdb).

This means all these small documents can be compressed with repetition between games and not just within each game.

And probably much much faster and simpler etc.

Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game, and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.

3 comments

I wonder if you could store them as a trie, I feel like that would give you quite good savings given the commonality of openings. Or at least use a trie for the openings and then apply the Huffman coding to the rest of the game and store that separately.
Right. Every possible chess game as a sequence of legal positions can be uniquely enumerated. Per Wikipedia the game-tree complexity is around 10^123, so you'll need about a 124 digit number to represent them. This is as dense as it can possibly get.

On the other hand, lichess supports alternative chess modes and starting boards that may or may not be reachable from the standard initial configuration and legal moves, so this won't work for those use cases.

Postgres TEXT columns are pretty good for compression as well [1]

[1] https://www.postgresql.org/docs/current/storage-toast.html

Games are quite small -- under 50 turns, each under 2 bytes in binary, but more if you start with simple text languge before compression. You need to compress each game individually. Would database-wide compression work?
You do not need to compress every game individually. Behind the scenes, your mainstream database is storing the data in pages, and these pages - containing many adjacent rows - can be compressed.

The compression level achieved at a page level is much higher than compressing rows individually because there is lots of repetition between rows. It also speeds up io and other good things. It is almost always good, which is why the most modern database storage engines do it and do it by default.

Consider a csv file, and compare it to the same data stored as json objects, one row per line. The uncompressed json file is going to be much bigger, as the columns are repeated in every line. But both files gzip to much the same size, because all those keys are repeated again and again and the two files have basically the same entropy.

On the other hand, compressing each line in the file individually would be a poor choice, giving relatively poor gains.

There were database engines that did row-level compression, but these performed poorly and I know of nobody who used eg innodb compression.

I see several levels of compression:

1) First, as you mentioned, the limited symbol table for expressing all the moves.

2) But not every move is possible; you just need a way to order all the possible moves and then say the nth of those moves.

3) Even better: not all of those moves are equally likely. You can define (n cycles of) a chess playing engine that ranks all moves and reserves shorter symbols for the best moves (as judged within that time frame). Players tend to select higher ranked moves, so that’s another regularity to exploit.

Of course, 3) comes at the cost of compression/extraction time.