by lifthrasiir 1896 days ago
> I wrote too much I know but my point is TBCA is not an specific algorithm like gzip or LZW it is a sub branch of compression like an universal set.

Yes, TBCA is a scheme, not a specific algorithm (I thought I was clear enough in my reply, but sorry if I wasn't). In fact I've done the same thing with my own database in a semi-automated fashion, driven by a pattern. For example, I had a string with three parts: a number from 1 to 214, an optional ', and a number from -9 to 99. My code accepts a pattern `[1..214]['?].[-9..99]` and generates code that packs this into 8 + 1 + 7 = 16 bits. This works because I was dealing with the Unicode Character Database (the example being kRSJapanese), so I knew its exact pattern without exception, and I had no extra migration problem since upgrading to a newer UCD is a non-trivial problem anyway. I also wanted to put the entirety of the UCD into shared memory, so I controlled most of the data structure to make this compression actually worthwhile.
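
To make the 8 + 1 + 7 split concrete, here is a minimal sketch of what the generated packing code could look like. It is hand-written rather than generated from the pattern, and the names `pack_rs` and `unpack_rs` as well as the exact bit layout are assumptions for illustration, not my actual code.

```python
def pack_rs(text: str) -> int:
    """Pack a string matching [1..214]['?].[-9..99] into a 16-bit integer.

    Assumed layout: bits 15-8 hold the first number, bit 7 the optional
    apostrophe, and bits 6-0 the second number offset by 9.
    """
    head, tail = text.split(".")
    prime = head.endswith("'")
    if prime:
        head = head[:-1]
    first, second = int(head), int(tail)
    if not (1 <= first <= 214 and -9 <= second <= 99):
        raise ValueError(f"value out of range: {text!r}")
    # -9..99 is 109 distinct values, so the offset value fits in 7 bits.
    return (first << 8) | (int(prime) << 7) | (second + 9)


def unpack_rs(code: int) -> str:
    """Inverse of pack_rs."""
    first = code >> 8
    prime = "'" if code & 0x80 else ""
    second = (code & 0x7F) - 9
    return f"{first}{prime}.{second}"
```

The point is that this only works because every field has a known, closed range; a single out-of-pattern value would break the fixed-width layout.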

My issue with TBCA presented in this way is that it looks like a direct replacement for general compression algorithms when it isn't. I regard it as a database schema because it is akin to RDBMS normalization: if you compress a name "<given> <sur>" (or vice versa) in this way, you can equally have a separate name parts table and two indices in the original table. The only difference is that you have hard-coded that name parts table into your app. I believe you should have used caching instead, so that the name parts table still exists but lookups are served from memory whenever possible. That is a better general approach than TBCA, and it also shows that TBCA can't be compared with compression algorithms.
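
A rough sketch of the normalization view, assuming a hypothetical `NamePartsTable` class that stands in for the separate table (in a real setup it would be a database table, with this in-memory structure acting as its cache):

```python
class NamePartsTable:
    """Stand-in for a separate name parts table, held in memory."""

    def __init__(self):
        self._parts = []   # index -> name part
        self._index = {}   # name part -> index

    def intern(self, part: str) -> int:
        """Return the index of a part, adding it to the table if it is new."""
        if part not in self._index:
            self._index[part] = len(self._parts)
            self._parts.append(part)
        return self._index[part]

    def lookup(self, idx: int) -> str:
        return self._parts[idx]


def encode_name(table: NamePartsTable, full_name: str) -> tuple[int, int]:
    """Replace "<given> <sur>" with two indices into the name parts table."""
    given, sur = full_name.split(" ", 1)
    return table.intern(given), table.intern(sur)


def decode_name(table: NamePartsTable, row: tuple[int, int]) -> str:
    return f"{table.lookup(row[0])} {table.lookup(row[1])}"
```

Unlike a table hard-coded into the app, this one grows when a new name part shows up, which is the difference I'm pointing at.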