Hacker News new | ask | show | jobs
by lifthrasiir 1894 days ago
I can't tell if this post is presenting a trivial idea as a novel one or not. Typical compression algorithms indeed are not good at compressing a short text, but only because they have no prior knowledge about the input distribution. There are several ways to feed that knowledge to existing algorithms (e.g. zstandard custom dictionaries).

The "Type-Based Compression Algorithm" presented in the post does not do any kind of distribution modelling. It simply assumes that every possible input is uniformly distributed, hence incompressible, and that's obviously not true. The post gives an example of Turkish car number plates; its first two digits indicate the province code, so you will see certain two-digit prefixes more frequently. TBCA can't see this distribution.

TBCA can also try to limit the range of first two digits for the better coding, but it will break with a new province code. Typical compression algorithms might be inefficient for those inputs but can surely handle them. This is perhaps the biggest drawback of TBCA: a wrong assumption of the input data results in a hard failure (unless you have an escape code). It is not the same kind of algorithm you can compare with general compression algorithms. It is just a clever database schema that gives some compression without those general compression algorithms.

1 comments

My one of the main goals is protect the database structure as it is, because in big data age we should get data as fast as possible. I see your point to see this algorithm as a basically some sort of look-up table but actually it is not.

For example, we can think a well-developed city carries all data related to the city and the people to a big database. Then this city uses a TBCA that specialized for only the this big database's needs like a framework or engine method. However, this specific TBCA is not totally different than other TBCAs used in other types of databases like a game database since they have common propertries like people's names and surnames, the structure of a name database is generally same but TBCA plays huge rol in here, you can configure your algorithm with your needs like an optimizing. Today I am not sure how it should be done, maybe with an ML algorithm.

I wrote too much I know but my point is TBCA is not an specific algorithm like gzip or LZW it is a sub branch of compression like an universal set.

In the future, There may communities share their specificated algorithms for some structures and their datasets( frequency analysis). It will becomes a pool that you can choose best engine(structural method) and best dataset(freq. analysis) from there.

> I wrote too much I know but my point is TBCA is not an specific algorithm like gzip or LZW it is a sub branch of compression like an universal set.

Yes, TBCA is a scheme not a specific algorithm (I thought I was clear enough in the reply, but sorry if it wasn't). In fact I've actually done the same thing with my own database as well in a semi-automated fashion based on a pattern. For example I had a string with three parts: a number 1--214, an optional ', a number -9--99. My code accepts a pattern `[1..214]['?].[-9..99]` and generates a code that packs this into 8 + 1 + 7 = 16 bits. This works because I was dealing with the Unicode Character Database (the example being kRSJapanase), so I knew its exact pattern without an exception and had no migration problem since upgrading to newer UCD is a non-trivial problem anyway. Also I wanted to put the entirety of UCD to the shared memory, so I controlled most of the data structure to make this compression actually worthy.

My issue with TBCA presented in this way is that it looks like a direct replacement to general compression algorithms when it isn't. I regard this as a database schema because it is akin to RDBMS normalization: if you compress a name "<given> <sur>" (or vice versa) in this way, you can equally have a separate name parts table and two indices in the original table. The only difference is that you have hard-coded that name parts table into your app. I believe you should instead have done caching, so name parts table still exists but you can refer to the memory if you can. That makes a better general approach than TBCA, and also shows that it can't be compared with compression algorithms.