by kronxe 1896 days ago
One of my main goals is to keep the database structure as it is, because in the big data age we should get data as fast as possible. I see your point that this algorithm looks like basically some sort of look-up table, but it actually is not.

For example, imagine a well-developed city that puts all data about the city and its people into one big database. The city then uses a TBCA specialized for this database's needs, like a framework or engine. However, this specific TBCA is not totally different from the TBCAs used in other kinds of databases, such as a game database, since they share common properties like people's names and surnames. The structure of a name database is generally the same, but the TBCA plays a huge role here: you can configure the algorithm to your needs, like an optimization. Today I am not sure how that should be done, maybe with an ML algorithm.

I know I wrote too much, but my point is that TBCA is not a specific algorithm like gzip or LZW; it is a sub-branch of compression, like a universal set.

In the future, there may be communities that share their specialized algorithms for certain structures along with their datasets (frequency analysis). That would become a pool from which you can choose the best engine (structural method) and the best dataset (frequency analysis).

1 comment

> I know I wrote too much, but my point is that TBCA is not a specific algorithm like gzip or LZW; it is a sub-branch of compression, like a universal set.

Yes, TBCA is a scheme, not a specific algorithm (I thought I was clear enough in the reply, but sorry if I wasn't). In fact I've done the same thing with my own database in a semi-automated fashion based on a pattern. For example, I had a string with three parts: a number from 1 to 214, an optional ', and a number from -9 to 99. My code accepts a pattern `[1..214]['?].[-9..99]` and generates code that packs this into 8 + 1 + 7 = 16 bits. This worked because I was dealing with the Unicode Character Database (the example being kRSJapanese), so I knew its exact pattern without exception, and I had no migration problem since upgrading to a newer UCD is a non-trivial problem anyway. I also wanted to put the entirety of the UCD into shared memory, so I controlled most of the data structure to make this compression actually worthwhile.
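To make the 8 + 1 + 7 split concrete, here is a hand-written sketch of what such generated code could look like. It is not the generator itself, and the names (`pack`, `unpack`, `STROKE_BIAS`, `radical`, `strokes`) and the exact bit order are assumptions made for illustration only.

```python
import re

# Assumed field layout, high bits to low bits:
#   first number 1..214 (8 bits) | apostrophe flag (1 bit) | second number -9..99 (7 bits)
_PATTERN = re.compile(r"(\d+)('?)\.(-?\d+)")
STROKE_BIAS = 9  # shifts -9..99 to 0..108 so the value fits in 7 bits


def pack(text: str) -> int:
    """Pack a string such as "120.5" or "9'.-2" into a 16-bit integer."""
    m = _PATTERN.fullmatch(text)
    if m is None:
        raise ValueError(f"does not match the pattern: {text!r}")
    radical = int(m.group(1))
    flag = 1 if m.group(2) else 0
    strokes = int(m.group(3))
    if not 1 <= radical <= 214 or not -9 <= strokes <= 99:
        raise ValueError(f"field out of range: {text!r}")
    return (radical << 8) | (flag << 7) | (strokes + STROKE_BIAS)


def unpack(code: int) -> str:
    """Invert pack(): rebuild the string from the 16-bit integer."""
    radical = code >> 8
    flag = (code >> 7) & 1
    strokes = (code & 0x7F) - STROKE_BIAS
    mark = "'" if flag else ""
    return f"{radical}{mark}.{strokes}"
```

The range checks are what make this safe only when the pattern is known to hold without exception: any value outside it is rejected rather than silently truncated.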

My issue with TBCA presented this way is that it looks like a direct replacement for general compression algorithms when it isn't. I regard it as a database schema because it is akin to RDBMS normalization: if you compress a name "<given> <sur>" (or vice versa) in this way, you can equally have a separate name parts table and two indices into it in the original table. The only difference is that you have hard-coded that name parts table into your app. I believe you should have used caching instead, so the name parts table still exists but lookups are served from memory when possible. That makes a better general approach than TBCA, and it also shows that TBCA can't be compared with compression algorithms.
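A minimal sketch of the alternative I mean, using an in-memory SQLite database and `functools.lru_cache` as the cache. The table and function names (`name_parts`, `people`, `part_id`, `part_text`, `full_name`) are made up for this example, and the cache assumes name parts are never updated or deleted.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE name_parts (
    id   INTEGER PRIMARY KEY,
    part TEXT UNIQUE NOT NULL
);
CREATE TABLE people (
    id       INTEGER PRIMARY KEY,
    given_id INTEGER NOT NULL REFERENCES name_parts(id),
    sur_id   INTEGER NOT NULL REFERENCES name_parts(id)
);
""")


def part_id(part: str) -> int:
    """Return the id of a name part, inserting it on first use."""
    conn.execute("INSERT OR IGNORE INTO name_parts (part) VALUES (?)", (part,))
    row = conn.execute("SELECT id FROM name_parts WHERE part = ?", (part,)).fetchone()
    return row[0]


def add_person(given: str, sur: str) -> int:
    """Store a person as two indices into name_parts."""
    cur = conn.execute(
        "INSERT INTO people (given_id, sur_id) VALUES (?, ?)",
        (part_id(given), part_id(sur)),
    )
    return cur.lastrowid


@lru_cache(maxsize=4096)
def part_text(pid: int) -> str:
    """Look up a name part; repeated ids are answered from memory."""
    return conn.execute("SELECT part FROM name_parts WHERE id = ?", (pid,)).fetchone()[0]


def full_name(person_id: int) -> str:
    given_id, sur_id = conn.execute(
        "SELECT given_id, sur_id FROM people WHERE id = ?", (person_id,)
    ).fetchone()
    return f"{part_text(given_id)} {part_text(sur_id)}"
```

The name parts table stays in the database, so it can grow or be migrated like any other table, while frequent parts end up in the in-process cache instead of being compiled into the app.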