by lifthrasiir
1894 days ago
I can't tell if this post is presenting a trivial idea as a novel one or not. Typical compression algorithms are indeed not good at compressing short text, but only because they have no prior knowledge about the input distribution. There are several ways to feed that knowledge to existing algorithms (e.g. zstandard custom dictionaries). The "Type-Based Compression Algorithm" presented in the post does not do any kind of distribution modelling. It simply assumes that every possible input is uniformly distributed, hence incompressible, and that's obviously not true. The post gives an example of Turkish car number plates; the first two digits indicate the province code, so you will see certain two-digit prefixes more frequently. TBCA can't see this distribution. TBCA could also try to limit the range of the first two digits for better coding, but it would break with a new province code. Typical compression algorithms might be inefficient for those inputs but can surely handle them. This is perhaps the biggest drawback of TBCA: a wrong assumption about the input data results in a hard failure (unless you have an escape code). It is not the kind of algorithm you can compare with general compression algorithms. It is just a clever database schema that gives some compression without those general compression algorithms.
|
For example, imagine a well-developed city that keeps all data about the city and its people in one big database. That city uses a TBCA specialized for this database's needs, like a framework or engine method. However, this specific TBCA is not totally different from the TBCAs used for other kinds of databases, such as a game database, because they share common properties like people's names and surnames. The structure of a name database is generally the same, but TBCA plays a huge role here: you can configure the algorithm to your needs, like an optimization step. Today I am not sure how that should be done, maybe with an ML algorithm.
I know I wrote too much, but my point is that TBCA is not a specific algorithm like gzip or LZW; it is a sub-branch of compression, like a universal set.
In the future, communities may share their specialized algorithms for particular structures along with their datasets (frequency analysis). That would become a pool from which you can choose the best engine (structural method) and the best dataset (frequency analysis).
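As a rough illustration of what a shared frequency dataset could buy over a purely structural engine, the sketch below compares the cost of a symbol under a uniform assumption with its expected cost under a frequency table (Shannon entropy). The province counts are made-up numbers for illustration only, and the function names are hypothetical.

```python
import math


def uniform_bits(n_symbols: int) -> float:
    """Bits per symbol when every one of n_symbols is assumed equally likely."""
    return math.log2(n_symbols)


def expected_bits(counts: dict[str, int]) -> float:
    """Shannon entropy in bits per symbol for the given frequency table."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0
    )


# Made-up counts of province prefixes, NOT real statistics.
PROVINCE_COUNTS = {"34": 50, "06": 25, "35": 25}

if __name__ == "__main__":
    print("uniform :", uniform_bits(len(PROVINCE_COUNTS)))
    print("modelled:", expected_bits(PROVINCE_COUNTS))
```

The gap between the two numbers is the saving an ideal entropy coder could approach once the "dataset" half of the pool is available; the structural engine alone is stuck at the uniform figure.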