by dalke
3969 days ago
I have a background project exploring how to compress SMILES strings. SMILES is a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters. I previously evaluated a special-purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I would like. Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each) and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".
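To make the n-gram plus dynamic programming idea concrete, here is a minimal Python sketch. It is not the actual tool mentioned above. The `encode` function and the tiny `codebook` are hypothetical, and it assumes every token (a codebook n-gram or a single literal character) costs the same to store, so the best encoding is simply the one with the fewest tokens.

```python
def encode(smiles, codebook):
    """Split `smiles` into the fewest tokens.

    Each token is either an n-gram from `codebook` or a single literal
    character. Assumption: all tokens cost the same to store.
    """
    n = len(smiles)
    max_len = max((len(gram) for gram in codebook), default=1)

    # best[i] = minimum number of tokens needed to cover smiles[:i]
    best = [0] + [n + 1] * n
    # back[i] = length of the last token in an optimal cover of smiles[:i]
    back = [0] * (n + 1)

    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            piece = smiles[i - k:i]
            # A single character is always allowed as a literal fallback.
            if k == 1 or piece in codebook:
                if best[i - k] + 1 < best[i]:
                    best[i] = best[i - k] + 1
                    back[i] = k

    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        k = back[i]
        tokens.append(smiles[i - k:i])
        i -= k
    tokens.reverse()
    return tokens


# Hypothetical codebook; a real one would be learned from a training set.
codebook = {"CC", "CCC", "C1", "cc", "c1ccccc1"}

for s in ("C1CCCCC1", "c1ccccc1", "CCO"):
    tokens = encode(s, codebook)
    print(s, "->", tokens, f"({len(tokens)} tokens for {len(s)} characters)")
```

A greedy longest-match encoder can miss the shortest tokenization when n-grams overlap, which is the reason to use dynamic programming here. A real encoder would also need to weight tokens by their actual encoded size instead of counting them equally.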