Hacker News new | ask | show | jobs
Shoco: a fast compressor for short strings (ed-von-schleck.github.io)
31 points by multipass 3969 days ago
6 comments

I have a background project of exploring how to compress SMILES strings, which is a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters.

I previously evaluated a special purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES string. I also tried the off-the-shelf femtozip which got about 60% compression but had more decompression overhead than I like.

Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".

Could you provide more information about your SMILES test? How many unique symbols were there? How does gzip do? This is an interesting use case.
Sure. I'm switching this conversation to email though, using the gmail account in your profile. Short version is, I trained it on the RDKit-generated SMILES strings from ChEMBL-20. Three of the strings look like this:

    CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
    CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
    O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1
On the raw data set (on record per line), wc reports:

     1455763 1455763 82882385
while | gzip -c | wc -c reports 18773892.
> I'm switching this conversation to email though

I wish you wouldn't do that. That defeats the entire point of a website such as this. Just because you don't think that this is interesting to random people doesn't mean that random people don't think this is interesting.

The email I sent was 178K long, with 1000 real-world examples (to get an idea of the character distribution), and the .h file model generated by shoco on the entire data set.

Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requiring less domain knowledge and more CS experience.)

This this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to response. Since HN doesn't have notifications, how long should I actively check this thread for replies?

By sending email, I also invite a response after a couple of months, should that be the case. (I yesterday got a followup on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN.

You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel feel to ask interesting questions.

> "You'll note that I also answered what bmh100 asked for here."

Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.

   'c' 17253540     'S'   432387     '8'     5960
   'C' 14942972     'l'   373194     'i'     3307
   ')'  7687638     '/'   342041     '9'     2354
   '('  7687638     's'   164740     '0'     1457
   'O'  5116146     'o'   161862     'e'     1421
   '1'  4388585     '+'   137743     'K'     1280
   '2'  3327339     '5'   119946     'L'      442
   '='  3311807     '.'    98163     'A'      287
   'N'  2949791     '#'    85482     'Z'      147
   '@'  2241799     'B'    82464     'b'      122
   '['  1952216     'r'    79603     'g'       87
   ']'  1952216    '\\'    78388     'M'       64
   'n'  1707728     'P'    42036     'T'       48
   '3'  1597703     '6'    29852     'p'       37
   'H'  1514805     'I'    15180     't'       21
   '-'   521529     '7'    11771     'R'       12
   'F'   497963     'a'    11311     'V'        8
   '4'   485537     '%'     6450     'X'        3
and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:

  'cc' 7981192   ')N'  998239   '2C'  458454   'N1'  245581   ... many lines omitted ...
  'CC' 3965060   ']('  943201   ')O'  456391   'n1'  228523   '91'       1
  'O)' 3141345  '1\n'  892805   'F)'  456060   'O='  228415   '#S'       1
  'C(' 2841639   'NC'  850938   'C2'  426474   '3C'  223729   'lS'       1
  'c1' 2792900   '2)'  793573   'N('  413951   'C='  217307   '03'       1
  '(C' 2596059   '(O'  779762   '3)'  410549  'O\n'  211774  'P\\'       1
  '=O' 2496004   'Cc'  764483   '1)'  406001   '/C'  207957   'PN'       1
  ')c' 2489724   'OC'  764313   'Cl'  373140   '[n'  205910   'PO'       1
  'c(' 2444133   'CN'  744679   '-c'  362109   ']2'  203805   'K]'       1
  '(=' 2296603   '@@'  732946   'Oc'  358157   'C3'  202290   'p3'       1
  'c2' 2105356   ')['  705999   ')n'  353823   ')='  196635   'I1'       1
  ')C' 1692206   'nc'  699014   'cn'  345084   'S('  196491   'i@'       1
  '[C' 1513171   'CO'  675894   '(F'  344658   '2n'  188360   'II'       1
  'H]' 1512910   '1C'  660311   '(c'  344068   'n2'  180310   'i-'       1
  'C@' 1507657   '=C'  611606   'N)'  332280   'nH'  177385   ']I'       1
  'C)' 1489856   '3c'  598751   ']1'  308170   '1n'  174940   'Bc'       1
  '1c' 1463358   '(N'  595762   'Nc'  298653   '@]'  174511   'Bi'       1
  '@H' 1333765   'C1'  588362   'c4'  291931   '4c'  173949   'B.'       1
  '2c' 1204003   ')('  504595   '(-'  290801   '1O'  173802   'H7'       1
  'c3' 1031134   'C['  478529   'l)'  289924   'N['  170445   '.n'       1
Not something I would think is particularly interesting for an HN post.
If it's not confidential (and I am assuming it isn't), why not just link it in a gist or something? That way other people can also take a crack at it.

Among other things, "a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar" is something I would be interested in. And, considering the upvotes I got for my grandparent comment, something that other people would be interested in as well.

Also: http://hnnotify.com/

Look how well it can compress "fofofofofofofofofofofo".

50%

Look how well it can compress "ababababababababababab".

0%

Will test against smaz for our internal JSON compressed protocol. smaz compressed fine but was too slow. The ability to train the model sounds convincing.
I get negative compression percentage when I put words with "é" in the test box.
In default it doesn't work well with non-ASCII characters.

https://ed-von-schleck.github.io/shoco/#how-it-works

Between this (an ASCII-only compressor in 2015?) and the other aspects brought up here, it seems downright toylike.
I wonder what'd happen if you used this on base64 strings.
I would love to see a blog post about that test, if you're willing.
I can't tell you how many times I've said to myself "if only these very short ASCII strings were even shorter!"
At scale, and if you are paying for transmission costs, it can have a massive impact.