

	by bmh100 3968 days ago
	Could you provide more information about your SMILES test? How many unique symbols were there? How does gzip do? This is an interesting use case.

dalke 3968 days ago

Sure. I'm switching this conversation to email though, using the gmail account in your profile. Short version is, I trained it on the RDKit-generated SMILES strings from ChEMBL-20. Three of the strings look like this:

    CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
    CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
    O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1

On the raw data set (one record per line), wc reports:

     1455763 1455763 82882385

while | gzip -c | wc -c reports 18773892.
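
For readers who want to reproduce this kind of measurement on their own data, here is a minimal sketch in Python. The function name `size_stats` is made up for illustration, and the three strings are only the examples quoted above, not the full ChEMBL-20 set. Python's `gzip.compress` writes a slightly different header than the command-line tool, so byte counts may differ a little from `gzip -c | wc -c`.

```python
import gzip

def size_stats(smiles_list):
    """Return (line_count, raw_bytes, gzip_bytes) for one-record-per-line data."""
    # One record per line, each terminated by a newline, as wc would see it.
    raw = "".join(s + "\n" for s in smiles_list).encode("ascii")
    # Level 6 is the default level of the gzip command-line tool.
    compressed = gzip.compress(raw, compresslevel=6)
    return len(smiles_list), len(raw), len(compressed)

examples = [
    "CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1",
    "CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N",
    "O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1",
]

lines, raw_bytes, gz_bytes = size_stats(examples)
print(lines, raw_bytes, gz_bytes)

# Ratio for the full data set, using the two byte counts quoted above.
ratio = 18773892 / 82882385
print(f"gzip output is {ratio:.1%} of the raw size")
```

With the quoted figures, gzip brings the full file down to a bit under a quarter of its raw size. Three short strings are far too small a sample to say anything about compression on their own.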

TheLoneWolfling 3968 days ago

> I'm switching this conversation to email though

I wish you wouldn't do that. That defeats the entire point of a website such as this. Just because you don't think that this is interesting to random people doesn't mean that random people don't think this is interesting.

dalke 3967 days ago

The email I sent was 178K long, with 1000 real-world examples (to get an idea of the character distribution), and the .h file model generated by shoco on the entire data set.

Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requiring less domain knowledge and more CS experience.)

As this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to respond. Since HN doesn't have notifications, how long should I actively check this thread for replies?

By sending email, I also invite a response after a couple of months, should that be the case. (Yesterday I got a follow-up on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN.

You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel free to ask interesting questions.

dalke 3967 days ago

> "You'll note that I also answered what bmh100 asked for here."

Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.

   'c' 17253540     'S'   432387     '8'     5960
   'C' 14942972     'l'   373194     'i'     3307
   ')'  7687638     '/'   342041     '9'     2354
   '('  7687638     's'   164740     '0'     1457
   'O'  5116146     'o'   161862     'e'     1421
   '1'  4388585     '+'   137743     'K'     1280
   '2'  3327339     '5'   119946     'L'      442
   '='  3311807     '.'    98163     'A'      287
   'N'  2949791     '#'    85482     'Z'      147
   '@'  2241799     'B'    82464     'b'      122
   '['  1952216     'r'    79603     'g'       87
   ']'  1952216    '\\'    78388     'M'       64
   'n'  1707728     'P'    42036     'T'       48
   '3'  1597703     '6'    29852     'p'       37
   'H'  1514805     'I'    15180     't'       21
   '-'   521529     '7'    11771     'R'       12
   'F'   497963     'a'    11311     'V'        8
   '4'   485537     '%'     6450     'X'        3
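
A table like this takes only a few lines to produce. The sketch below is an assumption about how it could be done, not the script actually used; `symbol_counts` is a hypothetical name, and it counts single characters, so a two-letter element such as Cl shows up as 'C' plus 'l', which is how the table above appears to be laid out.

```python
from collections import Counter

def symbol_counts(smiles_list):
    """Count every individual character across all SMILES strings."""
    counts = Counter()
    for smiles in smiles_list:
        counts.update(smiles)  # updating with a str counts each character
    return counts

examples = [
    "CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1",
    "CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N",
]

counts = symbol_counts(examples)
# Print symbols from most to least common, in the style of the table above.
for symbol, n in counts.most_common():
    print(f"{symbol!r} {n:8d}")
```

One sanity check that falls out of the counts: '(' and ')' match each other, as do '[' and ']', which is what balanced SMILES strings should give.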

and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:

  'cc' 7981192   ')N'  998239   '2C'  458454   'N1'  245581   ... many lines omitted ...
  'CC' 3965060   ']('  943201   ')O'  456391   'n1'  228523   '91'       1
  'O)' 3141345  '1\n'  892805   'F)'  456060   'O='  228415   '#S'       1
  'C(' 2841639   'NC'  850938   'C2'  426474   '3C'  223729   'lS'       1
  'c1' 2792900   '2)'  793573   'N('  413951   'C='  217307   '03'       1
  '(C' 2596059   '(O'  779762   '3)'  410549  'O\n'  211774  'P\\'       1
  '=O' 2496004   'Cc'  764483   '1)'  406001   '/C'  207957   'PN'       1
  ')c' 2489724   'OC'  764313   'Cl'  373140   '[n'  205910   'PO'       1
  'c(' 2444133   'CN'  744679   '-c'  362109   ']2'  203805   'K]'       1
  '(=' 2296603   '@@'  732946   'Oc'  358157   'C3'  202290   'p3'       1
  'c2' 2105356   ')['  705999   ')n'  353823   ')='  196635   'I1'       1
  ')C' 1692206   'nc'  699014   'cn'  345084   'S('  196491   'i@'       1
  '[C' 1513171   'CO'  675894   '(F'  344658   '2n'  188360   'II'       1
  'H]' 1512910   '1C'  660311   '(c'  344068   'n2'  180310   'i-'       1
  'C@' 1507657   '=C'  611606   'N)'  332280   'nH'  177385   ']I'       1
  'C)' 1489856   '3c'  598751   ']1'  308170   '1n'  174940   'Bc'       1
  '1c' 1463358   '(N'  595762   'Nc'  298653   '@]'  174511   'Bi'       1
  '@H' 1333765   'C1'  588362   'c4'  291931   '4c'  173949   'B.'       1
  '2c' 1204003   ')('  504595   '(-'  290801   '1O'  173802   'H7'       1
  'c3' 1031134   'C['  478529   'l)'  289924   'N['  170445   '.n'       1
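
The bigram counts can be sketched the same way. This is a guess at the method rather than the original code: `bigram_counts` is a hypothetical name, each record gets a trailing "\n" as the end-of-word marker described above, and no start-of-word marker is added, so bigrams never span two records.

```python
from collections import Counter

def bigram_counts(smiles_list):
    """Count adjacent character pairs, with '\\n' marking the end of each record."""
    counts = Counter()
    for smiles in smiles_list:
        word = smiles + "\n"  # end-of-word marker
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

examples = [
    "CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1",
    "CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N",
]

bigrams = bigram_counts(examples)
# Show the ten most common pairs; repr() makes the newline marker visible.
for pair, n in bigrams.most_common(10):
    print(f"{pair!r} {n:8d}")
```

Under these assumptions a record of n characters contributes exactly n bigrams, the last of which always ends in "\n".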

Not something I would think is particularly interesting for an HN post.

TheLoneWolfling 3967 days ago

If it's not confidential (and I am assuming it isn't), why not just link it in a gist or something? That way other people can also take a crack at it.

Among other things, "a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar" is something I would be interested in. And, considering the upvotes I got for my grandparent comment, something that other people would be interested in as well.

Also: http://hnnotify.com/

dalke 3967 days ago

I do not like "a gist or something" and have used such services only a handful of times. I dislike how they decontextualize the conversation and how they require trust in an additional resource. E.g., when I come across a gist during a web search, it's hard to figure out the point.

By comparison, an email provides the full context, and is easier to integrate into a workflow. For example, I can drag an attachment directly into my editor. A gist requires additional steps.

Regarding hnnotify.com, I enjoy the ability to let go of most HN threads after a couple of days. This thread is one of a handful of exceptions. Can I really subscribe to one-and-only-one thread? I don't see that it's worthwhile to set up a third-party account and activate the service for a rare event. In any case, if it takes a month for bmh100 to evaluate the code then the HN thread will be closed, so there's only a narrow window for which this service is useful.

I do not share your optimism in the random contributions of others. To start, it's not like I haven't talked about this before. See https://bitbucket.org/dalke/smilez and http://www.dalkescientific.com/writings/diary/archive/2007/0... (under "Compressing SMILES") for two examples. Have I gotten any feedback about them? No. So why put more effort into hoping for a one-in-a-million event, which is what you suggest, instead of optimizing the chance of getting a followup from someone who specifically expressed interest? Experience says that I should optimize for the latter.

What is your interest in the SMILES notation that can't be resolved through https://en.wikipedia.org/wiki/Simplified_molecular-input_lin... ? I would be glad to tell you more. I have worked with different aspects of SMILES for over 15 years and co-authored the OpenSMILES specification. I have also written many blog posts about different aspects of how to work with SMILES. And gotten few followups.

What skill set do you have, that I might tailor a response? Are you comfortable installing from source, do you prefer one of the GNU/Linux packaging systems, or Mac/homebrew? Or are you happiest with extracting data from a database dump? My 'synopsis .. of possible alternatives' was more an offer to follow up on any of those options, but was of itself incomplete. It works because email has the implied statement that I will respond to further questions.

If you don't have a specific interest and more generically want to be informed, then perhaps you can understand why I would prefer to use other mechanisms, like my blog posts, which are more likely to get the kinds of responses I'm looking for than spending time tuning an off-topic HN comment.

TheLoneWolfling 3967 days ago

I guess the difference is:

I consider the possibility of a random person coming across something and finding it interesting a worthy goal in and of itself. You do not.

This is a rather fundamental difference, and as such I do not think that anything I say will reconcile the matter.
