by dalke 3968 days ago

The email I sent was 178K long, with 1000 real-world examples (to get an idea of the character distribution), and the .h file model generated by shoco on the entire data set.

Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requiring less domain knowledge and more CS experience.)

As this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to respond. Since HN doesn't have notifications, how long should I actively check this thread for replies?

By sending email, I also invite a response after a couple of months, should that be the case. (Yesterday I got a followup on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN.

You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel free to ask interesting questions.

dalke 3967 days ago

> "You'll note that I also answered what bmh100 asked for here."

Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.

   'c' 17253540     'S'   432387     '8'     5960
   'C' 14942972     'l'   373194     'i'     3307
   ')'  7687638     '/'   342041     '9'     2354
   '('  7687638     's'   164740     '0'     1457
   'O'  5116146     'o'   161862     'e'     1421
   '1'  4388585     '+'   137743     'K'     1280
   '2'  3327339     '5'   119946     'L'      442
   '='  3311807     '.'    98163     'A'      287
   'N'  2949791     '#'    85482     'Z'      147
   '@'  2241799     'B'    82464     'b'      122
   '['  1952216     'r'    79603     'g'       87
   ']'  1952216    '\\'    78388     'M'       64
   'n'  1707728     'P'    42036     'T'       48
   '3'  1597703     '6'    29852     'p'       37
   'H'  1514805     'I'    15180     't'       21
   '-'   521529     '7'    11771     'R'       12
   'F'   497963     'a'    11311     'V'        8
   '4'   485537     '%'     6450     'X'        3

and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:

  'cc' 7981192   ')N'  998239   '2C'  458454   'N1'  245581   ... many lines omitted ...
  'CC' 3965060   ']('  943201   ')O'  456391   'n1'  228523   '91'       1
  'O)' 3141345  '1\n'  892805   'F)'  456060   'O='  228415   '#S'       1
  'C(' 2841639   'NC'  850938   'C2'  426474   '3C'  223729   'lS'       1
  'c1' 2792900   '2)'  793573   'N('  413951   'C='  217307   '03'       1
  '(C' 2596059   '(O'  779762   '3)'  410549  'O\n'  211774  'P\\'       1
  '=O' 2496004   'Cc'  764483   '1)'  406001   '/C'  207957   'PN'       1
  ')c' 2489724   'OC'  764313   'Cl'  373140   '[n'  205910   'PO'       1
  'c(' 2444133   'CN'  744679   '-c'  362109   ']2'  203805   'K]'       1
  '(=' 2296603   '@@'  732946   'Oc'  358157   'C3'  202290   'p3'       1
  'c2' 2105356   ')['  705999   ')n'  353823   ')='  196635   'I1'       1
  ')C' 1692206   'nc'  699014   'cn'  345084   'S('  196491   'i@'       1
  '[C' 1513171   'CO'  675894   '(F'  344658   '2n'  188360   'II'       1
  'H]' 1512910   '1C'  660311   '(c'  344068   'n2'  180310   'i-'       1
  'C@' 1507657   '=C'  611606   'N)'  332280   'nH'  177385   ']I'       1
  'C)' 1489856   '3c'  598751   ']1'  308170   '1n'  174940   'Bc'       1
  '1c' 1463358   '(N'  595762   'Nc'  298653   '@]'  174511   'Bi'       1
  '@H' 1333765   'C1'  588362   'c4'  291931   '4c'  173949   'B.'       1
  '2c' 1204003   ')('  504595   '(-'  290801   '1O'  173802   'H7'       1
  'c3' 1031134   'C['  478529   'l)'  289924   'N['  170445   '.n'       1
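For readers who want to produce tables like the two above from their own data, here is a minimal sketch of the counting. It assumes the input is simply a list of SMILES strings; the function names and the three sample molecules are illustrative, not taken from the original data set. It counts raw characters rather than chemical atoms (so 'Cl' contributes a 'C' and an 'l'), which matches how the tables above are laid out, and it appends "\n" to each string as the end-of-word marker.

```python
from collections import Counter


def symbol_counts(smiles_list):
    """Count every character across all SMILES strings."""
    counts = Counter()
    for smiles in smiles_list:
        counts.update(smiles)
    return counts


def bigram_counts(smiles_list):
    """Count adjacent character pairs, with "\n" marking end of word."""
    counts = Counter()
    for smiles in smiles_list:
        word = smiles + "\n"
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


# Illustrative input only: ethanol, benzene, acetic acid.
examples = ["CCO", "c1ccccc1", "CC(=O)O"]

if __name__ == "__main__":
    # repr() shows "\n" and "\\" escaped, as in the tables above.
    for symbol, count in symbol_counts(examples).most_common():
        print(f"{symbol!r:>5} {count:8d}")
    for bigram, count in bigram_counts(examples).most_common(5):
        print(f"{bigram!r:>6} {count:8d}")
```

Running the same two functions over the full data set is what would be needed to compare against the counts listed above.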

Not something I would think is particularly interesting for an HN post.

TheLoneWolfling 3967 days ago

If it's not confidential (and I am assuming it isn't), why not just link it in a gist or something? That way other people can also take a crack at it.

Among other things, "a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar" is something I would be interested in. And, considering the upvotes I got for my grandparent comment, something that other people would be interested in as well.

Also: http://hnnotify.com/

dalke 3967 days ago

I do not like "a gist or something" and have used such services only a handful of times. I dislike how they decontextualize the conversation and how they require trust in an additional resource. Eg, when I come across a gist during a web search, it's hard to figure out the point.

By comparison, an email provides the full context, and is easier to integrate into a workflow. For example, I can drag an attachment directly into my editor. A gist requires additional steps.

Regarding hnnotify.com, I enjoy the ability to let go of most HN threads after a couple of days. This thread is one of a handful of exceptions. Can I really subscribe to one-and-only-one thread? I don't see that it's worthwhile to set up a third-party account and activate the service for a rare event. In any case, if it takes a month for bmh100 to evaluate the code then the HN thread will be closed, so there's only a narrow window in which this service is useful.

I do not share your optimism in the random contributions of others. To start, it's not like I haven't talked about this before. See https://bitbucket.org/dalke/smilez and http://www.dalkescientific.com/writings/diary/archive/2007/0... (under "Compressing SMILES") for two examples. Have I gotten any feedback about them? No. So why put more effort into hoping for a one-in-a-million event, which is what you suggest, instead of optimizing the chance of getting a followup from someone who specifically expressed interest? Experience says that I should optimize for the latter.

What is your interest in the SMILES notation that can't be resolved through https://en.wikipedia.org/wiki/Simplified_molecular-input_lin... ? I would be glad to tell you more. I have worked with different aspects of SMILES for over 15 years and co-authored the OpenSMILES specification. I have also written many blog posts about different aspects of how to work with SMILES. And gotten few followups.

What skill set do you have, that I might tailor a response? Are you comfortable installing from source, do you prefer one of the GNU/Linux packaging systems, or Mac/homebrew? Or are you happiest with extracting data from a database dump? My 'synopsis .. of possible alternatives' was more an offer to follow up on any of those options, but was of itself incomplete. It works because email has the implied statement that I will respond to further questions.

If you don't have a specific interest and more generically want to be informed, then perhaps you can understand why I would prefer to use other mechanisms, like my blog posts, which are more likely to get the kinds of responses I'm looking for than spending time tuning an off-topic HN comment.

TheLoneWolfling 3967 days ago

I guess the difference is:

I consider the possibility of a random person coming across something and finding it interesting a worthy goal in and of itself. You do not.

This is a rather fundamental difference, and as such I do not think that anything I say will reconcile the matter.

dalke 3966 days ago

My analysis is that there are two classes of random people, while your analysis has only one. Class 1 is "random person coming across a page" and class 2 is "random person who came across a page and expressed interest in possibly working on the problem." Both can become a member of the desired goal, that is, someone who contributes concrete help.

Experience says that the odds for both categories are low. Perhaps there's a 1:10,000 chance for a member of class 1, and a 1:500 chance for a member of class 2.

If I do as you suggest, I might raise that to 10:10,000 and 10:500.

However, my belief is that directed email has a higher stickiness, because of the reasons I mentioned earlier. I believe those statistics become 1:10,000 (ie, unchanged) and 15:500, respectively.

If you work the math out, you'll see that it's overall better to send the directed email.
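The comparison can be written out as a small expected-value calculation. This is only a sketch: the odds are the ones quoted above, but the comment does not say how many people fall into each class, so `n1` and `n2` below are hypothetical inputs and the function names are illustrative. Under these numbers the outcome depends on the ratio of class 1 readers to class 2 readers.

```python
from fractions import Fraction

# Per-person odds quoted in the comment, by strategy and class.
PUBLIC = {"class1": Fraction(10, 10_000), "class2": Fraction(10, 500)}
EMAIL = {"class1": Fraction(1, 10_000), "class2": Fraction(15, 500)}


def expected_contributors(odds, n1, n2):
    """Expected number of contributors given n1 class-1 and n2 class-2 people.

    n1 and n2 are assumptions; the comment does not give them.
    """
    return n1 * odds["class1"] + n2 * odds["class2"]


def break_even_ratio():
    """Class-1 readers per class-2 reader at which both strategies tie."""
    gain_class2 = EMAIL["class2"] - PUBLIC["class2"]  # 5/500
    loss_class1 = PUBLIC["class1"] - EMAIL["class1"]  # 9/10,000
    return gain_class2 / loss_class1


if __name__ == "__main__":
    print("break-even n1/n2:", break_even_ratio())
    for n1 in (5, 10, 20):  # hypothetical audience sizes, with n2 = 1
        email = expected_contributors(EMAIL, n1, 1)
        public = expected_contributors(PUBLIC, n1, 1)
        print(n1, float(email), float(public))
```

With the quoted odds, directed email comes out ahead whenever the number of class 1 readers per class 2 reader is below the break-even ratio, and the public post comes out ahead above it. The work factor discussed later is not part of this calculation.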

Another option is to do both, which you'll see is what I did for the question that was asked. Your complaint is that I shouldn't have sent additional information in private mail, which is odd given that HN's own guidelines suggest that there are HN-related questions that are inappropriate to post and should instead be done by email.

You have also stated that I do not "consider the possibility of a random person coming across something and finding it interesting". This simply isn't true, as you can tell from the analysis above and from the two pages I linked to, where I have posted information meant for random strangers to hopefully find.

You've come across like you are irritated at being left out of the conversation. I've suggested a few topics I could discuss, but how can I say more when you haven't expressed any specific interest about the problem (either on SMILES or short word compression)? When writing, it's good to have a target audience in mind. Should I assume a basic understanding of arithmetic compression, or start from the basics? Do I need to explain state machines? And so on.

My above analysis left out the work factor. Rather than write 40 different essays, each aimed at a different set of strangers (chemist background, CS background, math background, web dev background, etc.) and with at best a 1:100 chance of success, it's a better use of my time to just work on the code. I believe I could do what I want in about 2 months.