|
The email I sent was 178K long, with 1000 real-world examples (to give an idea of the character distribution) and the .h file model generated by shoco on the entire data set. Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requires less domain knowledge and more CS experience.) Since this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to respond. Since HN doesn't have notifications, how long should I actively check this thread for replies? By sending email, I also invite a response after a couple of months, should that be the case. (Yesterday I got a followup on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN. You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel free to ask interesting questions. |
Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.
and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:
Not something I would think is particularly interesting for an HN post.
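For anyone who wants to produce the same kind of tallies on their own data, here is a minimal Python sketch. It is not the script I used. It assumes a "symbol" means a single character (so "Cl" counts as "C" and "l"), the function name `symbol_and_bigram_counts` is made up for illustration, and the three SMILES strings are just well-known small molecules standing in for the real data set.

```python
from collections import Counter


def symbol_and_bigram_counts(words):
    # Hypothetical helper: counts single characters and adjacent character pairs.
    # Assumption: a "symbol" is one character, not a multi-character SMILES token.
    symbols = Counter()
    bigrams = Counter()
    for word in words:
        symbols.update(word)
        # Append "\n" so the last bigram of each word marks "end of word".
        terminated = word + "\n"
        bigrams.update(a + b for a, b in zip(terminated, terminated[1:]))
    return symbols, bigrams


# Stand-in data: ethanol, benzene, acetic acid.
sample = ["CCO", "c1ccccc1", "CC(=O)O"]
symbols, bigrams = symbol_and_bigram_counts(sample)

for sym, n in symbols.most_common():
    print(repr(sym), n)
for pair, n in bigrams.most_common(5):
    print(repr(pair), n)
```

Swapping `sample` for the lines of a real SMILES file gives the per-symbol counts and the ranked bigram list; `most_common()` with no argument returns every entry, so the tail of the list is available too.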