Hacker News
by aheifets 4116 days ago
Thank you for the kind wishes!

Over the past few years, there's been a huge increase in the amount of data available for this kind of machine learning. We curate our data from a number of private and public sources. For example, as part of my doctoral work (http://en.wikipedia.org/wiki/SCRIPDB), I learned how to parse chemical information out of U.S. Patent data, which is public domain. That said, if you're interested in working on something like this and need a quick million data points, I'd point you to PubChem as a first step: https://pubchem.ncbi.nlm.nih.gov/