by hamner 5514 days ago
The argument that important ML algorithms should be highly scalable (O(log N), O(N), O(N log N)) holds in fields that are rich in "big data," with millions to trillions of data points.

However, there are also many fields where acquiring a large sample (more than hundreds to thousands of data points) is infeasible. This is especially relevant in medicine and biology. Many applications are constrained by small sample sizes and may have a feature count that is orders of magnitude larger than the sample count. Examples include fMRI studies and gene expression studies. Don't dismiss research on methods with superlinear running time (such as SVMs and many graphical models) as impractical for real-world applications, because these methods are used heavily in certain fields.
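
To make the small-sample point concrete, here is a rough sketch (not any particular SVM implementation) of the linear-kernel Gram matrix that kernel methods work with. The sizes, the random data, and the helper name `linear_gram` are made up for illustration. The matrix is n x n no matter how many features there are, so a method whose cost grows roughly quadratically to cubically in the sample count stays cheap when n is in the tens or hundreds.

```python
import random


def linear_gram(X):
    """Return the n x n matrix of inner products between the rows of X."""
    n = len(X)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The matrix is symmetric, so each pair is computed only once.
            G[i][j] = G[j][i] = sum(a * b for a, b in zip(X[i], X[j]))
    return G


rng = random.Random(0)
# Hypothetical sizes: far more features than samples, as in gene expression data.
n_samples, n_features = 20, 5000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]

G = linear_gram(X)
print(len(G), len(G[0]))  # 20 20
```

Building the matrix costs on the order of n^2 * p multiplications, which is linear in the feature count p; only the sample count n enters superlinearly.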

3 comments

My impression was that the OP didn't say superlinear algorithms are somehow useless; merely that there are reasons why the linear (or better) ones can be used in much more general settings, which is what makes them "big impact".
Experimental settings like these are also interesting because they provide opportunities for the application of active learning.

For example, if we wish to learn the properties of some family of new materials, we may have to choose which particular elements of the family to synthesise before we can begin measuring anything. Even if we can only afford to take 100 samples or fewer, the synthesis procedure might have a dozen parameters or more.

We then have the problem of sensibly selecting a small number of samples from a relatively high-dimensional space. In this situation it could be very easy to justify some serious computational effort if it could potentially save months of wasted effort in the lab.
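
As a rough illustration of the selection problem, here is a sketch of greedy maximin (farthest-point) selection, which spreads a fixed budget of samples across a candidate pool. This is a space-filling stand-in rather than active learning proper, which would also use model uncertainty from the measurements taken so far. The pool size, the 12 parameters scaled to [0, 1], and the function name `greedy_maximin` are all assumptions made up for the example.

```python
import math
import random


def greedy_maximin(candidates, k, start=0):
    """Pick k indices by repeatedly taking the candidate farthest from those already chosen."""
    if not 0 < k <= len(candidates):
        raise ValueError("k must be between 1 and len(candidates)")
    chosen = [start]
    # min_dist[i] is the distance from candidate i to its nearest chosen point.
    min_dist = [math.dist(c, candidates[start]) for c in candidates]
    while len(chosen) < k:
        nxt = max(range(len(candidates)), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Only the newly chosen point can shrink a candidate's nearest distance.
        for i, c in enumerate(candidates):
            d = math.dist(c, candidates[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


rng = random.Random(0)
# Hypothetical pool: 2000 candidate recipes, each with 12 synthesis parameters.
pool = [[rng.random() for _ in range(12)] for _ in range(2000)]
picked = greedy_maximin(pool, k=100)
```

Selecting k points from N candidates in d dimensions costs about k * N * d operations here, which is negligible next to the cost of a single synthesis run.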

Agreed, and I'd add that I found the title to be misleading. It's one researcher's take on what's important, but of necessity it's constrained by that person's interests.

Yahoo cares about big data, but not everyone is in that domain.