by davmre 4844 days ago
As an ML researcher, this article isn't persuasive to me for a few reasons:

- Computing power is getting exponentially cheaper even as computing requirements increase. The resources available to a university lab in the future will be much greater than those available today, even given the same budget. Of course this is also true for industry, but this growth is not a unique advantage of industry.

- Other scientific fields already have equipment costs that are orders of magnitude larger than CS. Physicists regularly write grant proposals for multimillion-dollar pieces of equipment. If building large clusters is necessary for academic research to stay relevant, academics will start building large clusters. The foundational work done at Bell, IBM, Xerox, etc in the 70s and 80s was not due to resource constraints in academia (academics had expensive computers too, and also did plenty of good work during that time), it was because those companies had the right combination of smart people and an immediate need to find practical solutions to difficult problems.

- Finally, and most importantly, even in the age of big data almost all fundamental research can be done quite successfully at small scales with modest hardware requirements. Notice that Hinton et al. have spent 6+ years developing deep learning in academia, and it's only in the past couple of years that it's matured to the point of implementation at scale.

Here's the basic pipeline of most machine learning research: you come up with a new approach for training SVMs, or multilayer perceptrons, or some new type of more interesting model. First you develop your ideas conceptually, with some equations on a whiteboard. If you're a theorist, you might prove some theorems. Next you write a toy implementation in Matlab or Python to show that your method actually works, and that you get improvement over previous work for the dataset size you're using. This could mean that your method is faster -- which indicates it'll be able to scale to bigger data -- or that it's smarter / taking advantage of some new type of structure, in which case it still ought to get decent (if not state-of-the-art) results on small data. Only then, usually after publishing a few papers and working out the kinks, does it generally make sense to put in the effort to implement and test a big, efficient distributed version of your algorithm. And while that last part might be best done by industry, the first few steps are easily possible in academia and will continue to be for the foreseeable future.
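To make the "toy implementation" step concrete, here is a minimal sketch in Python (standard library only). Everything in it is an assumption for illustration: the synthetic dataset, the function names, and the choice of a perceptron through the origin as the "new method" with a majority-class guess standing in for "previous work". It only shows the shape of the small-scale comparison, not anyone's actual research code.

```python
import random


def make_data(n, rng, margin=0.1):
    """Synthetic 2-D points labelled by the sign of x0 + x1 (hypothetical task).

    Points closer than `margin` to the boundary are rejected so the two
    classes are linearly separable through the origin.
    """
    data = []
    while len(data) < n:
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        s = x[0] + x[1]
        if abs(s) < margin:
            continue
        data.append((x, 1 if s > 0 else -1))
    return data


def train_perceptron(train, epochs=50):
    """Classic perceptron updates, no bias term; stops early once a pass is clean."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        mistakes = 0
        for x, y in train:
            if y * (w[0] * x[0] + w[1] * x[1]) <= 0:
                w[0] += y * x[0]
                w[1] += y * x[1]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def accuracy(predict, data):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(1 for x, y in data if predict(x) == y) / len(data)


def run_experiment(seed=0, n_train=200, n_test=200):
    """Train on a small set, then compare against the baseline on held-out data."""
    rng = random.Random(seed)
    train = make_data(n_train, rng)
    test = make_data(n_test, rng)

    # Baseline ("previous work" stand-in): always predict the majority class.
    majority = 1 if sum(y for _, y in train) >= 0 else -1
    baseline_acc = accuracy(lambda x: majority, test)

    # "New method" stand-in: the learned linear separator.
    w = train_perceptron(train)
    model_acc = accuracy(
        lambda x: 1 if w[0] * x[0] + w[1] * x[1] > 0 else -1, test
    )
    return baseline_acc, model_acc


baseline_acc, model_acc = run_experiment()
print(f"baseline: {baseline_acc:.2f}  perceptron: {model_acc:.2f}")
```

The whole thing runs in well under a second on a laptop, which is the point: this stage of the pipeline needs no cluster.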

Case in point: Google Translate is a massive system whose performance rests squarely on exploiting big data, in that they use the Internet as their training set. But academic machine translation research still runs quite effectively with smaller datasets on small clusters. The academics come up with ideas, implement and test them, and some ideas flop while others take off. The ideas that take off get picked up by Google and implemented into Translate, where they hopefully end up pushing the envelope. So even though the academics don't have the resources to work at massive scale (which most of them don't want to do anyway -- ML researchers are usually more interested in ML than in building distributed systems) their research still has impact, through transfer to industry. This sort of relationship has been the model for academic/industry research collaboration for quite a while, and I don't think it's dead yet.

3 comments

Having worked with a lot of ML guys who were ahead of Google on numerous fronts I have to agree. With Knol's death Google failed to control Wikipedia, arguably one of the more important ML datasets. People can fire up the common crawl on demand from Amazon. Anyone who thinks Google is the real bleeding edge just isn't browsing recent academic papers.

I've got no formal CS training and if I get funding for jkl.io the objective is to have (most of) a Google News (English) competitor implemented in a year, part-time. Google has thousands of ML employees but there are three million users on Github. If I need facial recognition, it's on Github. Topic modelling to layer on top of my NLP, or to aid in entity resolution, on Github. Crawlers, got it. Next gen databases (http://hyperdex.org/), got it. The jkl.io site is only just over 1000 lines of code written by me at the moment, but it probably uses tens of thousands from just the python libraries before we even talk about the DB and the OS.

The more people understand the filter bubble and the information diet concepts the more personalisation will be a thing only for side interests and friendship networks. I don't think people want black box advertising-oriented algorithms manipulating their political and economic news. The computation required for me is therefore so much smaller and cheaper. I know it's not HN's focus because people want their exit money but donation models, as Wikipedia beating Knol shows, can actually be the most efficient solution in many domains where you can't trust a corporation with a fiduciary duty to maximize shareholder profit.

People might say "but what about really huge data like location services using not just GPS, but mobile data and wifi response times, pictures from Google's new alt-reality game and street view"; they might say "Google just can't be caught up to" and point to the failure of Apple's maps. But I worked with some guys who scaled a solution using SIFT features => Lucene that could geo-locate instantly on massive datasets of images. You can prove an algorithm can scale theoretically without having 10,000 machines to run it on. One of the key points separating computer science from just programming is the analysis of algorithms in theoretical terms. Apple's failure was because they are primarily a luxury product company not an ML company but people just think "technology". Even so Apple can get stuff done, or buy companies that can (Siri). Microsoft, Yandex, Yahoo, Amazon, huge rising data powers in Asia, thousands of computer science professors, tens of thousands of post docs and doctoral students, millions of Github tinkerers are not going to fall behind. Google isn't even the major search engine in a lot of countries.

I attended a talk by Quoc Le at UCSD recently, and he made the case that it is necessary to get the algorithms tested at large scale, rather than spending too much time on them at small scale.

He presented a graph comparing some models and their accuracy as the number of features was scaled up to the tens of thousands, his point being that some models that work best with a smaller number of features fall off as the number is scaled up. Unfortunately the slides he has on his web page are outdated, so I haven't been able to find that reference. I'd be very happy if one of you knows which paper he was referring to. In the old slides he refers to this paper, which makes something of the same point: http://ai.stanford.edu/~ang/papers/nipsdlufl10-AnalysisSingl... It shows how simple unsupervised models with dense feature extraction reach the state-of-the-art performance of more complex models.

Of course, I can see how it makes sense to at least do some small scale prototyping, to work out kinks like you say - but the lesson is that if you are planning to do large scale machine learning you can't necessarily use the small scale tests as a good guide for large scale performance. It's certainly promising if you get very good accuracy, speed or both at small scale, though neither necessarily will carry over to large scale. On the flip side, if your method is worse than state-of-the-art at smaller scales, that doesn't mean it won't beat state-of-the-art at large scales.
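A back-of-the-envelope sketch of how that reversal can happen, in Python. The power-law-plus-floor error curve and every constant below are made up purely for illustration (they are not from the talk or from any paper); the only claim is that two curves of this shape can cross.

```python
def error(n, scale, rate, floor):
    """Hypothetical learning curve: error falls as a power law toward a floor."""
    return scale * n ** (-rate) + floor


# Illustrative parameters only, not measurements.
METHOD_A = dict(scale=1.0, rate=0.5, floor=0.10)  # learns fast, plateaus high
METHOD_B = dict(scale=3.0, rate=0.5, floor=0.02)  # learns slowly, plateaus low


def better_method(n):
    """Name of the method with the lower hypothetical error at dataset size n."""
    return "A" if error(n, **METHOD_A) < error(n, **METHOD_B) else "B"


def first_doubling_where_b_wins(start=1, limit=2 ** 40):
    """Double the dataset size until method B overtakes A; None if it never does."""
    n = start
    while n <= limit:
        if better_method(n) == "B":
            return n
        n *= 2
    return None


for n in (100, 10_000):
    print(n, better_method(n))
print("B first wins at n =", first_doubling_where_b_wins())
```

With these numbers a test at n = 100 would pick method A, and you would only learn that B is better by going well past that size.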

Data shows, as you say, that small scale performance is no indicator of large scale performance.

How then do you decide which projects are worth trying on the large scale?

Distributed systems building is not a precondition to big data ML. Most of those systems have been built and commoditized, to such an extent that the difference between having one and not boils down to a command line flag. I routinely run ML algos in local mode on my Mac on a small dataset. Once it's up to snuff, I turn off the --local flag, and it now runs on giant MR clusters over terabytes of data. I personally have not made any changes other than turning off the local flag.
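Roughly what that pattern looks like, as a sketch in Python using only argparse. The function names, the toy `train` step, and the cluster stub are all hypothetical; a real setup would hand the same training code to whatever MapReduce-style framework is in use, and nothing here submits an actual job.

```python
import argparse


def train(rows):
    """Stand-in for the real learning code: here, just the mean of the inputs."""
    return sum(rows) / len(rows)


def run_local(rows):
    """Single-process path for quick iteration on a small sample."""
    return {"mode": "local", "result": train(rows)}


def run_on_cluster(rows):
    """Hypothetical placeholder: a real job would be handed to the cluster
    framework here. Nothing is submitted, so there is no result."""
    return {"mode": "cluster", "result": None}


def build_parser():
    parser = argparse.ArgumentParser(description="Toy ML job launcher (sketch)")
    parser.add_argument(
        "--local",
        action="store_true",
        help="run in-process on a small sample instead of on the cluster",
    )
    return parser


def main(argv):
    """Pick a runner from the flag; the training code itself never changes."""
    args = build_parser().parse_args(argv)
    rows = [1.0, 2.0, 3.0, 4.0]  # stand-in for the small local dataset
    runner = run_local if args.local else run_on_cluster
    return runner(rows)


print(main(["--local"]))
print(main([]))
```

The only thing the flag changes is which runner gets called, which is why dropping it requires no edits to the algorithm.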
Sure, lots of existing ML algorithms have efficient big-data implementations. But for new algorithm development, my (admittedly limited) experience is that the Matlab-prototyping stage usually comes well before the implement-at-scale stage. You're right that modern tools effectively abstract out a lot of the difficulty of implementing at scale, but IMHO it's still generally not the first thing you'd want to do.