|
As an ML researcher, this article isn't persuasive to me for a few reasons:
- Computing power is getting exponentially cheaper even as computing requirements increase. The resources available to a university lab in the future will be much greater than those available today, even given the same budget. Of course this is also true for industry, but this growth is not a unique advantage of industry.
- Other scientific fields already have equipment costs that are orders of magnitude larger than CS. Physicists regularly write grant proposals for multimillion-dollar pieces of equipment. If building large clusters is necessary for academic research to stay relevant, academics will start building large clusters. The foundational work done at Bell, IBM, Xerox, etc. in the 70s and 80s was not due to resource constraints in academia (academics had expensive computers too, and also did plenty of good work during that time); it was because those companies had the right combination of smart people and an immediate need to find practical solutions to difficult problems.
- Finally, and most importantly, even in the age of big data almost all fundamental research can be done quite successfully at small scales with modest hardware requirements. Notice that Hinton et al. have spent 6+ years developing deep learning in academia, and it's only in the past couple of years that it's matured to the point of implementation at scale.
Here's the basic pipeline of most machine learning research: you come up with a new approach for training SVMs, or multilayer perceptrons, or some new and more interesting type of model. First you develop your ideas conceptually, with some equations on a whiteboard. If you're a theorist, you might prove some theorems. Next you write a toy implementation in Matlab or Python to show that your method actually works, and that you get an improvement over previous work for the dataset size you're using.
This could mean that your method is faster -- which indicates it'll be able to scale to bigger data -- or that it's smarter / taking advantage of some new type of structure, in which case it still ought to get decent (if not state-of-the-art) results on small data. Only then, usually after publishing a few papers and working out the kinks, does it generally make sense to put in the effort to implement and test a big, efficient distributed version of your algorithm. And while that last part might be best done by industry, the first few steps are easily possible in academia and will continue to be for the foreseeable future.
Case in point: Google Translate is a massive system whose performance rests squarely on exploiting big data, in that they use the Internet as their training set. But academic machine translation research still runs quite effectively with smaller datasets on small clusters. The academics come up with ideas, implement and test them, and some ideas flop while others take off. The ideas that take off get picked up by Google and implemented in Translate, where they hopefully end up pushing the envelope. So even though the academics don't have the resources to work at massive scale (which most of them don't want to do anyway -- ML researchers are usually more interested in ML than in building distributed systems), their research still has impact, through transfer to industry. This sort of relationship has been the model for academic/industry research collaboration for quite a while, and I don't think it's dead yet. |
I've got no formal CS training, and if I get funding for jkl.io the objective is to have (most of) a Google News (English) competitor implemented in a year, part-time. Google has thousands of ML employees, but there are three million users on GitHub. If I need facial recognition, it's on GitHub. Topic modelling to layer on top of my NLP, or to aid in entity resolution: on GitHub. Crawlers, got it. Next-gen databases (http://hyperdex.org/), got it. The jkl.io site is only just over 1000 lines of code written by me at the moment, but it probably uses tens of thousands from just the Python libraries before we even talk about the DB and the OS.
The more people understand the filter bubble and information diet concepts, the more personalisation will be a thing only for side interests and friendship networks. I don't think people want black-box, advertising-oriented algorithms manipulating their political and economic news. The computation I need is therefore much smaller and cheaper. I know it's not HN's focus, because people want their exit money, but donation models, as Wikipedia beating Knol shows, can actually be the most efficient solution in many domains where you can't trust a corporation with a fiduciary duty to maximize shareholder profit.
People might say "but what about really huge data like location services using not just GPS, but mobile data and wifi response times, pictures from Google's new alt-reality game and street view"; they might say "Google just can't be caught up to" and point to the failure of Apple's maps. But I worked with some guys who scaled a solution using SIFT features => Lucene that could geo-locate instantly on massive datasets of images. You can prove an algorithm can scale theoretically without having 10,000 machines to run it on. One of the key points separating computer science from just programming is the analysis of algorithms in theoretical terms. Apple's failure happened because they are primarily a luxury product company, not an ML company, but people just think "technology". Even so, Apple can get stuff done, or buy companies that can (Siri). Microsoft, Yandex, Yahoo, Amazon, huge rising data powers in Asia, thousands of computer science professors, tens of thousands of postdocs and doctoral students, and millions of GitHub tinkerers are not going to fall behind. Google isn't even the major search engine in a lot of countries.
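To make the SIFT => Lucene point a bit more concrete, here's a rough sketch of the general idea. It is not the system those guys built. It assumes the image descriptors have already been quantized into integer "visual word" ids (my assumption, that step isn't shown), and it uses a plain Python dict as a stand-in for Lucene's inverted index. The names `build_index` and `locate` and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def build_index(images):
    """Build an inverted index: visual word id -> set of image ids containing it."""
    index = defaultdict(set)
    for image_id, words in images.items():
        for word in words:
            index[word].add(image_id)
    return index


def locate(index, query_words, top_k=1):
    """Rank indexed images by how many distinct visual words they share with the query."""
    votes = Counter()
    for word in set(query_words):
        # Only images that actually contain this word are touched.
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common(top_k)


# Hypothetical data: each geo-tagged image is a list of visual word ids.
images = {
    "eiffel_tower": [1, 2, 3, 4],
    "golden_gate": [3, 5, 6],
    "louvre": [7, 8, 9],
}
index = build_index(images)
best = locate(index, [1, 2, 3, 99])
print(best)  # [('eiffel_tower', 3)]
```

The work done per query depends on how many images share a word with the query, not on the total number of images in the collection. That's the kind of scaling argument you can make on paper and check on a laptop before anyone buys a cluster.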