|
I think this is also a vary salient observation. It is always possible to construct really large systems, right up to mainframes. And if you're asked to do so it helps to take as wide a view on the problem as you can. So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines. Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business. The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective. What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system. |