|
|
|
|
|
by semi-extrinsic
3565 days ago
|
|
Well, this is actually covered in the accompanying blogpost (link in comments below), and he makes a salient point: "At the same time, it is worth understanding which of these features are boons, and which are the tail wagging the dog. We go to EC2 because it is too expensive to meet the hardware requirements of these systems locally, and fault-tolerance is only important because we have involved so many machines." Implicitly: the features you mention are only fixes introduced to solve problems that were caused by the chosen approach in the first place. |
|
The chosen approach is the only choice! There is a reason why smart people at thousands of companies use Hadoop. Fault-tolerance and Multi-user support are not mere externalities of the chosen approach but fundamental to performing data science in any organization.
Before you further comment, I highly highly encourage you to get a "Real world" experience in data science by working at a large or even medium sized company. You will realize that outside of trading engines, "faster" is typically the third or fourth most important concern. For data and computed results to be used across organization, they need to stored centrally, similarly hadoop allows you to centralize not only data but also computations. When you take this into account, it does not matter how "Fast" command line tools are on your own laptop. Since now your speed, is determined by the slowest link, which is data transfer over the network.