| I feel like every time something like this comes up, people completely skip over the benefit of having as many of your data processing jobs in one ecosystem as possible. Many of our jobs operate on low TBs and growing, but even if the data for a job is super small I'll write it in Hadoop (Spark these days) so that the build, deployment, and scheduling of the job are handled for free by our current system. Sure, spending more time listing files on S3 at startup than running the job is a waste, but it is far less of a waste than the man hours needed to build and maintain a custom data transformation. The main benefit of these tools is not the scale or processing speed, though. The main benefits are the fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations provided by the community. |
The author of the blog post completely misses the point. The goal with Hadoop is not to minimize the lower bound on the time taken to finish a job, but to maximize disk read throughput while supporting fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations, as you noted. Hadoop has enabled Hive, Presto and Spark.
The author completely forgets that the data needs to be transferred in from some network storage and the results need to be written back! In any non-trivial organization (> 5 users), you cannot expect everyone to SSH into a single machine. It would be an instant nightmare. The article is essentially saying "I can write directly to a file in a local file system faster than to a database cluster, hence the entire DB ecosystem is hype!"
Finally, Hadoop is not a monolithic piece of software but an ecosystem of tools and storage engines. Consider Presto: software developers at Facebook recognized the exact problem outlined in the blog post, but instead of hacking together bash scripts and command-line tools, they built Presto, which essentially performs similar functions on top of HDFS. Because of the way it works, Presto is actually faster than the "command line" tools suggested in the post.
https://prestodb.io/