|
Having worked at a large company and extensively used their Hadoop Cluster, I could not agree more with you. The author of the blogpost/article, completely misses the point. The goal with Hadoop is not minimizing the lower bound on time taken to finish the job but rather maximizing disk read throughput while supporting fault tolerance, failure recovery, elasticity, and the massive ecosystem of aggregations, data types and external integrations as you noted. Hadoop has enabled Hive, Presto and Spark. The author completely forgets that the data needs to transferred in from some network storage and the results need to be written back! For any non-trivial organization ( > 5 users), you cannot expect all of them to SSH into a single machine. It would be an instant nightmare. This article is essentially saying "I can directly write to a file in a local file system faster than to a database cluster", hence the entire DB ecosystem is hyped! Finally Hadoop is not a monolithic piece of software but an ecosystem of tools and storage engine. E.g. consider Presto, software developers at Facebook realized the exact problem outlined in the blogpost but instead of hacking bash scripts and command line tools, they built Presto. Which essentially performs similar functions on top of HDFS. Because of the way it works Presto is actually faster than "command line" tools suggested in this post. https://prestodb.io/ |
Why not?
I can do exactly this (KDB warning). Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.
> The author of the blogpost/article, completely … The author completely … [Here's my opinions about what Hadoop really is]
This is a very real data-volume with two realistic solutions, and thinking that every problem is a nail because you're so invested in your hammer is one of the things that causes people to wait for 26 minutes instead of expecting the problem to take 12 seconds.
And it gets worse: Terrabyte-ram machines are accessible to the kinds of companies that have petabytes of data and the business case for it, and managing the infrastructure and network bandwidth is bigger time sink than you think (or are willing to admit).
If I see value in Hadoop, you might think I'm splitting hairs, so let me be clear: I think Hadoop is a cancerous tumour that has led many smart people to do very very stupid things. It's slow, it's expensive, it's difficult to use, and investing in tooling is just throwing good money after bad.