by pavitheran 1860 days ago
What should be used instead of Hadoop DFS in 2021?
2 comments

Oh, these cloud kids...

Edit: instant downvotes. Okay, S3/GCS/Azure is the typical answer (egress costs be damned)

I guess the same was said about those non-mainframe kids 30 years ago. Tech gets cheaper and better. You can rent CPU time instead of buying a server. It is just that simple.
And overpay by orders of magnitude over a 5-year depreciation schedule. And the amount of fun you are going to have getting charged for the S3 API calls and for pulling that data out of S3 for processing would make any reasonable CFO's head spin 360 degrees every minute.
You mean, like a mainframe? Tech goes round in circles, more like.
Depends on multiple factors:

- S3 or compatible would be the trivial choice for storage

- if on-prem is a must there are multiple options, generally something with erasure codes (it is a game changer for storage)
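To put a rough number on the erasure-coding point, here is a minimal sketch comparing raw storage overhead. It assumes 3x replication (the HDFS default) against a Reed-Solomon layout with 6 data and 3 parity blocks; the function names are made up for illustration and the 10 PB figure is just an example input.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte with plain replication."""
    return float(replicas)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte with a (data, parity) erasure code."""
    return (data_blocks + parity_blocks) / data_blocks


def raw_petabytes(logical_pb: float, overhead: float) -> float:
    """Raw capacity needed to hold `logical_pb` of user data."""
    return logical_pb * overhead


if __name__ == "__main__":
    logical_pb = 10.0  # example input, not a measured figure
    print(raw_petabytes(logical_pb, replication_overhead(3)))  # 30.0
    print(raw_petabytes(logical_pb, erasure_overhead(6, 3)))   # 15.0
```

Under these assumptions the erasure-coded layout needs half the raw capacity, while tolerating the loss of any 3 blocks in a stripe versus 2 lost copies with 3x replication.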

So far I have been using enterprise storage (which has some potential problems when mounted as NFS volumes). It works for petabytes and already decouples storage from compute.

More recently I was experimenting with MinIO. No conclusion so far.

The problems are with Hadoop:

- unfortunate design choices (namenode??)

- extremely unfortunate implementation (I probably spent more time in the Hadoop codebase than any other, found many bugs, some I could fix, most I couldn’t)

I think I have migrated 10 PB worth of data infra away from Hadoop in the last 5 years, mostly to AWS, some to Azure. Average cost saving is between 10% and 30% YoY.

Some comments point out the network cost. The reality is most companies collect a giant amount of data (ingress) and publish dashboards (egress). It makes cloud pretty viable.
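The ingress-heavy versus egress-heavy argument can be sketched as a tiny cost model. Everything here is an assumption for illustration: the `Prices` fields, the idea that ingress is free, and the numbers in the example, which are placeholders and not taken from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    # Placeholder unit prices; plug in real ones from your provider.
    storage_per_gb_month: float
    egress_per_gb: float
    per_1000_requests: float


def monthly_cost(p: Prices, stored_gb: float, egress_gb: float, requests: int) -> float:
    """Monthly bill assuming ingress is free and only storage,
    egress traffic and API requests are charged."""
    return (
        stored_gb * p.storage_per_gb_month
        + egress_gb * p.egress_per_gb
        + (requests / 1000) * p.per_1000_requests
    )


if __name__ == "__main__":
    prices = Prices(storage_per_gb_month=0.5, egress_per_gb=0.25, per_1000_requests=0.125)
    # Dashboard-style workload: lots stored, very little read back out.
    print(monthly_cost(prices, stored_gb=1000, egress_gb=8, requests=2000))
    # Processing-outside-the-cloud workload: everything is pulled out again.
    print(monthly_cost(prices, stored_gb=1000, egress_gb=1000, requests=2000))
```

With the same stored volume, the bill is driven by how much data leaves the provider and how many requests are made, which is the crux of the disagreement in this thread.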

S3 is beating the shit out of HDFS in reliability and cost, even though most Hadoop shops spread the FUD that it is slow. The same way these companies used to spread the FUD that Snappy is the best for data compression.

As of 2021 even the late adopters (banks and insurance companies) use cloud. Maybe a few extremely dogmatic companies remain in the on-prem crowd. Even those will eventually give up.

It's quite odd that you're calling out Uber for using HDFS because it's so "2014", legacy and inefficient, and yet your solution involves NFS and an enterprise storage vendor? Do you not see any irony in that? I think that many would argue your solution is far more legacy. It's also far more inefficient from a cost perspective. I have yet to see a storage vendor who could match the price of commodity hardware.
I agree, it is quite ironic that I can beat HDFS with a SAN + NFS mounts.
Did S3 really become that much faster over the years? I was optimizing a cache that worked on top of HDFS and cloud FSes a few years ago, and I remember making slides to present the improvements. For HDFS I had this complex slide that tried to show the cache was actually achieving some fractional improvement at some point. For the unnamed cloud FS, I just had to make a 2-bar chart that looked like the cloud FS was giving you a middle finger sideways: a really tall bar for the time it took to read data off the cloud FS, and a really small bar with the cache.
Uber has its own datacenters. They've self-hosted for quite a while now, with few exceptions.

AWS is incredibly expensive in comparison. Uber is /not/ a small company technology-wise, for better or for worse.

> extremely unfortunate implementation (I probably spent more time in the Hadoop codebase

Well in fairness, have you ever seen the S3 codebase? I mean honestly it could be a fork of HDFS for all we know.

I used to work for Amazon. The code quality at places like Google and Amazon tends to be good.

S3 has a really good architecture and a great implementation.

HDFS has a meh architecture with a bad implementation.

There were obvious signs. I remember when Twitter decided to investigate why HDFS was slow, and they figured out that the Hadoop guys had implemented their own dictionary for configuration, with a much worse time complexity than the default dictionary in Java. There might be a video about this somewhere.
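As a generic illustration of that class of problem (this is not the Hadoop code, whose details are not given here), a hand-rolled key/value store backed by a list of pairs has to scan entries on every lookup, while a built-in hash map does not. `LinearConfig` and its `comparisons` counter are hypothetical names used only for this sketch.

```python
class LinearConfig:
    """Key/value store backed by a list of pairs: every lookup is a scan."""

    def __init__(self):
        self._items = []
        self.comparisons = 0  # key comparisons performed by get()

    def set(self, key, value):
        # Overwrite an existing key if present, otherwise append.
        for i, (k, _) in enumerate(self._items):
            if k == key:
                self._items[i] = (key, value)
                return
        self._items.append((key, value))

    def get(self, key):
        for k, v in self._items:
            self.comparisons += 1
            if k == key:
                return v
        raise KeyError(key)


if __name__ == "__main__":
    n = 1000
    linear = LinearConfig()
    hashed = {}
    for i in range(n):
        linear.set(f"key{i}", i)
        hashed[f"key{i}"] = i

    # Worst case for the list: the key inserted last.
    linear.get(f"key{n - 1}")
    print(linear.comparisons)    # n comparisons for a single lookup
    print(hashed[f"key{n - 1}"])  # hash lookup, no scan over the entries
```

The lookup cost of the list-backed version grows linearly with the number of entries, which adds up quickly when configuration is read in a hot path.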

And there are more things like that. I used to have 5-10 year old HDFS Jira tickets open. I just gave up.

Here is a video:

https://www.youtube.com/watch?v=jupArYWxoq0

Hadoop is full of these things.

One more thing:

https://lamport.azurewebsites.net/tla/formal-methods-amazon....

I would love to see a similar approach applied to Hadoop.

You mention that you have migrated to both AWS and Azure.

I've seen distributed file systems on S3 - can it also be done with Azure Blob storage?

Did you move the compute layer to AWS as well? Did you see similar savings there as well for non-burst payloads?
> works for petabytes

Per the article Uber has hundreds of petabytes of data.