| HN Mirror



	by pavitheran 1860 days ago
	What should be used instead of Hadoop DFS in 2021?

2 comments

throwaway7783 1860 days ago

Oh, these cloud kids...

Edit: instant downvotes. Okay, S3/GCS/Azure is the typical answer (egress costs be damned)


StreamBright 1860 days ago

I guess the same was said about those non-mainframe kids 30 years ago. Tech gets cheaper and better. You can rent CPU time instead of buying a server. It is just that simple.


notyourday 1859 days ago

And overpay by orders of magnitude on a 5-year depreciation schedule. And the amount of fun you are going to have getting charged for the S3 API calls and pulling that data out of S3 for processing would make any reasonable CFO's head spin 360 degrees every minute.


ppf 1859 days ago

You mean, like a mainframe? Tech goes round in circles, more like.


StreamBright 1860 days ago

Depends on multiple factors:

- S3 or compatible would be the trivial choice for storage

- if on-prem is a must there are multiple options, generally something with erasure codes (it is a game changer for storage)
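
For a rough sense of why erasure codes change the storage math, here is a minimal sketch. It assumes a layout of k data blocks plus m parity blocks and compares the raw bytes stored against plain n-way replication. The function names and the 6+3 example are illustrative assumptions, not taken from any product mentioned in this thread. The XOR parity part is the simplest possible erasure code and survives exactly one lost block; production systems typically use Reed-Solomon style codes that survive several.

```python
from functools import reduce
from typing import Sequence


def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with plain n-way replication."""
    return float(copies)


def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte with k data + m parity blocks."""
    return (data_blocks + parity_blocks) / data_blocks


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Storage cost: 3 full copies versus a hypothetical 6+3 layout.
    print("3x replication:", replication_overhead(3))
    print("6+3 erasure code:", erasure_overhead(6, 3))

    # Single-parity demo: lose one data block, rebuild it from the rest.
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)
    rebuilt = xor_blocks([data[0], data[2], parity])  # data[1] is "lost"
    print("rebuilt block matches:", rebuilt == data[1])
```

With 3 copies you store 3.0 raw bytes per logical byte; with a 6+3 layout you store 1.5, which is where the cost difference comes from.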

So far I have been using enterprise storage (which has some potential problems when mounted as NFS volumes); it works for petabytes and already decouples storage from compute.

More recently I was experimenting with MinIO. No conclusion so far.

The problems are with Hadoop:

- unfortunate design choices (namenode??)

- extremely unfortunate implementation (I probably spent more time in the Hadoop codebase than any other, found many bugs, some I could fix, most I couldn’t)

I think I have migrated 10 PB worth of data infra away from Hadoop in the last 5 years, mostly to AWS, some to Azure. The average cost saving is between 10% and 30% YoY.

Some comments point out the network cost. The reality is most companies collect a giant amount of data (ingress) and publish dashboards (egress). It makes cloud pretty viable.

S3 is beating the shit out of HDFS in reliability and cost, even though most Hadoop shops spread the FUD that it is slow. The same way these companies used to spread the FUD that Snappy is best for data compression.

As of 2021 even the latest adopters (banks and insurance companies) use cloud. Maybe extremely few dogmatic companies remain in the on-prem crowd. Even those will eventually give up.


bogomipz 1859 days ago

It's quite odd that you're calling out Uber for using HDFS because it's so "2014", legacy and inefficient, and yet your solution involves NFS and an enterprise storage vendor. Do you not see any irony in that? I think that many would argue your solution is far more legacy. It's also far more inefficient from a cost perspective. I have yet to see a storage vendor who could match the price of commodity hardware.


StreamBright 1859 days ago

I agree, it is quite ironic that I can beat HDFS with a SAN + NFS mounts.


sershe 1859 days ago

Did S3 really become that much faster over the years? I was optimizing a cache that worked on top of HDFS and cloud FSes a few years ago, and I remember making slides to present the improvements. For HDFS I had this complex slide that tried to show the cache was actually achieving some fractional improvement at some point. For the unnamed cloud FS, I just had to make a 2-bar chart that looked like the cloud FS is giving you a middle finger sideways... A really tall bar for the time it took to read data off the cloud FS. A really small bar with the cache.


junon 1859 days ago

Uber has its own datacenters. They've self-hosted for quite a while now, with few exceptions.

AWS is incredibly expensive in comparison. Uber is /not/ a small company technology-wise, for better or for worse.


commandlinefan 1859 days ago

> extremely unfortunate implementation (I probably spent more time in the Hadoop codebase

Well in fairness, have you ever seen the S3 codebase? I mean honestly it could be a fork of HDFS for all we know.


StreamBright 1859 days ago

I used to work for Amazon. The code quality at places like Google and Amazon tends to be good.

S3 has a really good architecture and a great implementation.

HDFS has a meh architecture with a bad implementation.

There were obvious signs. I remember when Twitter decided to investigate why HDFS was slow, and they figured out some details about how the Hadoop guys had decided to implement their own dictionary for configuration, one with a much worse time complexity than the default dictionary in Java. There might be a video about this somewhere.
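
To illustrate the general kind of problem (this is not the Hadoop code, just a minimal sketch under assumptions): a hand-rolled key/value store that scans a list does O(n) work per lookup, while a hash map does O(1) on average. The class name `ListBackedConfig` and the `comparisons` counter are hypothetical, and Python's built-in `dict` stands in for Java's default hash map.

```python
class ListBackedConfig:
    """Hypothetical config store that scans a list of pairs on every lookup."""

    def __init__(self):
        self._pairs = []       # list of (key, value) tuples
        self.comparisons = 0   # key comparisons made by get(), for illustration

    def set(self, key, value):
        # Overwrite an existing key if present, otherwise append.
        for i, (existing_key, _) in enumerate(self._pairs):
            if existing_key == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))

    def get(self, key, default=None):
        # Linear scan: cost grows with the number of stored keys.
        for existing_key, value in self._pairs:
            self.comparisons += 1
            if existing_key == key:
                return value
        return default


if __name__ == "__main__":
    slow = ListBackedConfig()
    fast = {}  # hash map: average O(1) lookup regardless of size
    for i in range(1000):
        slow.set(f"key{i}", i)
        fast[f"key{i}"] = i

    slow.get("key999")
    print("comparisons for one list-backed lookup:", slow.comparisons)
    print("dict lookup result:", fast["key999"])
```

Configuration lookups happen constantly in a system like this, so a linear scan per lookup adds up quickly.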

And there are more things like that. I used to have 5-10 year old HDFS Jira tickets open. I just gave up.

Here is a video:

https://www.youtube.com/watch?v=jupArYWxoq0

Hadoop is full of these things.

One more thing:

https://lamport.azurewebsites.net/tla/formal-methods-amazon....

I would love to see a similar approach applied to Hadoop.


geoduck14 1859 days ago

You mention that you have migrated to both AWS and Azure.

I've seen distributed file systems on S3 - can it also be done with Azure Blob storage?
