| Depends on multiple factors:
- S3 or a compatible object store would be the trivial choice for storage
- if on-prem is a must, there are multiple options, generally something with erasure coding (it is a game changer for storage)

So far I have been using enterprise storage (which has some potential problems when mounted as NFS volumes). It works for petabytes and already decouples storage from compute. More recently I have been experimenting with MinIO. No conclusion so far.

The problems are with Hadoop:
- unfortunate design choices (namenode??)
- extremely unfortunate implementation (I have probably spent more time in the Hadoop codebase than in any other; I found many bugs, some I could fix, most I couldn't)

I think I have migrated 10 PB worth of data infra away from Hadoop in the last 5 years, mostly to AWS, some to Azure. The average cost saving is between 10% and 30% YoY.

Some comments point out the network cost. The reality is that most companies collect a giant amount of data (ingress) and publish dashboards (egress), so the outbound volume stays small. That makes cloud pretty viable. S3 is beating the shit out of HDFS in reliability and cost, even though most Hadoop shops spread the FUD that it is slow, the same way these companies used to spread the FUD that Snappy is the best for data compression.

As of 2021 even the last adopters (banks and insurance companies) use cloud. Maybe a few extremely dogmatic companies remain in the on-prem crowd. Even those will eventually give up. |
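To put a rough number on the erasure-coding point above, here is a minimal sketch comparing raw capacity under 3x replication (the HDFS default) with a Reed-Solomon layout of 6 data and 3 parity shards. The function names and the 6+3 layout are illustrative assumptions, not something the comment specifies; the 10 PB figure is simply the volume mentioned above.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data when keeping full copies."""
    return Fraction(copies, 1)


def erasure_overhead(data_shards: int, parity_shards: int) -> Fraction:
    """Raw bytes stored per byte of user data with k data + m parity shards."""
    return Fraction(data_shards + parity_shards, data_shards)


def raw_capacity_pb(logical_pb: int, overhead: Fraction) -> Fraction:
    """Raw capacity (PB) needed to hold logical_pb of user data."""
    return logical_pb * overhead


# Assumed example values: 10 PB of data, 3 copies vs. a 6+3 shard layout.
LOGICAL_PB = 10
replicated = raw_capacity_pb(LOGICAL_PB, replication_overhead(3))
erasure_coded = raw_capacity_pb(LOGICAL_PB, erasure_overhead(6, 3))

print(f"3x replication: {float(replicated):.1f} PB raw")
print(f"6+3 erasure coding: {float(erasure_coded):.1f} PB raw")
```

Under these assumptions, 10 PB of data needs 30 PB raw with replication and 15 PB with 6+3 erasure coding, while surviving the loss of any 3 shards instead of 2 copies.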