by macksd
1850 days ago
Worked at Cloudera pre- and post-merger. I thought of on-premises CDH clusters (and similarly HDP clusters) as trying to be the majority of your data infrastructure, but open so that they could integrate with other stuff. It's not just about having big data, but about having one place to store all of that data regardless of schema: massive database tables, logs, etc., all on shared hardware. AND frameworks to process it in different ways in place: SQL queries, Spark jobs, Search, etc. Data gravity was very important to the business model. As more people moved to the cloud, Hadoop-style storage became extremely expensive there (naively moving your Hadoop cluster to 3x replication on EBS volumes would result in a nasty case of sticker shock), so the data would move to S3 / ADLS / GCS. And now you've lost your data gravity. Post-merger, Cloudera focused less on on-premises clusters and tried to offer those same diverse workloads as a multi-cloud SaaS, with more focus on elasticity. This is hard because (a) there's a massive amount of surface area if you want enterprise customers to bring their own accounts, run all these managed open-source services in those accounts, and be multi-cloud, and (b) you're competing more directly with the cloud vendors on their turf, as a customer, partner, and competitor all at once.
|
You had to worry about file sizes, since lots of small files would overload the NameNode. Being a Java app running on the older JVMs, it would do a full GC under heavy load and cause failovers. And it was impossible to get data in or out from outside the cluster using third-party tools.
I remember many companies seeing S3 and just being in shock that it was so cheap and limitless, and that someone else was going to manage it all.