by lars_francke 402 days ago
For anyone looking for an open-source Cloudera alternative based on Kubernetes operators: we're building one (~5 years old now): https://stackable.tech/ & https://github.com/stackabletech/

On-premises open-source S3 is a problem, though. MinIO is not something we're touching, and other than that the field of enterprise-ready solutions looks a bit empty.

5 comments

Don’t SeaweedFS and Ceph/Rook also offer this? Ceph/Rook is definitely enterprise-ready.
> On-premise open-source S3 is a problem though

Rook/Ceph with object storage is pretty bulletproof: https://www.rook.io/docs/rook/v1.17/Storage-Configuration/Ob...

I do wish more systems had high-quality operators out there. A lot of the operators I have looked into are half-baked, unreliable, or unsupported.

Great to see cost-effective alternatives to Cloudera and Databricks! We’ve spent three years building IOMETE, a self-hosted data lakehouse that combines Apache Iceberg and Spark, designed to run natively on Kubernetes. We’re focused on on-premises deployments to address the growing need for data sovereignty and low TCO, with a streamlined setup for large-scale analytics. Early adopters are seeing strong results. Curious about your experience with Trino and Superset—any tips for optimizing performance at scale?
Wouldn't Rook be a good solution? It's definitely proven in much larger settings than MinIO, as it's just Ceph.
What's wrong with MinIO, out of curiosity? Is Ceph an option?
This is at least partially subjective.

https://news.ycombinator.com/item?id=32148007

https://news.ycombinator.com/item?id=35299665

Ceph would be a theoretical option, but a) we don't have a lot of experience with it and b) it's relatively complex to operate. We'd really love to add a lighter option to our stack that's under the stewardship of a foundation.

Try expanding a cluster, or changing the erasure coding configuration, or using anything that needs random access within a file (Parquet), or any day 2 operation.
Even some basic S3 storage patterns weren’t considered when the core storage scheme was designed. It lacks an index and depends on the filesystem to organize objects, then crumbles under lock contention when too many versions are stored, or under walkdir calls when anything is listed. It can’t even support writing to the same set of keys that S3 allows, since it implicitly depends on underlying filesystem paths.
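
To make the key/path point concrete, here is a minimal sketch. It assumes a naive store that uses the S3 key directly as a relative file path; `put_object` and `try_put_all` are hypothetical helpers for illustration, not MinIO's actual on-disk layout. S3 treats keys as opaque strings, so "logs" and "logs/2024" can both exist as objects, while a filesystem cannot have "logs" be a file and a directory at once.

```python
import os
import tempfile


def put_object(root: str, key: str, data: bytes) -> None:
    """Naive store (assumption): the S3 key is used directly as a file path."""
    path = os.path.join(root, *key.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


def try_put_all(keys: list[str]) -> dict[str, bool]:
    """Write each key into a fresh temp dir and record which writes succeed."""
    results: dict[str, bool] = {}
    with tempfile.TemporaryDirectory() as root:
        for key in keys:
            try:
                put_object(root, key, b"x")
                results[key] = True
            except OSError:
                # "logs" already exists as a file (or as a directory),
                # so the second key cannot be mapped onto a path.
                results[key] = False
    return results


# Both keys are valid in S3 at the same time; the path-backed toy rejects one.
results = try_put_all(["logs", "logs/2024"])
print(results)
```

Whichever key is written first wins, and the other fails. A store that keeps its own index of keys would not have this coupling to path semantics.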

They might have added an index by now but gatekept it to their enterprise AIStor offering, since at this point they’ve abandoned any investment in open source, or even the appearance of caring about it. Their initial reaction to this issue says everything - https://github.com/minio/minio/issues/20845#issuecomment-259...

on what?
Look under the hood: the limitations are in the core, and sticking a UI on it does not hide what needs to happen at scale.
Guessing you’re referring to MinIO, not Ceph? Have they still not figured out how to do day 2? I mainly avoid them because of their license and the way they interpret it.
They are not efficient; they use a one-time static hash to create a cluster. After that, it is all duct tape and glue. Want to expand? Add another cluster (pool), and then go looking for the cluster that contains the object. They don't know which cluster has the object, and performance does not scale well with additional clusters. Want to decommission a single node? Drain the cluster. They refer to multiple pools as a single cluster, but it is essentially a set of static hashes that lack the intelligence to locate objects. Got the initial EC configuration not quite right? Sorry, you need to redo the entire cluster.

MinIO is a good fit if you want a small cluster that doesn't require day 2 operational complexity, because you only store a few TBs.

I have not looked into them recently, but I doubt the core has changed. Being VC-funded and looking for an exit makes them overinvest in marketing and storytelling.