by EdwardDiego 1681 days ago
Very much true. I recently saw a joke tweet along the lines of: "It's amazing how many data engineering scaling issues these days are being solved by just paying Snowflake more money."

Spark does take a lot of tuning, but then I'm guessing Databricks offer that service as part of your licensing fee? (I'd hope so if they're selling a product based on FOSS code, there has to be a value add to justify it)


> I'd hope so if they're selling a product based on FOSS code, there has to be a value add to justify it

They have some proprietary features like DBIO [1]. They also have some cloud-specific features like storage autoscaling [2] that would not be available in OSS Spark. Even Delta Lake [3] used to be proprietary, but I suspect the rise of open-source frameworks like Iceberg led them to open-source it.

Shameless plug: while working at a since-shutdown competitor to Databricks, I came up with storage autoscaling long before they did [4], so it's not unlikely that they were "inspired" by us :-)

1. https://docs.databricks.com/spark/latest/spark-sql/dbio-comm...

2. https://databricks.com/blog/2017/12/01/transparent-autoscali...

3. https://delta.io/

4. https://www.qubole.com/blog/auto-scaling-in-qubole-with-aws-...

The open-source Delta is not a replacement for the real thing: they did not include features like optimizing small files (the small-file problem is well known in big data, and becomes much more of a problem once streaming gets involved), among others. It is more of a demo of the real thing. Which does not stop them from repeating everywhere how open they are, of course.

EDIT: the delta also still keeps partitioning information in the hive metastore, while iceberg keeps it in storage, making it a far superior design. Adopting Iceberg is harder because third-party tools like AWS Redshift do not support it, so you have to go 100% of the way.

>the delta also still keeps partitioning information in the hive metastore, while iceberg keeps it in storage, making it a far superior design.

Check out https://github.com/delta-io/delta/blob/3ffb30d86c6acda9b59b9... when you get a chance. You don't need the Hive metastore to query Delta tables, since all metadata for a Delta table is stored alongside the data.
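To make "stored alongside the data" concrete, here is a rough sketch that reads the partition columns straight out of a table's `_delta_log` directory using only the Python standard library. It assumes the usual layout of one JSON-lines file per commit, with the table metadata carried in a `metaData` action. The function name `read_partition_columns` is made up for illustration, and the sketch ignores checkpoint files, so treat it as a demonstration of the idea and not a real reader.

```python
import json
from pathlib import Path


def read_partition_columns(table_path):
    """Return the partition columns from the most recent metaData action.

    Hypothetical helper: only looks at the JSON commit files in _delta_log
    and skips checkpoints, so it is not a complete log reader.
    """
    log_dir = Path(table_path) / "_delta_log"
    partition_columns = None
    # Commit file names are zero-padded version numbers, so sorting the
    # names lexically walks the commits in order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            # A later metaData action replaces an earlier one.
            if "metaData" in action:
                partition_columns = action["metaData"].get("partitionColumns", [])
    return partition_columns
```

No metastore is involved at any point here: the only input is the table's storage path.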

>they did not include features like optimizing small files

For optimizing small files, you could follow the compaction steps described at https://docs.delta.io/latest/best-practices.html#compact-fil...
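As far as I recall, the approach on that page boils down to reading the table, repartitioning it into fewer files, and overwriting it in place with the `dataChange` option set to `false`, so that the rewrite is not treated as new data by downstream readers. A minimal PySpark sketch of that pattern follows. The wrapper name `compact_delta_table` and its parameters are my own, and it assumes `spark` is an existing SparkSession with the Delta Lake package available.

```python
def compact_delta_table(spark, path, num_files):
    """Rewrite the Delta table at `path` into `num_files` larger files.

    Hypothetical wrapper; `spark` is assumed to be a SparkSession that has
    the Delta Lake package on its classpath.
    """
    (
        spark.read.format("delta")
        .load(path)
        .repartition(num_files)
        .write
        # Mark the commit as a pure rearrangement of existing data.
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(path)
    )
```

Something like `compact_delta_table(spark, "/data/events", 16)` would then be run periodically. The path and the file count here are placeholders, and a sensible target depends on the table size.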