by socaldata 1681 days ago
Take all the problems you have had with data warehousing and throw them in a proprietary cloud. That is Snowflake. They are the best today.

Databricks started with the cloud datalake, sitting natively on parquet and using cloud native tools, fully open. Recently they added SQL to help democratize the data in the data lake versus moving it back and forth into a proprietary data warehouse.

The selling point in Databricks is why move the data around when you can just have it in one place IF performance is the same or better.

This is what led to the latest benchmark, which, as written, appears to be unbiased.

In Snowflake's response, however, they condemn it but then submit their own findings. Sounds a lot like Trump telling everyone he had billions of people attend his inauguration, doesn’t it?

Anyhow, I trust independent studies more than I do ones coming from vendors. An independent study cannot be argued or debated unless it was unfairly done. I think we are all smart enough to be careful with studies of any kind, but I can see why Databricks was excited about the findings.

2 comments

Whose result can be trusted is beside the point. I actually believe both experiments were likely conducted in good faith but with incomplete context. The point is there’s no good reason to start a benchmark war to begin with.
> While performing the benchmarks, we noticed that the Snowflake pre-baked TPC-DS dataset had been recreated two days after our benchmark results were announced. An important part of the official benchmark is to verify the creation of the dataset. So, instead of using Snowflake’s pre-baked dataset, we uploaded an official TPC-DS dataset and used identical schema as Snowflake uses on its pre-baked dataset (including the same clustering column sets), on identical cluster size (4XL). We then ran and timed the POWER test three times. The first cold run took 10,085 secs, and the fastest of the three runs took 7,276 seconds. *Just to recap, we loaded the official TPC-DS dataset into Snowflake, timed how long it takes to run the power test, and it took 1.9x longer (best of 3) than what Snowflake reported in their blog.* https://databricks.com/blog/2021/11/15/snowflake-claims-simi...
Delta Lake is not meaningfully more "open" than whatever Snowflake (or BigQuery and Redshift) are doing. It does not require any less "moving data around".

With all of these, the data sits on cloud storage and compute is done by cloud machines. The difference between Databricks and the others is that with Databricks, you can take a look at that bucket. But you're not going to be able to do much with that data without paying for Databricks compute, since the open source Delta library is not usable in the real world.

Since commercial data warehouses are an enterprise product for enterprise companies (small companies can stick with normal databases or SaaS, and unicorns seem to roll their own with Presto/Trino, Iceberg, Spark and k8s nowadays), the vendor needs to be, most of all, a reliable partner. And Databricks' behavior does not inspire confidence that they are one.

If I'm outsourcing my analytical platform to a vendor, I want them to be almost boring. Not some growth hacking, guerrilla marketing, sketchy benchmark posting techbros.

At the end of the day, anyone making multi-year, million-dollar decisions in this space should run their own evaluation. Our evaluation showed that there's a noticeable gap between what Databricks promises and what they deliver. I have not worked with Snowflake to compare.

Delta lake is very much open. You can install delta lake and run it yourself. It's a transaction layer running over parquet files. You can go to the delta.io GitHub and install binaries yourself. Snowflake cannot be run independently of their cloud.
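
To make "a transaction layer running over parquet files" concrete, here is a toy sketch of the general idea: an ordered log of commits that says which data files are currently part of the table. This is not the Delta Lake library or its real log format. The names `commit`, `snapshot` and the `_log` directory are made up for illustration, and the "parquet" files are just names; nothing is actually written in parquet.

```python
import json
import os
import tempfile


def commit(table_dir, version, actions):
    """Write one commit as a numbered JSON-lines file in the log directory."""
    log_dir = os.path.join(table_dir, "_log")  # hypothetical directory name
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" fails if the file already exists, so two writers on a local
    # filesystem cannot both claim the same version number.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(table_dir):
    """Replay commits in order to find which data files are currently live."""
    log_dir = os.path.join(table_dir, "_log")
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return sorted(live)


table = tempfile.mkdtemp()
commit(table, 0, [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}])
# A rewrite (for example, compaction) swaps files in a single commit.
commit(table, 1, [
    {"remove": "part-0.parquet"},
    {"remove": "part-1.parquet"},
    {"add": "part-2.parquet"},
])
print(snapshot(table))  # ['part-2.parquet']
```

Readers only ever see the files named by complete commits, which is what lets a pile of immutable files behave like a table with atomic updates.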

The rest of this is some vague claims of Databricks being unreliable techbros blah blah which is just emotionally charged hot air rather than being based on anything.

RE who to pick: run them side by side. Use Snowflake for non-technical staff/BI load on prepared cuts of data. It's batteries included, with fewer knobs to twiddle for optimisation. Databricks/Spark has a learning code and isn't suitable for non-technical staff. But it gives a lot more options for processing all the stuff that doesn't fit neatly into data clustering.

Sort of. You can stop using Databricks service, and keep using Delta lake. But Databricks code is not open. Delta Lake is not equivalent to Databricks delta. The value prop is that customers, if they choose to not retain databricks service, can migrate off databricks and still use the open source version of delta lake, which again, is not as good as databricks delta.
Ok, you've got me there: it's not 100% the exact same code Databricks are using, and there are some optimisations (that normally do end up downstream anyway). But I think it's getting a bit philosophical to say it's not open when you can run a Delta Lake "on-prem" and shuffle data between Databricks and your own setup with few/no changes. Now, the Databricks SQL product afaik is not open; that's a proprietary C++ engine comparable to Snowflake, so I think these discussions might get a lot more confusing in the future when Databricks doesn't just mean various flavours of Spark.
Yes, Photon is completely proprietary. Databricks does have a "delta" version, but it is actually completely baked into the Databricks runtime. So we are both correct. Ali (Databricks CEO) actually has gone on record to say Databricks is 90% proprietary code. There is an open source version, but it is not as good. The culture within Databricks, though, is completely open source, unlike Snowflake, where the culture is definitely not open source. I think it affects the culture too.
By "learning code" I mean learning curve. You need to be able to code a little bit at a minimum to use Spark effectively. Even if a lot of the time you can just go with the SQL interface, it isn't actually a SQL database under the surface, so that can be a bit misleading if you don't know what's going on.