Hacker News
by oxfordmale 1252 days ago
Benchmarks are generally useless unless they test real-world scenarios. The Databricks data warehouse record cost $5,190,345 USD to run over a period of 3 years. If I spent that amount of money, I would get fired.

Such benchmarks also ignore the engineering expertise an organisation has. Do you need to be an expert to fine-tune 6000 parameters, or can you tune the system to an acceptable standard by reading a few blogs?

Some people pointed out that the actual query only cost $242. My counterargument is that this appears to be based on buying reserved instances from AWS for 3 years. In real life this query would also run daily, or at least you would need several iterations to get the results you want.

The costs also include a super low-budget laptop ($279). It is more than fine for running the query; however, you wouldn't use it as a development machine. This shows these results have been heavily massaged.

3 comments

Not to mention that engineering expertise is just the potential. You then also need the time and the willingness to actually do that kind of tedious and potentially slow-moving work instead of all the other things on your list. And as we all know, the list of things that can be improved in any system typically grows over time.

The 'out of the box' or naive and un-optimized performance of something is the baseline. And with something as huge and self-contained as a database you want the happy path to be fine in terms of performance.

I was curious about it, so I tried to figure out where you got this number from. It looks like your source is https://www.tpc.org/results/individual_results/databricks/da..., but you interpreted it wrong. The number you quoted is the projected 3-year cost of ownership of the system configuration that was used to run the test, so the actual cost of the run is a small fraction of the number you quoted.
It is worth noting that the compute cost appears to be based on purchasing reserved instances from AWS. The price of on-demand instances is much higher.

The laptop is also very low budget. I am sure it is fine for running the final query; however, you would be unlikely to be able to use it as a development machine.

That number should be the figure described as "The total 3-year price of the entire Priced Configuration must be reported, including: hardware, software, and maintenance charges", so they just took the cost of the hardware used for the benchmark and extended it to 3 years.

If you look at the blog post, https://www.databricks.com/blog/2021/11/02/databricks-sets-o..., you will see that it cost $242.
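To put the two figures side by side, here is a rough back-of-the-envelope sketch. It assumes the 3-year price is spread evenly over every hour of three 365-day years, which is a simplification of mine and not necessarily how the benchmark pricing or the blog post arrives at $242. The function names are made up for illustration.

```python
# Figures quoted in this thread.
THREE_YEAR_PRICE_USD = 5_190_345  # total 3-year price of the priced configuration
RECORD_RUN_COST_USD = 242         # cost of the record run, per the blog post

# Assumption: linear amortization over three 365-day years, running 24/7.
HOURS_IN_3_YEARS = 3 * 365 * 24


def hourly_rate(total_price_usd: float, hours: int) -> float:
    """Price per hour if the total is spread evenly over all hours."""
    return total_price_usd / hours


def implied_run_hours(run_cost_usd: float, rate_usd_per_hour: float) -> float:
    """Run duration implied by a run cost at a given hourly rate."""
    return run_cost_usd / rate_usd_per_hour


def cost_of_repeated_runs(run_cost_usd: float, runs: int) -> float:
    """Total cost of repeating the same run a number of times."""
    return run_cost_usd * runs


rate = hourly_rate(THREE_YEAR_PRICE_USD, HOURS_IN_3_YEARS)
hours = implied_run_hours(RECORD_RUN_COST_USD, rate)
# Hypothetical scenario: one run per day for the whole 3-year period.
daily_total = cost_of_repeated_runs(RECORD_RUN_COST_USD, 3 * 365)

print(f"Amortized rate: ${rate:,.2f} per hour")
print(f"Implied duration of a $242 run: {hours:.2f} hours")
print(f"One run per day for 3 years: ${daily_total:,}")
print(f"Share of the 3-year price: {daily_total / THREE_YEAR_PRICE_USD:.1%}")
```

Under these assumptions, even running the query once a day for three years comes to only a small share of the 3-year price, because the cluster is billed only while it runs.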

Yes, the final run, which established the record, cost $242. I would love to know what the total compute costs for this project were. In real-world situations, you run this query daily, or at least multiple times to fine-tune it. The point still stands that I can't afford to run on this type of hardware, as it is too expensive, nor do I have such heavy workloads, so these results are not relevant.
Yes, but that's the best thing about the cloud: your cluster doesn't run when you don't need it, plus you can take advantage of spot instances, autoscaling, etc. And you won't run a TPC test every half an hour.