Hacker News
by sys13 880 days ago
(I do some training for Databricks)

1. Yeah, cluster startup time can be not fun. Here are some solutions:
- pools (these keep instances around so you don't have to wait for the cloud to provision them)
- serverless SQL warehouses (viable if you're doing only SQL)
- one job with multiple tasks that share the same job cluster. Delta Live Tables does a similar thing but with streaming autoscaling
- streaming: the cluster never needs to go down. You can run multiple streams on the same cluster so they load balance each other
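
To make the "one job, multiple tasks, same job cluster" option concrete, here is a rough sketch that builds a request body in the shape the Jobs API 2.1 `jobs/create` endpoint takes. The helper name `shared_cluster_job`, the notebook paths, the runtime label, the worker count, and the node type are all placeholders I made up for illustration; it only builds the dictionary and does not call Databricks.

```python
import json


def shared_cluster_job(name, notebook_paths, instance_pool_id=None):
    """Build a jobs/create-style payload where every task reuses one job cluster.

    Hypothetical helper: cluster size, runtime label, and node type are
    placeholder values, not recommendations.
    """
    new_cluster = {
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime label
        "num_workers": 2,                     # placeholder size
    }
    if instance_pool_id:
        # Drawing from a pool: the node type comes from the pool itself.
        new_cluster["instance_pool_id"] = instance_pool_id
    else:
        new_cluster["node_type_id"] = "i3.xlarge"  # placeholder AWS node type

    tasks = []
    previous_key = None
    for i, path in enumerate(notebook_paths):
        task = {
            "task_key": f"step_{i}",
            "job_cluster_key": "shared",  # every task points at the same cluster
            "notebook_task": {"notebook_path": path},
        }
        if previous_key is not None:
            # Run tasks one after another so the cluster starts only once.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task["task_key"]

    return {
        "name": name,
        "job_clusters": [{"job_cluster_key": "shared", "new_cluster": new_cluster}],
        "tasks": tasks,
    }


if __name__ == "__main__":
    body = shared_cluster_job("nightly", ["/Jobs/ingest", "/Jobs/transform"])
    print(json.dumps(body, indent=2))
```

The point of the shape is that the cluster is declared once under `job_clusters` and each task refers to it by `job_cluster_key`, so you pay the startup time once per job run rather than once per task.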

How do you keep the pool costs manageable?

I see a lot of companies that get sold on Databricks and then are surprised by the cost.

Pool costs become more manageable as more clusters share the same pool, and more clusters can share it if they use the same instance families. You can also avoid any incremental cost by setting the timeout to 0, though that makes the pool less useful. If you purchase reserved instances from Azure/AWS, you might as well build a pool on those too. You may also want to check out fleet instance types.
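
As a rough illustration of the timeout trade-off, here is a sketch that builds a request body in the shape of the Instance Pools API `instance-pools/create` endpoint and estimates what warm idle instances cost. I'm assuming "timeout" above maps to the `idle_instance_autotermination_minutes` field; the helper names, the pool name, the node type, and the hourly price are made up for the example, and nothing here calls Databricks.

```python
def pool_payload(name, node_type_id, min_idle=0, idle_timeout_minutes=0,
                 max_capacity=None):
    """Build an instance-pools/create-style payload (hypothetical helper).

    With min_idle=0 and idle_timeout_minutes=0 the pool keeps nothing warm,
    so it adds no idle cost but also saves little startup time.
    """
    payload = {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        # Assumption: this is the "timeout" the comment refers to.
        "idle_instance_autotermination_minutes": idle_timeout_minutes,
    }
    if max_capacity is not None:
        payload["max_capacity"] = max_capacity
    return payload


def idle_cost(min_idle, hourly_price, hours):
    """Cloud bill for instances sitting idle in the pool over a period.

    hourly_price is whatever your cloud provider charges per instance-hour;
    the value used in the example below is invented.
    """
    return min_idle * hourly_price * hours


if __name__ == "__main__":
    print(pool_payload("shared-etl", "i3.xlarge"))
    # Two warm instances at an invented 0.50 per hour for a 10 hour window.
    print(idle_cost(2, 0.5, 10))
```

Comparing `idle_cost` for a few `min_idle` values against how many cluster starts the pool actually serves is a quick way to see whether keeping instances warm is paying for itself.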