by fnordpiglet
1397 days ago
It is a terrible article. I've been on the engineering side of these big data platforms, including Snowflake in its early days, ParAccel (Redshift's code ancestor), Redshift, and others you probably use but don't realize are actually hyperscale database engines. The author missed the mark consistently.

I chortled when he discussed the Redshift WLM, which I helped design a very long time ago and which is absolute garbage. Snowflake's entire point is that you can decouple the storage and the database from the warehouse query engine to provide total isolation from noisy neighbors. If you're encountering noisy neighbors, you're using the product entirely wrong.

And you're right. The motivation Snowflake has to improve is survival. It's not like their architecture is impossible to replicate. Redshift is doing a total reorganization and rewrite of the product to compete more directly with Snowflake (Redshift AQUA etc.).

The author also seems to completely discount the value of SaaS: outsourcing database and storage operations to Snowflake, whose only focus is operating the database product. Running your own clusters is an exercise that seems smart in the first few months; then, like a puppy when it grows up, you're stuck with a dog. If you love dogs and train them well, then great. But the fact is most people are terrible dog owners, and the same is true for MPP clusters. Being able to focus exclusively on query management operations is really ideal. Highly stateful distributed products are a PITA.

He also rants about Snowflake not telling him the hardware. Snowflake runs in EC2, GCP, and Azure. You can literally guess the hardware types - there just aren't that many saddle point instance types for that sort of workload. Discussing SSD vs HDD is also an obvious sign of ignorance: Snowflake's basic premise is that it does very wide, highly concurrent S3 GETs and scans of the data, using a FoundationDB metadata catalog to help prune. Being in AWS, it's implausible they use HDDs, and realistically they could elide SSDs (I do not remember if they use local disks for caching, but it's stateless regardless). The unit costing being hardware agnostic is totally normal too: they don't have to expose the details of their costing to you, because they normalize it to a standard fictional unit.
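To make the pruning point concrete, here is a minimal sketch of the general idea: keep per-file min/max metadata, skip files that cannot match, then fetch the survivors concurrently. It is an illustration under stated assumptions, not Snowflake's implementation. `PartitionMeta`, `prune`, `fake_get`, and the file keys are all hypothetical names, and the "object store" is a stub.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical catalog entry: one stored file plus min/max of one column."""
    key: str
    min_value: int
    max_value: int


def prune(catalog, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [p for p in catalog if p.max_value >= lo and p.min_value <= hi]


def fake_get(key):
    """Stand-in for an object-store GET (assumption: no real storage is touched)."""
    return f"bytes-of-{key}"


catalog = [
    PartitionMeta("part-0000", 0, 99),
    PartitionMeta("part-0001", 100, 199),
    PartitionMeta("part-0002", 200, 299),
    PartitionMeta("part-0003", 150, 250),
]

# Predicate: WHERE col BETWEEN 120 AND 180
to_scan = prune(catalog, 120, 180)

# Fetch the surviving files in parallel; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    blobs = list(pool.map(fake_get, [p.key for p in to_scan]))

print([p.key for p in to_scan])  # ['part-0001', 'part-0003']
```

The metadata lookup is cheap and the fetches are independent, which is why the type of local disk matters much less than how wide the reads can go.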
|
The thing it's most right about is the power imbalance and the innovator's dilemma. More than once we've found that query performance/cost was too high, complained about it, and Snowflake have "made a configuration change" (undisclosed) that brought the cost down.