Hacker News new | ask | show | jobs
by orf 1666 days ago
We’ve just moved to Snowflake. I haven’t really been impressed with some of the new features added to Redshift, it seems like too little and too late.

The JSON support (SUPER type) is kind of cool, and they are moving towards more “automatic” sorting + partitioning, but it’s just all a bit shit to be honest.

We encountered major bugs with data-sharing, our clusters keep insisting that zstd is the best compression format to use for all our data (but then never actually using it), materialised views often fail to update and understanding why is a nightmare, terrible performance if your strings are varchar(max) (guess what Glue sets them to…), Redshift data often just dies (4 hour downtime recently, no status page) and has some really weird semantics around listing queries, before the data API you couldn’t run async queries and it’s eventbridge integration straight up doesn’t work, nightmare bugs in the Java connection library that don’t show up using psql, tiny set of types (no arrays, uuids), unkillable queries, AQUA actually causing everything to slow down hugely, critical release notes posted only in a fucking random forum, etc etc.

Snowflake has apparently sorted this, as well as including ingestion tools (snowpipe) that you’d otherwise have to stitch together with AWS Glue or something (a cursed service if ever there was one).

That being said, in some cases Redshift absolutely flies. But the real world isn’t filled with ideal schemas and natural sort keys. It’s messy. And Snowflake deals with messy better.

3 comments

I've had terrible experiences with Snowflake "automatic" optimizations. You have zero visibility or control over partitioning, join strategy, or anything else compared to Spark/Databricks.
> You have zero visibility or control over partitioning

snowflake give you visiblity into clustering [1] and in the query profile view you can see how pruning is working( or not working)

Can you give an example of what you visibility you would like to see in terms of partitioning?

1. https://docs.snowflake.com/en/sql-reference/functions/system...

Being able to see exactly which columns are partitioned on so I know whether my query is going to be efficient or require full scan+materialization. I understand from your link Snowflake has clustering keys but it seems like this only "suggests" that Snowflake should partition on that key. Also with Spark I can look at the query plan while it's executing and figure out what part of my query is bottlenecking. Seems like Snowflake requires a query to complete for me to do that.
Clustering information is available in the object browser in the UI, as well as the results of a show table.

https://docs.snowflake.com/en/user-guide/tables-clustering-k...

Snowflake does not automatically cluster tables, because it is something users need to decide if they want to or not, as well as the best clustering strategy. In some instances it is better to cluster on one column to increase performance of loads, but a different key would benefit users querying the table. Which is the right choice? It's up to the user to decide. What if you want both? Create the table with the clustering key to optimize load, then create a materialized view that clusters by the keys best for the queries and Snowflake will automatically rewrite queries to use the MV if it makes more sense. Problem(s) solved.

You can access the query profile while it is running to understand which step is currently running and where your bottlenecks are.

https://docs.snowflake.com/en/user-guide/ui-query-profile.ht...

I use this often when troubleshooting long running queries, most often users forget a join condition and it results in a semi-product join. This is easily caught with high row counts coming out of two nodes with low row counts.

What do you mean "too late"? Redshift was released in 2012.

Edit for future readers: the original comment was "I haven’t really been impressed with Redshift, it seems like too little and too late".

Sorry, I tend to abbreviate too much. Given the context of the article (new sexy cool redshift features ripped from Snowflake) I meant to convey “the direction redshift is heading in seems too little too late”.
Oh ok. I don't know Snowflake so I had no idea this new feature was a (late) response to Snowflake.
This article was released 30 NOV 2021. So 9 years,
given snowflake also uses s3. Could aws do something with redshift that moves computation closer to s3 and be even faster than snowflake. Basically take advantage of the fact that they own s3.
they already have https://aws.amazon.com/redshift/features/aqua/

basically distributing compute down to the actual storage nodes

Yeah that’s cool except it made all our queries slower. I’ll add that to my list above, forgot about that.
I've not yet investigated AQUA, but it's absolutely on the list.

The idea that is handles some SQL functionality with additional parallelism seems fine, but then I read that if a query uses any AQUA functionality, it's something like the entire query runs in AQUA, and that seems to me to be potentially a problem.

I'm very curious to start investigating, but there's a list of higher priority matters to investigate.