| HN Mirror


by imslowbutnice 1678 days ago

I still don't get how much optimization was done for the Snowflake TPC-DS power run. This is what I'm seeing so far, and what I'm foggy on:

DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the timer started. Databricks started the timer, then generated all the queries. Then Databricks loaded from CSV into Delta format (some of the Delta tables were partitioned by date) and also computed statistics. Then all of the queries were executed, 1-99, for TPC-DS 100TB.

SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the timer started. Databricks started the timer, then generated all the queries. Then they loaded from S3 into Snowflake tables by (I'm not sure about these next parts) creating external stages and then running "copy into" statements, I guess? Or maybe just using copy into straight from an S3 bucket; that part doesn't matter much. But it's not clear whether they also allowed the target tables to have partitioning/clustering keys at all. Then all of the queries were executed, 1-99, for TPC-DS 100TB.
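To make the SF1 question concrete, here is a rough Python sketch that just builds the Snowflake statements I'm imagining for one table, with and without a clustering key. Everything specific in it is my assumption, not something from the benchmark write-up: the bucket URL and stage name are made up, the stage credentials are left out, and picking ss_sold_date_sk as the clustering column is only a guess at what "additional clustering columns" might look like.

```python
# Sketch only: builds SQL strings, does not connect to Snowflake.
# The bucket URL, stage name, and clustering column are hypothetical.

def snowflake_load_sql(table, s3_url, cluster_by=None):
    """Return the statements to load one table from S3, optionally clustered."""
    stage = f"{table}_stage"
    statements = [
        # External stage pointing at the generated files (credentials omitted).
        f"CREATE OR REPLACE STAGE {stage} URL = '{s3_url}'",
        # The TPC-DS kit writes pipe-delimited text files.
        f"COPY INTO {table} FROM @{stage} "
        "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')",
    ]
    if cluster_by:
        # This is the step I can't tell was allowed or not.
        columns = ", ".join(cluster_by)
        statements.append(f"ALTER TABLE {table} CLUSTER BY ({columns})")
    return statements


url = "s3://my-bucket/tpcds/store_sales/"  # hypothetical location
plain = snowflake_load_sql("store_sales", url)
tuned = snowflake_load_sql("store_sales", url, cluster_by=["ss_sold_date_sk"])

for statement in tuned:
    print(statement + ";")
```

The only difference between the two variants is the final ALTER TABLE, which is exactly the part I can't tell was in or out of bounds for the timed run.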

It's just hard to say exactly what "They were not allowed to apply any optimizations that would require deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns)" means. What does that actually rule out? At a glance, though, this looks very impressive for Databricks, but I just want to be sure before I commit to an opinion.