Hacker News
by tommyphongs 1391 days ago
I don't have experience with Snowflake, but I do with Cloudera's tech stack on on-premises infrastructure. Both Cloudera and Snowflake use the same approach: separating compute from storage. The main purpose is to trade performance for scalability and to make the system easy to maintain without any knowledge of the user's data, which makes the solution easy to sell to a wide range of customers without caring about the customer's cost (maybe that is also one of their purposes).

In my experience, Cloudera's stack becomes a very complex, brute-force system. We need to install HDFS to store the data (the storage layer), the Hive metastore (which basically uses MySQL to keep the mapping between a table and that table's HDFS files) to hold metadata about what is in HDFS, and Impala as the query engine (the compute layer). Because the compute layer doesn't know much about how the data is organized, we are very limited when we want to optimize the system. A query like 'select * from TABLE limit 1' leads to a scan over all the data across many HDFS files, and because Impala is an in-memory engine, scanning the whole table exceeds the memory limit. Because of that, our DAs can't use sampled data to quickly explore our data. It all leads to hell, and because so many things can affect the system (HDFS, Impala, the Hive metastore, etc.), it is very hard to fix a problem when one occurs.
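To illustrate the 'limit 1' point, here is a toy Python model of the three layers. Everything in it is hypothetical (the `STORAGE` and `METASTORE` dicts, the file paths, both function names); it is not how Impala or the Hive metastore are actually implemented, only a sketch of the difference between a compute layer that reads every file before applying the limit and one that can stop as soon as it has enough rows.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]

# Hypothetical storage layer: file path -> rows stored in that file.
STORAGE: Dict[str, List[Row]] = {
    "/warehouse/events/part-0": [{"id": 1}, {"id": 2}],
    "/warehouse/events/part-1": [{"id": 3}, {"id": 4}],
    "/warehouse/events/part-2": [{"id": 5}, {"id": 6}],
}

# Hypothetical metadata store: table name -> files that make up the table.
METASTORE: Dict[str, List[str]] = {"events": list(STORAGE)}


def select_limit_full_scan(table: str, limit: int) -> Tuple[List[Row], List[str]]:
    """Read every file of the table into memory, then apply the limit."""
    rows: List[Row] = []
    touched: List[str] = []
    for path in METASTORE[table]:
        touched.append(path)
        rows.extend(STORAGE[path])  # the whole table ends up in memory
    return rows[:limit], touched


def select_limit_early_stop(table: str, limit: int) -> Tuple[List[Row], List[str]]:
    """Stop reading as soon as `limit` rows have been collected."""
    rows: List[Row] = []
    touched: List[str] = []
    if limit <= 0:
        return rows, touched
    for path in METASTORE[table]:
        touched.append(path)
        for row in STORAGE[path]:
            rows.append(row)
            if len(rows) >= limit:
                return rows, touched
    return rows, touched


if __name__ == "__main__":
    for fn in (select_limit_full_scan, select_limit_early_stop):
        rows, touched = fn("events", 1)
        print(fn.__name__, "rows:", rows, "files read:", len(touched))
```

Both functions return the same row, but the first one touches every file of the table and holds all of it in memory, which is the behavior I was complaining about.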