by i_like_waiting 1433 days ago
It shows once again how important knowing the fundamentals is at a more advanced level. There is this overall notion that "very wide tables are the future of DWH", but in reality it depends on context and how your setup is built.

Very interesting insight.

3 comments

But a row-based database is not a data warehouse, and a columnar database doesn't care about column order because they're not stored together.
> because they're not stored together

Most use hybrid columnar and store chunks of rows in column-major order (Snowflake, Spanner, Parquet, Arrow).
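To make "chunks of rows in column-major order" concrete, here is a minimal Python sketch of the idea. It is not how any of the systems named above actually lay out data on disk; the function names `to_row_groups` and `scan_column` and the sample table are made up for illustration.

```python
from typing import Any, Iterator


def to_row_groups(rows: list[dict[str, Any]], group_size: int) -> list[dict[str, list[Any]]]:
    """Split rows into fixed-size chunks and store each chunk column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a chunk, values of the same column sit next to each other.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


def scan_column(groups: list[dict[str, list[Any]]], name: str) -> Iterator[Any]:
    """Read one column across all row groups without touching the others."""
    for group in groups:
        yield from group[name]


# Hypothetical sample data: 5 rows, 3 columns.
rows = [{"id": i, "country": "DE" if i % 2 else "US", "amount": i * 10} for i in range(5)]
groups = to_row_groups(rows, group_size=2)
amounts = list(scan_column(groups, "amount"))
```

Rows stay together at the chunk level, so a whole row can be rebuilt from one group, while a scan over `amount` only walks the `amount` list of each group.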

Doesn't it depend on database size? At least I heard somewhere online that column-based storage is not really worth implementing if your regular tables have fewer than 10M rows.

So for small DWH, I am just using PostgreSQL

I mean, if it's small I wouldn't really call it DWH.
The term "data warehouse" is commonly used to define two different things:

1. The database where all business data for a wide swath of a company's operational groups relevant to reporting and analytics lands.

2. A specific type of database appliance/platform optimized for holding the type of data described above; the product is typically multi-node and often based around columnar storage. More recently these also emphasize ingesting or providing transparent access to unstructured data (typically with functionality to push down queries to big data stores or other external data sources).

The first is an observation about use cases and is agnostic to technology. The second is a specific type of product that fills the use case of large instances of the first.

That's a good perspective on it, thanks. I'm absolutely guilty of seeing the term through the lens of the technical solution rather than the problem class.
Postgres -> Citus is a great path for a v0 -> v1 data warehouse as it scales.
All knowledge is worth having and the pursuit of this knowledge will help us become more than we were. I appreciate that.

That said, consider the path of data warehouses over the last 20 years. Previously, you needed teams of data developers and engineers with modeling experts to put forth a data warehouse that might solve a company's problem. Now, you _can_ toss very wide tables into a cloud data platform (Snowflake, Redshift Serverless, Synapse) and it will likely 'just work'. Sure, it can be faster, but these problems are slowly being removed from the set of things we have to care about.

I'm a data specialist, and my knowledge is going to be worthwhile for a good long time, but I think the premium that exists for it will go down.

Actually, this has been false or at least highly questionable for quite some time.

"Command-line tools can be faster than your Hadoop cluster"

http://aadrake.com/command-line-tools-can-be-235x-faster-tha...

https://news.ycombinator.com/item?id=8908462

Very wide tables pair well with columnar stores like ClickHouse, not with row-based ones.
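A back-of-the-envelope Python sketch of why that is. It uses a deliberately crude model that assumes a row store touches every column of each scanned row while a column store touches only the columns the query names; it ignores indexes, compression, and page layout, and the table dimensions are invented.

```python
def cells_read_row_store(n_rows: int, n_cols: int) -> int:
    """Crude model: a full scan of a row store reads whole rows."""
    return n_rows * n_cols


def cells_read_column_store(n_rows: int, cols_needed: int) -> int:
    """Crude model: a column store reads only the referenced columns."""
    return n_rows * cols_needed


# Hypothetical wide table: 1M rows, 200 columns, query touches 3 of them.
n_rows, n_cols, cols_needed = 1_000_000, 200, 3
row_cells = cells_read_row_store(n_rows, n_cols)
col_cells = cells_read_column_store(n_rows, cols_needed)
ratio = row_cells / col_cells
```

Under this model the gap grows linearly with table width, which is why the "just make it wide" approach is much cheaper on a columnar engine.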