Hacker News new | ask | show | jobs
by markus_zhang 2101 days ago
I'm wondering what you guys think about columnar databases and wide tables.

We use Vertica and from senior colleagues and even Vertica developers I got the impression that big wide tables are good because it eliminates the needs of join. Thus we don't use star schema and just wide tables.

However I think data modelling is also about embedding proper business logic and it would be a lot more confusing if two unrelated columns are stuffed in the same table.

For example let's say my event based game has a base for each player and he can do a lot of things in his base. If we use a wide table then we wilk see columns related to factory building side by side with pig feeding, and because each small feature has some unique columns, a lot of those columns are NULL simply because this event json doesn't even have th fields.

I'm wondering if we should use Vertica for a transactional type table and then use say sql server for dwh and build more traditional data modelling. But this could be awfully wrong maybe...

2 comments

A few comments based on lots and lots of experience:

- Wide tables in columnar DBs can make some analytics queries easier to write and sometimes more performant.

- Wide tables can come at high storage cost and make other queries less performant (like SELECT *)

- How much of these things happen is extremely DB dependent. How does the DB's underlying storage mechanism work? How is the data partitioned and distributed? How sophisticated and storage-aware is the query planner? How large is your data? How fast is your data growing? How fast do you need a new data point to be reflected in your dashboard?

There's no free lunch when it comes to this stuff. A perfect solution doesn't exist, but the technology is getting better all the time. I've personally never used sql server as a data warehouse but plenty of folks do.

The stuff I use that I recommend very highly - Snowflake, TimescaleDB, vanilla Postgres. Also, use dbt.

Thanks teej for the answer.

>How much of these things happen is extremely DB dependent. How does the DB's underlying storage mechanism work? How is the data partitioned and distributed? How sophisticated and storage-aware is the query planner? How large is your data? How fast is your data growing? How fast do you need a new data point to be reflected in your dashboard?

I think most of my frustration comes from not knowing these stuffs. I work as a BA-BI hybrid as I found my data analysis skills are good complements to data modelling/airflow type of work, so I persuaded my manager to give me some BI work. But that also means I don't have the technical knowledge such as DB internals (and TBH I can't even find a book for Vertica on that matter).

Our DB and DBAs and all ETL processes are located in HQ and we actually don't own our databases. This, I guess, adds an extra layer of discomfirt as we are effectively cut off from database-level optimization. Our data engineer is about to leave because he has no DE work to do (every ETL has to go through HQ's process and we only need to submit some configuration files).

We don't have access to the databases you recommended (again HQ controls that), but I do believe I could try DBT, may I ask how do you use it? From my understanding it is mostly a transformation tool, but what makes it stand out?

No amount of technology can fix a dysfunctional org.

A large amount of the work of BI is getting data into the right shape in order to ask questions about it.

- What’s level on average do new players drop off?

- How are sales of the new skins we introduced performing vs the last set?

- How often are players engaging with feature A vs B?

You might be able to answer all of these questions with just event data in a wide table, but you’ll write very different queries to answer them. dbt is a way to write queries on your source data that leads to systematic, repeatable, and reusable new tables for answering analytics questions.

No. Vertica excels at being a column store database - a data model fit for analytics and OLAP. You should pretty much never use Vertica nor column stores for OLTP/transactional use cases.

It's not about the columns being wide, per se that is important in Vertica - although that is a benefit. It is about columnar storage and optimized querying, over row storage and optimized writing.

It would be the other way around - SQL Server or other transactional databases for that part. Actually, these days, SQL Server supports column store indexes on top of transactional tables- which gives you HTAP - hybrid transactional/analytical processing. A few other systems do this (Oracle too).

If you do HTAP, in a way you could even avoid using Vertica for data warehouse/OLAP use cases. Or you can build a separate dimensional data model in SQL Server and keep everything in there.

But if you already have Vertica, which is quite fast and good at OLAP queries on dimensional models, use it for that (enterprise data warehouse), and feed data from transactional systems in.

Thanks! Yeah you are right, Vertica is good at OLAP and DWH. I'm just torn between wide table and dimensional data modelling and it's Vertica's problem. I'm probably attacking the wrong monster here.

I'm wondering if you have any source for OLAP data modelling under big data and columnar database?

IMHO the Wikipedia definition of OLAP [1] covers the general topic quite well. Doing OLAP in the big data world, if you use the sense of OLAP equating to multidimensional analysis of data aggregated into cubes, there are several approaches to it in the open source world, such as Apache Kylin [2].

The basic idea remains the same though - building dimensional data models, and then using OLAP/cube technology with MDX on top.

That said, if I think about it, my experiences with wide columnar databases were with HBase and Cassandra. The main goals when using those systems were usually three things: A) getting many attributes/variables being tracked in a single row as part of a measurement or observation of some event - and needing to write the data ridiculously fast (append only) and report on that data in real time; B) coming up with creative data processing patterns to aggregate data into higher level tabular structures, to make for blazing fast queries on these aggregates, which could be updating in real time, and C) being able to store huge volumes of data on a redundant cluster in an efficient format.

If you look at A, B and C above, these are, in a sense, the same reasons why people wish to implement OLAP cube technology, for the most part - aggregating huge results into aggregated views, lightning fast query responses and easy of navigating - "slicing and dicing" the data. However, OLAP cubes are most commonly associated with batch processing - there are real-time OLAP cube technologies, that basically work by doing change data detection on the sources.

I kind of believe that, you don't need specialized OLAP technology to accomplish the same thing - it's all about pre-aggregated and calculated data, efficient writing/reading and storage. I mean, OLAP technology is there to try to make that easier to do, but you can do the same thing with certain tricks in traditional dimensional models (accumulating snapshots, periodic snapshots, aggregate fact tables). Or in wide columnar databases (storing aggregates in wide tables). It's just maybe easier to use OLAP technology, otherwise, you kind of need to do all the scheduling, processing and aggregating of data yourself - but it's totally doable too and I have seen that work really well.

I personally have never been a fan of OLAP cube technology (MDX), even for how powerful it is, because it was so specialized - but that opinion was formed from a legacy of years of only being able to get these features from commercial vendors.

If you look at the legacy of SQL as a language, I actually believe that there was an attempt to unify OLAP queries with SQL by implementing "CUBE" and "ROLLUP" as part of ANSI SQL standard, but it was never as powerful as MDX and the OLAP engines that were built.

I think today, the world has changed, as we have open source OLAP and MDX options available - Pentaho Mondrian was popular for a while, not sure how much today. The current hot thing seems to be Apache Kylin. I also played with Apache Druid, in the past, although it doesn't seem as popular any more, as far as I can tell. Yandex Clickhouse is very popular also these days, for being extremely fast and open source. Again, Wikipedia has a good list of OLAP systems [3].

More to the point, if you use an event-based streaming approach (things like Kafka) - where the event comes directly from the source, combined with writing the data efficiently to something like an append-only, wide, column data store.. well, you have something pretty powerful. The only thing that a solution like this lacks, sometimes, is a powerful query engine - for example, Cassandra or HBase cannot do joins. There are ways to simulate joins with Apache Hive or other technologies, sitting on top of Cassandra or HBase.. but not sure I ever saw those work well.

I mean, the goal of aggregation is to avoid needing joins.. but, you'd be surprised how once in a while, you really wish you could just join this with that to come up with a new result - even in things like Cassandra or HBase. If you do your design right, you can avoid it.. but I find myself wanting it sometimes.

Really, in OLAP, everything is built to avoid JOINs or minimize them as much as possible. It's one reason, I believe though, that having JOINs in a dimensional model is still helpful. But maybe that's the old geezer who still likes SQL in me talking :)

Lastly, I thought I might mention - anyone who says that leaving all data without any kind of model structure or architecture (ex. flat or raw data only) is in a world of hurt, once you get to scale. This is what creates data swamps instead of data lakes. Data models are still as important as ever - snowflake, dimensional, multidimensional, etc. Pure raw data can be very useful for exploratory analysis, even machine learning models, but easy analysis for end consumers always requires some form of data modelling and architecture.

There is an opinion that we should throw data modelling to the wayside, and just feed raw data into AI and machine learning models - but I personally have never seen this as a reality yet.

I hope all the above might be useful!

[1] https://en.wikipedia.org/wiki/Online_analytical_processing

[2] http://kylin.apache.org/

[3] https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers