Hacker News new | ask | show | jobs
by markus_zhang 2101 days ago
Thanks! Yeah you are right, Vertica is good at OLAP and DWH. I'm just torn between wide table and dimensional data modelling and it's Vertica's problem. I'm probably attacking the wrong monster here.

I'm wondering if you have any source for OLAP data modelling under big data and columnar database?

1 comments

IMHO the Wikipedia definition of OLAP [1] covers the general topic quite well. Doing OLAP in the big data world, if you use the sense of OLAP equating to multidimensional analysis of data aggregated into cubes, there are several approaches to it in the open source world, such as Apache Kylin [2].

The basic idea remains the same though - building dimensional data models, and then using OLAP/cube technology with MDX on top.

That said, if I think about it, my experiences with wide columnar databases were with HBase and Cassandra. The main goals when using those systems were usually three things: A) getting many attributes/variables being tracked in a single row as part of a measurement or observation of some event - and needing to write the data ridiculously fast (append only) and report on that data in real time; B) coming up with creative data processing patterns to aggregate data into higher level tabular structures, to make for blazing fast queries on these aggregates, which could be updating in real time, and C) being able to store huge volumes of data on a redundant cluster in an efficient format.

If you look at A, B and C above, these are, in a sense, the same reasons why people wish to implement OLAP cube technology, for the most part - aggregating huge results into aggregated views, lightning fast query responses and easy of navigating - "slicing and dicing" the data. However, OLAP cubes are most commonly associated with batch processing - there are real-time OLAP cube technologies, that basically work by doing change data detection on the sources.

I kind of believe that, you don't need specialized OLAP technology to accomplish the same thing - it's all about pre-aggregated and calculated data, efficient writing/reading and storage. I mean, OLAP technology is there to try to make that easier to do, but you can do the same thing with certain tricks in traditional dimensional models (accumulating snapshots, periodic snapshots, aggregate fact tables). Or in wide columnar databases (storing aggregates in wide tables). It's just maybe easier to use OLAP technology, otherwise, you kind of need to do all the scheduling, processing and aggregating of data yourself - but it's totally doable too and I have seen that work really well.

I personally have never been a fan of OLAP cube technology (MDX), even for how powerful it is, because it was so specialized - but that opinion was formed from a legacy of years of only being able to get these features from commercial vendors.

If you look at the legacy of SQL as a language, I actually believe that there was an attempt to unify OLAP queries with SQL by implementing "CUBE" and "ROLLUP" as part of ANSI SQL standard, but it was never as powerful as MDX and the OLAP engines that were built.

I think today, the world has changed, as we have open source OLAP and MDX options available - Pentaho Mondrian was popular for a while, not sure how much today. The current hot thing seems to be Apache Kylin. I also played with Apache Druid, in the past, although it doesn't seem as popular any more, as far as I can tell. Yandex Clickhouse is very popular also these days, for being extremely fast and open source. Again, Wikipedia has a good list of OLAP systems [3].

More to the point, if you use an event-based streaming approach (things like Kafka) - where the event comes directly from the source, combined with writing the data efficiently to something like an append-only, wide, column data store.. well, you have something pretty powerful. The only thing that a solution like this lacks, sometimes, is a powerful query engine - for example, Cassandra or HBase cannot do joins. There are ways to simulate joins with Apache Hive or other technologies, sitting on top of Cassandra or HBase.. but not sure I ever saw those work well.

I mean, the goal of aggregation is to avoid needing joins.. but, you'd be surprised how once in a while, you really wish you could just join this with that to come up with a new result - even in things like Cassandra or HBase. If you do your design right, you can avoid it.. but I find myself wanting it sometimes.

Really, in OLAP, everything is built to avoid JOINs or minimize them as much as possible. It's one reason, I believe though, that having JOINs in a dimensional model is still helpful. But maybe that's the old geezer who still likes SQL in me talking :)

Lastly, I thought I might mention - anyone who says that leaving all data without any kind of model structure or architecture (ex. flat or raw data only) is in a world of hurt, once you get to scale. This is what creates data swamps instead of data lakes. Data models are still as important as ever - snowflake, dimensional, multidimensional, etc. Pure raw data can be very useful for exploratory analysis, even machine learning models, but easy analysis for end consumers always requires some form of data modelling and architecture.

There is an opinion that we should throw data modelling to the wayside, and just feed raw data into AI and machine learning models - but I personally have never seen this as a reality yet.

I hope all the above might be useful!

[1] https://en.wikipedia.org/wiki/Online_analytical_processing

[2] http://kylin.apache.org/

[3] https://en.wikipedia.org/wiki/Comparison_of_OLAP_servers