Hacker News new | ask | show | jobs
by pstoll 1539 days ago
Cool. Like open source Looker.

We adopted Looker at $previous_job. Then they got bought by Google, which was great for us as we were becoming a big GCP customer. I strongly encouraged google / looker team to at least open source their LookMl (looker modeling language - equivalent to MQL). They couldn’t figure it out.

This type of metric definition is so empowering for businesses. Not enough engineers grok why this is useful.

2 comments

Interested in why it's useful too. If the same SQL is used in multiple places and cause confusion, isn't it an organizational problem instead of a technical problem? For instance, what if two teams create two different and conflicting metric definitions to answer the same question? It's like turtles all the way down and how could we prevent diversion of query definitions even if we have a perfect metric definition system?
I agree this is an organizational problem, but a lot of the tools we use every day - from bug trackers to distributed version control to collaborative document authoring tools - are technical solutions to organizational problems. One key organizational problem data people routinely deal with is exactly this issue of multiple metric definitions.

Imagine a company that wants to know how much time its users spend on its application, and wants that broken down by country in order to attract investors or something. They probably don’t have one metric for time spent on the application. They have like 12, all of them are different, and although all of them are wrong they each work sort of ok for a given context. To make matters worse, those 12 metrics are defined in different places, computed in different ways, and generated from different data sources. Maybe 3 of them are documented somewhere in the company wiki while the details of the rest live in the heads of 9 current or former employees who are off doing.... something.

That is a mess. Investors don’t care about all that nuance on use cases for time spent measures. Neither does anybody talking to the investors. What they don’t want to hear, when they ask for this simple metric, is “well..... what are you planning to use this for?” On the other hand, things get even worse if somebody grabs the wrong time spent metric and presents that publicly without realizing one of the other 11 was the one they really needed.

It would be better if our hypothetical company had its people define and document a set of consistent metrics and ensure they’re always computed the same way. By centralizing those definitions into a repository (ideally managed by source control) everybody can share the understanding of what a given metric represents. More importantly, they know where to look for pre-existing metric definitions. If this hypothetical company really needs 12 different time spent metrics (they don’t, but bear with me here) then each of those metrics can be named and defined in a way that explains what it’s good for, and they can simply point the makers of their investment pitch at the “baseline_user_time_spent” metric and have done with it.

For any of that to even be possible you need, at a minimum, a place to store metric definitions and a metric layer like MetricFlow that understands those metric definitions, translates them into the relevant queries, and executes them with the required filters and grouping attributes.

Totally agree with this ^
Could you elaborate on why it's useful. I don't get it but I'm intrigued
Traditionally, querying business and product metrics for data analysis has required lots of ad-hoc sql queries. Often they are encoded in dashboards, in ETL pipelines, and others are copied directly. Semantic layers act as a single source of truth that encapsulates all of that logic. Having a single layer responsible for querying these datasets enables powerful workflows:

- It enables self-serve analytics experiences because it creates objects that business people can interact within pivot-table-like forms. Hundreds of lines of SQL are distilled to simple queries.

- The logic is DRYer / easier to govern than all the repetitive SQL rollups that are required to answer business questions.

- Analysts become more productive write the same stuff less and ask more questions faster.

The idea of limiting duplicated logic is very well understood in the software engineering community and desired in the analytics community but we’re still in the early days. In practice, this is really hard in SQL and the tools we have are too limited.

More specifically, the reason why I get excited about MetricFlow

- We basically built a generalized SQL constructor. It will be able to build performant and legible SQL for complicated requests (things that data engineers describe in hundreds of lines) through simple and consistent query interfaces.

- The way we encapsulate logic requires much fewer lines of yaml/code than most other frameworks and we can do much more with those lines. LookML and previous versions we worked on at Airbnb became quite unruly because of the choices in the abstractions.

- The metric abstraction is flexible and allows us to calculate complicated metrics with only a few lines of yaml. That means we can define metrics like conversion metrics that might take joining two data sources together, deduplicating, filtering to a conversion window, etc. in a single way with a few parameters that reference existing objects.

Well said!

We need more of this problem space exposed to engineers and not just for “analysts”.

I’ll share a couple other articles from a company that does a nice job explaining the technical problems in what is traditionally “business analytics”.

The space is OLAP and you may have scoffed at the idea of “OLAP cubes”, but man were they useful. In the way that excel powers a ton of business processes, cubes powered a lot of analytics. Underlying tech is cool but they are showing their age: https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-...

Another write up of this idea of a semantic layer above raw sql statement: https://www.holistics.io/blog/holistics-data-modeling-explai...

So this “semantic layer” leverages the latest tech to deliver the same business insights faster, better, more flexibly. Ie once you define this semantic layer over your data (ie how all your sql tables are connected), the semantic engine knows how to query up and down your data model, writing the SQL queries for you, on the fly. You can ask and answer new questions without writing new queries. And with modern columnar query engines (eg big query, spark, presto, etc), perf is usually pretty good.

And for completeness, here’s another company that I also used at $previous_job that provides a “semantic model” offering. This write up also helps describe where it fits in.

(This one has just enough content vs marketing for me not to feel embarrassed posting here on HN for people who want to find out more. And IMO the BI landscape is littered with pablum from an engineers POV, often obscuring the nature of the technical problems to be solved in the space - which are very cool.)

https://www.atscale.com/blog/what-is-a-universal-semantic-la...