Hacker News new | ask | show | jobs
by tlento 1530 days ago
Three things:

First, MetricFlow does not currently support MySQL. We launched with support for BigQuery, Redshift, and Snowflake. I have opened an issue to add support for MySQL (and similar issues for other SQL engines are coming): https://github.com/transform-data/metricflow/issues/27

Second, what we call a data source is more similar to a table in a database, rather than the underlying database service itself. Metricflow itself is useful when you're using a single SQL engine - indeed, that's all we support today - but it is most useful when you're in a world where joins are a thing. That said, if you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow. Suppose, for example, you had a big NoSQL style table filled with JSON objects. You might define a few data sources that normalize those JSON objects into top level elements (identifiers, dimensions, aggregated measures) using the sql_query data source config attribute, and then that'd allow you to support structured queries on the data consumption end while pushing unstructured blobs from your application layer. This will be slow at query time, and only as reliable as the level of discipline exerted in your application development workflow, but it's possible.

Third, if we did support MySQL you'd basically connect to it via standard connection parameters - we have a config file where you can store the required information and then we'll manage the connections for you. However, I'm not familiar with uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB. It's likely I just missed this, but at any rate I don't know how it is done. If they don't support standard MySQL client connections you'd need to write an adapter of some kind against whatever DB connection APIs they provide, in which case you'd likely need to roll a custom implementation of MetricFlow's SqlClient interface and initialize the MetricFlowEngine with that.

1 comments

Thanks for the response!

With data sources, I mean if you have multiple services/products delivering data (e.g. an analytics platform, a CRM, all hosted on different servers).

> uxwizz, and a quick perusal of their documentation did not turn up how one goes about connecting to the underlying DB

UXWizz is self-hosted, so you just have a basic MySQL/MariaDB database that you have full control over (so you can connect remotely with the host/db/username/password you create). This is the database structure: https://docs.uxwizz.com/guides/database-querying

> you have one big data table you might still find it useful to have declarative metric definitions defined in Metricflow

Oh, so this is like saving queries? Why would I write the MetricFlow config instead of saving the SQL query directly? I had look over https://docs.transform.co/docs/metricflow/guides/introductio... but I found the concepts and config file a bit hard to understand (but maybe I'm not in the target audience).

Ah, I see!

> With data sources, I mean if you have multiple services/products delivering data (e.g. an analytics platform, a CRM, all hosted on different servers).

MetricFlow does not support this today. The model we are working with is of an analytics team relying on a centralized data warehouse service - hence the initial support for Redshift/Snowflake/BigQuery instead of Postgres/MySQL.

Deriving data from multiple input services is actually very complicated, because at some point you need to solve the cross-service join (or union) problem. This requires us to either merge everything into a single service layer (which isn't really appropriate for MetricFlow, there are entire service packages dedicated to just this problem) or keep track of input data lineages throughout the metric model.

My recommendation, if you need this, is to get an ETL service to transfer data from these different data service layers into a unified warehouse and then use that as your MetricFlow input source. This will be cleaner and easier to manage for you in the long run, even though it adds a bit of cost up front.

> UXWizz is self-hosted, so you just have a basic MySQL/MariaDB database

Oh, nice, that was the bit I was missing. In this case we will support it as soon as someone adds support for a MySQL client, but you'll likely have to bypass all of the UXWizz bindings and connect to the MySQL instance directly.

> Oh, so this is like saving queries? Why would I write the MetricFlow config instead of saving the SQL query directly?

Yes, it is, and you probably wouldn't want to do this in MetricFlow.

MetricFlow's data source config allows you to specify a SQL query inline:

https://docs.transform.co/docs/metricflow/guides/best-practi...

The data source itself essentially gives MetricFlow a base "table"-like construct that we can query on behalf of the user. So you define measures (which are basically aggregations), dimensions (attributes used for grouping and filtering), and identifiers (which you can use to link to other dimensions).

Ideally this would all be stored in a table in your data service already, and you can just provide us with a pointer to the table identifier and use the measure/dimension/identifier elements to provide what amounts to a very simple view over the underlying SQL table. You'd do this less for its own sake and more because that is the basis of the consistent metric computation and simplified dimension access MetricFlow provides.

The SQL query construct is in place to allow you to do some lightweight manipulation of the input tables in MetricFlow, because sometimes people need to do a little bit of filtering or transformation and either can not or prefer not to do this at a lower layer. When placed inside of MetricFlow proper these tend to be extremely simple.

The example I gave about splitting a massive JSON blob table is quite extreme, and I should have been more clear about this - I would not recommend using MetricFlow to do it. That's better handled by something end users of the metric system never have to see in any way at all, whether it be handled by traditional ETL processing on ingestion from the telemetry system into the warehouse, or by a data mart definition layer like dbt, or by something in between like a stack of Airflow operators. If you're in MySQL and your volumes are small you could even use views and then reference the view in MetricFlow (at least in theory, we've never tried anything like this).