Hacker News new | ask | show | jobs
by segah 2639 days ago
wow, impressive! how about symmetric aggregates (e.g. being able to do correct summation/aggregation on numeric values despite a one_to_many join)?
1 comments

Yes, look at this page - https://mprove.io/docs/blockml/fields/measure. You need to specify measure "type" and "sql_key" that will be used to avoid counting duplicates.
How do the underlying queries look for symmetric aggregates? Sadly SQL never supported the ability to compute an aggregate based on the unique values of another column.
Mprove does it the way the Looker did it before:

   CREATE TEMPORARY FUNCTION mprove_array_sum(ar ARRAY<STRING>) AS
  ((SELECT SUM(CAST(REGEXP_EXTRACT(val, '\\|\\|(\\-?\\d+(?:.\\d+)?)$') AS FLOAT64)) FROM UNNEST(ar) as val));
...

   SELECT COALESCE(mprove_array_sum(ARRAY_AGG(DISTINCT CONCAT(CONCAT(CAST(a.id AS STRING), '||'), CAST(a.population AS STRING)))), 0) as a_cohort_size
Recently, Looker began to do it differently, most likely to improve bigquery performance:

  COALESCE(ROUND(COALESCE(CAST( ( SUM(DISTINCT (CAST(ROUND(COALESCE(lesson_5_cohorts.population ,0)*(1/1000*1.0), 9) AS NUMERIC) + (cast(cast(concat('0x', substr(to_hex(md5(CAST(lesson_5_cohorts.id  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(lesson_5_cohorts.id  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001 )) - SUM(DISTINCT (cast(cast(concat('0x', substr(to_hex(md5(CAST(lesson_5_cohorts.id  AS STRING))), 1, 15)) as int64) as numeric) * 4294967296 + cast(cast(concat('0x', substr(to_hex(md5(CAST(lesson_5_cohorts.id  AS STRING))), 16, 8)) as int64) as numeric)) * 0.000000001) )  / (1/1000*1.0) AS FLOAT64), 0), 6), 0) AS lesson_5_cohorts_m_sum_distinct