Hacker News new | ask | show | jobs
by lmwnshn 1526 days ago
Would you be able to provide examples of interesting facts, and is this tool (or anything similar) open-source or described publicly? Thanks!
1 comments

Interesting facts about the tool, we created the parser with speed and efficiency in mind, it is able to process ~5M queries in less than an hour, since many of the SQL statements are associated with scheduled pipelines, I came up with a way of normalizing them and generating a signature we use to skip them if they have not changed, that saved a lot of processing when parsing.

We also created some other datasets that tell you how the tables are normally joined and another team created a ML model that now we use in an internal tool that automatically recommends the best join keys when you join 2 or more tables (since we have lots of historical info on that already stored).

We also had to create a very efficient table profiler to evaluate the candidates, because in some cases there are columns that are widely used in equi-where conditions, but their cardinality is very high, making them bad partition columns.

One guy created a parser that actually gets the most common values used to filter each column, I guess we could use that in the future to materialize some views; my original vision of the project was to continue with partial aggregations for common computations. The thing is there are several pieces of code that are pretty much copy/pasted and reused in many pipelines, so why not materialize those and rewrite the SQL of the subsequent pipelines to leverage the materialized version? huge savings there.

We are planning to present it in VLDB or a similar forum, there are some aspects of it that we need to 'clean' if we want to open source it.

Other parts of the system include the candidate evaluation and the module that computes the expected savings for the best candidate selected during evaluation; this system in particular has a lot of specific Presto and Spark logic that might need to get more general if we want to open source it.

Thanks for the detailed response, looking forward to the VLDB paper when it happens! Your vision sounds cool. I wonder if you could get most of the way there by exposing the workload (across different pipeline stages) to a materialized view recommender.

In the class project mentioned elsewhere, I found normalizing queries to be pretty slow in practice (naive standardized formatting + query templatization, tried various Python libraries, settled on pglast). I didn't think about trying "skip if fingerprint matches", which may help considerably. Fast normalization is nice! :)

> I wonder if you could get most of the way there by exposing the workload (across different pipeline stages) to a materialized view recommender

yes! that's something we are trying to do, since we have a way to create signatures for SQL statements and subqueries are just SQL statements then we can get all the "signatures" a query use and compare if other queries are using the same signatures. Then just sort those queries by number of times used and put some other perf metrics like IO/CPU needed to compute it and you get a good starting point.

Microsoft did something similar with Azure, using bipartite graphs, their solution was more advanced as they also baked in constraints like "the materialized view can't be more than X GB in size" but the end result is the same. (https://www.microsoft.com/en-us/research/uploads/prod/2018/0...)