by oliver101 2133 days ago
Thanks! We love dbt and take a lot of inspiration from their work. We’re putting a lot of effort into suggesting the right tests based on the data types, sources, and field names. A lot of these tests are pretty repetitive to write, so we want to make it easy to spin them up.

We’ve also found that keeping a history of the state of the warehouse over time is really useful context for determining whether a test has failed (example: this table tends to update every 30-40 minutes so we’ll set a threshold at an hour).
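To make the threshold idea concrete, here is a minimal sketch of how a freshness threshold could be derived from a table's update history. It is not how the product actually does it; the function names and the 1.5x safety margin are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def suggest_freshness_threshold(update_times, margin=1.5):
    """Suggest a staleness threshold from observed update timestamps.

    Takes the largest gap seen between consecutive updates and pads it
    by `margin` (a hypothetical safety factor, not a recommended value).
    """
    ordered = sorted(update_times)
    if len(ordered) < 2:
        raise ValueError("need at least two update timestamps")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps) * margin


def is_stale(last_update, now, threshold):
    """Return True when the table has gone longer than `threshold` without an update."""
    return now - last_update > threshold
```

With updates observed 30, 40, and 35 minutes apart, the largest gap is 40 minutes, so the suggested threshold comes out at one hour, which matches the example above.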

We also handle the scheduling, which is surprisingly annoying to manage (we built a couple of internal tools for this in the past). That’s something we really missed with Great Expectations (you get this with dbt Cloud). Testing files is an interesting use case. To an extent we support this using Athena or BigQuery external tables for JSON/CSV/Parquet. We’re intentionally limiting it to SQL for now.

1 comment

Very interesting tool. I am trying to do this with Dataform/Looker, and I feel some kind of inference like the one below would be great.

> this table tends to update every 30-40 minutes so we’ll set a threshold at an hour

Can you achieve these tests with metadata or do you need 100% read access to the database?

I also wonder if this would work as part of an analytics engineering CI/CD process, something like how dbt Cloud blocks pull requests that fail certain criteria.

Metadata is a valuable source of information such as load times and rows inserted or updated. Currently we rely only on read access and raw SQL. A common approach users take now (and one we use internally for our analytics data) is to monitor ingestion times and inserted rows through, for example, the Fivetran logs table rather than querying the raw tables.
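As a rough illustration of the log-table approach, the sketch below summarises the latest load time and total inserted rows per table with plain SQL. The `ingestion_log` table and its columns are hypothetical stand-ins, not the real Fivetran log schema, and an in-memory SQLite database takes the place of the warehouse.

```python
import sqlite3


def build_demo_log():
    """Create an in-memory database with a hypothetical ingestion log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ingestion_log ("
        "table_name TEXT, loaded_at TEXT, rows_inserted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ingestion_log VALUES (?, ?, ?)",
        [
            ("orders", "2020-01-01T00:00:00", 100),
            ("orders", "2020-01-01T00:40:00", 120),
            ("users", "2020-01-01T00:10:00", 5),
        ],
    )
    return conn


def latest_loads(conn):
    """Return {table_name: (last_loaded_at, total_rows_inserted)}.

    Only the log table is read, never the raw tables themselves.
    ISO 8601 strings sort chronologically, so MAX() picks the latest load.
    """
    rows = conn.execute(
        "SELECT table_name, MAX(loaded_at), SUM(rows_inserted) "
        "FROM ingestion_log GROUP BY table_name"
    ).fetchall()
    return {name: (last, total) for name, last, total in rows}
```

The per-table last load time could then be compared against a freshness threshold, and the row counts against whatever volume is expected.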

For CI/CD, absolutely: we want to support this, as well as stopping or conditional execution in DAGs (e.g. Airflow). We’re launching webhooks very soon.
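For the CI/CD case, one generic way to block a pull request is to turn test results into a process exit code, since CI systems treat a non-zero exit as a failed step. This is only a sketch under assumed names: the result dictionaries, the `severity` field, and `ci_exit_code` are all hypothetical and not part of any announced API.

```python
def ci_exit_code(results, blocking=("error",)):
    """Map data test results to a process exit code for a CI step.

    `results` is a list of dicts with hypothetical keys:
    "name", "passed" (bool), and "severity" ("error" or "warn").
    Returns 1 if any failed test has a blocking severity, otherwise 0.
    """
    blocking_failures = [
        r for r in results
        if not r["passed"] and r["severity"] in blocking
    ]
    for failure in blocking_failures:
        # Print each blocking failure so it shows up in the CI log.
        print(f"FAILED: {failure['name']}")
    return 1 if blocking_failures else 0
```

A CI job would call `sys.exit(ci_exit_code(results))` as its last step, and the same check could gate a downstream task in a DAG.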