Hacker News
by verhey 2130 days ago
How does Hubble compare to Great Expectations or dbt for pipeline testing? It looks like it puts more emphasis on automated profiling than on "having to write and maintain lots of individual tests", and obviously Hubble being a SaaS offering is the big difference?

Also, any plans to profile and test file-based stores as well? There's a lot that can go wrong in a pipeline before data even reaches BigQuery or Snowflake, and you could help your customers save money by profiling data in S3 before it goes through a potentially expensive transform process.

Best of luck, though! Data testing is a very real need in most data organizations I've been in, and I'm glad more and more tools seem to be popping up recently to help with it.

Thanks! We love dbt and take a lot of inspiration from their work. We’re putting a lot of effort into suggesting the right tests based on the data types, sources, and field names. A lot of these tests are pretty repetitive to write, so we want to make it easy to spin them up.

We’ve also found that keeping a history of the state of the warehouse over time is really useful context for determining whether a test has failed (example: this table tends to update every 30-40 minutes so we’ll set a threshold at an hour).
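To make the threshold example concrete, here is a minimal sketch of how a staleness threshold could be derived from a table's update history. It is not Hubble's actual logic: the function names, the "largest observed gap times a multiplier" rule, and the 1.5 multiplier are all assumptions chosen so that a 30-40 minute cadence yields a one-hour threshold.

```python
from datetime import datetime, timedelta


def suggest_threshold(update_times, multiplier=1.5):
    """Suggest a staleness threshold from observed update timestamps.

    Assumption for illustration: take the largest gap seen between
    consecutive updates and pad it by `multiplier`.
    """
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two update timestamps")
    return max(gaps) * multiplier


def is_stale(last_update, now, threshold):
    """True when the table has gone longer than `threshold` without an update."""
    return now - last_update > threshold


# Hypothetical history: updates arriving 30, 40 and 35 minutes apart.
base = datetime(2020, 1, 1)
history = [base + timedelta(minutes=m) for m in (0, 30, 70, 105)]
threshold = suggest_threshold(history)  # largest gap (40 min) * 1.5
```

With this history the suggested threshold is one hour, so a check run 45 minutes after the last update passes and one run 61 minutes after it fails.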

We also handle the scheduling, which is surprisingly annoying to manage (we built a couple of internal tools for this in the past). That’s something we really missed with Great Expectations (you get this with dbt Cloud). Testing files is an interesting use case. To an extent we support this using Athena or BigQuery external tables for JSON/CSV/Parquet. We’re intentionally limiting it to SQL for now.

Very interesting tool. I am trying to do this with Dataform/Looker, and I feel like some kind of inference like the one below would be great.

> this table tends to update every 30-40 minutes so we’ll set a threshold at an hour

Can you achieve these tests with metadata or do you need 100% read access to the database?

I also wonder if this would work as part of an analytics engineering CI/CD process? Something like how dbt Cloud will block pull requests that fail certain criteria.

Metadata is a valuable source of information such as load times and rows inserted/updated. Currently we just rely on read access and raw SQL. A common way users are doing this now (and we are internally for our analytics data) is using, for example, the Fivetran logs table to monitor ingestion times and inserted rows, rather than querying the raw tables.
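As a rough illustration of monitoring a log table instead of the raw tables, the sketch below pulls the most recent load per table with plain SQL. The `ingestion_log` table and its columns (`table_name`, `finished_at`, `rows_inserted`) are hypothetical and are not the real Fivetran log schema; an in-memory SQLite database stands in for the warehouse so the example is self-contained.

```python
import sqlite3


def latest_loads(conn):
    """Return {table_name: (last_finished_at, rows_in_that_load)}.

    Reads only the (hypothetical) ingestion_log table, never the raw tables.
    """
    query = """
        SELECT l.table_name, l.finished_at, l.rows_inserted
        FROM ingestion_log AS l
        JOIN (
            SELECT table_name, MAX(finished_at) AS finished_at
            FROM ingestion_log
            GROUP BY table_name
        ) AS latest
          ON latest.table_name = l.table_name
         AND latest.finished_at = l.finished_at
    """
    return {name: (ts, rows) for name, ts, rows in conn.execute(query)}


def make_demo_log():
    """Build an in-memory stand-in for a warehouse log table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ingestion_log "
        "(table_name TEXT, finished_at TEXT, rows_inserted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ingestion_log VALUES (?, ?, ?)",
        [
            ("orders", "2020-01-01T10:00:00", 100),
            ("orders", "2020-01-01T10:35:00", 120),
            ("users", "2020-01-01T09:50:00", 5),
        ],
    )
    return conn


loads = latest_loads(make_demo_log())
```

The timestamps returned here could then be compared against a per-table staleness threshold, and the row counts against an expected range.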

For CI/CD, absolutely we want to support this, as well as stopping/conditional execution in DAGs (e.g. Airflow). We’re launching webhooks very soon.
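For readers wondering what such a gate might look like on the consuming side, here is a small sketch of a CI step that turns test results into an exit code. It assumes nothing about Hubble's webhook payload: the `results` mapping, the check names, and the blocking/non-blocking flag are hypothetical.

```python
import sys


def gate(results):
    """Return a process exit code for a CI step.

    `results` maps a check name to (passed, blocking). Only failed checks
    marked as blocking cause a non-zero exit code.
    """
    failures = [
        name
        for name, (passed, blocking) in results.items()
        if blocking and not passed
    ]
    for name in failures:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical results: one blocking check passes, one advisory check fails.
example_results = {
    "orders_freshness": (True, True),
    "users_null_rate": (False, False),
}
exit_code = gate(example_results)
```

In a pipeline, the step would end with `sys.exit(gate(results))` so that a non-zero code blocks the pull request; the same return value could drive a conditional step in a DAG.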