by gregw2 1065 days ago

The "diff" described in this article is very different from "diff" in Unix: it's a "fancy data science data comparison tool", not a "pedestrian data engineering tool" for comparing two tables.

If you want a SQL diff that works more like Unix diff, for example to validate that two huge tables are the same after some ETL process (no corrupted data, no duplicated rows, no mangled column values), you can use the following technique, which basically just uses ordinary GROUP BY/UNION ALL/HAVING SQL operators in a set-based way:

https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...
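For readers who can't follow the link, here is a rough sketch of the general GROUP BY/UNION ALL/HAVING idea, wrapped in Python's built-in sqlite3 so it runs as-is. It is not necessarily the query in the linked repo; the helper `diff_tables`, the tables `source_t` and `target_t`, and the sample rows are all made up for illustration. Each table's rows get tagged with a 1/0 marker, the two sets are stacked with UNION ALL, grouped on every column, and only groups whose per-table counts disagree are kept.

```python
import sqlite3


def diff_tables(conn, table_a, table_b, columns):
    """Return rows whose occurrence count differs between the two tables.

    Hypothetical helper for illustration. Table and column names are
    interpolated directly into the SQL, so they must be trusted identifiers.
    """
    cols = ", ".join(columns)
    query = f"""
        SELECT {cols}, SUM(in_a) AS count_a, SUM(in_b) AS count_b
        FROM (
            SELECT {cols}, 1 AS in_a, 0 AS in_b FROM {table_a}
            UNION ALL
            SELECT {cols}, 0 AS in_a, 1 AS in_b FROM {table_b}
        ) AS combined
        GROUP BY {cols}
        HAVING SUM(in_a) <> SUM(in_b)
        ORDER BY {cols}
    """
    return conn.execute(query).fetchall()


# Made-up sample data: one changed value and one duplicated row in target_t.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_t (id INTEGER, name TEXT, amount INTEGER);
    CREATE TABLE target_t (id INTEGER, name TEXT, amount INTEGER);
    INSERT INTO source_t VALUES (1, 'alice', 10), (2, 'bob', 20), (3, 'carol', 30);
    INSERT INTO target_t VALUES (1, 'alice', 10), (2, 'bob', 25),
                                (3, 'carol', 30), (3, 'carol', 30);
""")

rows = diff_tables(conn, "source_t", "target_t", ["id", "name", "amount"])
for row in rows:
    print(row)
```

An empty result means the two tables hold the same rows with the same multiplicities. A mangled value shows up as two lines (the old row with counts 1/0 and the new row with counts 0/1), and a duplicated row shows up as a single line with unequal counts. Since GROUP BY puts NULLs in the same group, NULL columns should not need special handling here.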

I have compared a pair of billion-row tables in under a minute on a columnar database with this technique.

This can be handy for regression testing algorithms that produce large datasets, or for testing certain types of data migrations.