Hacker News new | ask | show | jobs
by snidane 1464 days ago
What benefit does it give me over running table diff query in SQL?

  select * exclude (date_uploaded) from dev_table
  except
  select * exclude (date uploaded) from prod_table
or

  select *
  from (select * exclude (date_uploaded) from dev_table) dev
  natural full join
  (select * exclude (date_uploaded) from prod_table) prod
  where dev.date_uploaded is null
    or prod.date_uploaded is null

The only issue with the above is that EXCLUDE/EXCEPT is missing from standard SQL and even from market leaders like Snowflake, making this a massive pain in the ass. Second, natural joining in presence of null fields is going to produce a mess instead of something useful. Again - analytics db providers would rather boast about adhering to an ancient standard from the 1970s than listening to users and actually making SQL work after all those decades of pain.

Without the stupid default behavior of SQL, this wouldn't be a problem. I'm curious if Data Diff solves this or some other use case.

2 comments

data-diff has several advantages over running the query you wrote here.

- data-diff can compare tables across different databases. Your query is limited to one database.

- For very big tables, your 'select' will time-out. data-diff splits the diff into small segments, so we side-step this issue.

- data-diff supports running in threaded mode, which means it can finish a LOT faster. (especially for cloud databases.)

Snowflake doesn't support this kind of except, which is needed for the diff to work seamlessly:

  select * except (column1, column2, ...)