


	by gwerbin 56 days ago
	There is still no good "data diff" tool that I can run on, say, a big pile of CSV or Parquet. Something with DVC integration would be especially welcome.
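To make concrete what a "data diff" might report, here is a minimal sketch using only the Python standard library. It assumes every row carries a unique key column (`id` here) and that both files fit in memory; the helper names `load_rows` and `diff_rows` and the sample data are hypothetical, and nothing here is tied to DVC or to any existing tool. Parquet input would first need a reader library to produce the same row dictionaries.

```python
import csv
import io


def load_rows(csv_text, key):
    """Parse CSV text into a dict mapping each row's key value to the row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Compare two keyed row dicts and return (added, removed, changed)."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # Collect only the columns whose values differ, as (old, new) pairs.
        cols = {
            c: (old[k].get(c), new[k].get(c))
            for c in sorted(old[k].keys() | new[k].keys())
            if old[k].get(c) != new[k].get(c)
        }
        if cols:
            changed[k] = cols
    return added, removed, changed


# Hypothetical sample data: two versions of the same small table.
OLD = "id,name,qty\n1,apple,3\n2,pear,5\n3,plum,7\n"
NEW = "id,name,qty\n1,apple,4\n3,plum,7\n4,fig,1\n"

added, removed, changed = diff_rows(load_rows(OLD, "id"), load_rows(NEW, "id"))
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

All values are compared as strings, so `3` and `3.0` would count as a change; a real tool would probably need type-aware comparison and a strategy for tables without a natural key.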


appplication 56 days ago

I would imagine it's because, at the scales where most folks use Parquet files, you're generally no longer thinking in terms of individual diffs to your data (and Parquet also implies some level of batch processing, vs. e.g. a DB).

We have some custom data diff tools at my ultracorp that provide a browsable interface, but the customers tend to be operations folks rather than engineers, DS, etc., who would be more familiar with actual version control concepts. Those tools work against the data store, though, not against something like CSV or Parquet.

gwerbin 55 days ago

Sorta? Maybe I'm weird. I tend to use Parquet files inside my project instead of reading directly from and writing directly to our data warehouse. That lets me cut out a lot of overhead spent on just waiting for data to flow over the network, and also as a side benefit lets me track everything with DVC, which itself has a lot of benefits like being able to summon all project data with `dvc pull`.

I consider that a completely distinct use case from, say, Iceberg tables in S3.