by curragh 3185 days ago
I agree that SQLite is amazing. The problem I had with some of the datapackage implementations (CSVLint) is that they stored validation errors in memory, which is a deal breaker for data sets larger than a few hundred MB, and they didn't work well when cross-validating data between multiple files. That's why I created ETLyte (https://github.com/sorrell/etlyte), which reads data into a SQLite DB, writes errors to the DB, and streams output to file/stdout.
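
For illustration, here is a rough Python sketch of that load/validate/stream pattern. It is not ETLyte's actual code: the table names, the `id`/`amount` columns, and the numeric check are all assumptions made up for the example.

```python
import csv
import io
import sqlite3
import sys


def load_and_validate(csv_text, out=sys.stdout):
    """Load rows into SQLite, log bad rows to an errors table, stream good rows."""
    # ":memory:" keeps the example self-contained; a file path would keep
    # both the data and the error log on disk for large inputs.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (line INTEGER, id TEXT, amount TEXT)")
    conn.execute("CREATE TABLE errors (line INTEGER, message TEXT)")

    # Line 1 is the header, so data rows start at line 2.
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?)",
        ((n, row["id"], row["amount"]) for n, row in enumerate(reader, start=2)),
    )

    # Validation runs as SQL, so errors land in the DB, not in a Python list.
    # Hypothetical rule: amount must be non-empty and contain only digits and dots.
    conn.execute(
        """
        INSERT INTO errors
        SELECT line, 'amount is not numeric: ' || amount
        FROM staging
        WHERE amount = '' OR amount GLOB '*[^0-9.]*'
        """
    )

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "amount"])
    # Iterating the cursor hands rows over one at a time.
    for row in conn.execute(
        "SELECT id, amount FROM staging "
        "WHERE line NOT IN (SELECT line FROM errors) ORDER BY line"
    ):
        writer.writerow(row)
    return conn
```

Calling `load_and_validate(text)` prints the clean rows to stdout, and the returned connection can be queried afterwards with `SELECT * FROM errors`. Because every file ends up as a table in the same database, a cross-file check becomes an ordinary join.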

I disagree that there is "not much advantage" in the format though. I use much of the "resources" area of the data container format and find it tremendously helpful for validating the expected datatypes (remember, SQLite does not enforce column datatypes by default; a declared type is only an affinity), defining expected values, and defining some of the "ETL" functionality in ETLyte, like derived columns.
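
To make the datatype point concrete, here is a small Python sketch. The schema dict only loosely imitates a "resources" entry, and its field names and the `find_violations` helper are hypothetical, not ETLyte's API. It shows SQLite accepting a string in an INTEGER column, and a schema-driven check catching it.

```python
import sqlite3

# Hypothetical schema in the spirit of a data package "resources" entry.
SCHEMA = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "status", "type": "string", "enum": ["open", "closed"]},
    ]
}

# Map declared types to the storage classes reported by SQLite's typeof().
EXPECTED_TYPEOF = {"integer": "integer", "number": "real", "string": "text"}


def find_violations(conn, table, schema):
    """Return (rowid, field, problem) tuples for values that break the schema."""
    violations = []
    for field in schema["fields"]:
        name = field["name"]
        expected = EXPECTED_TYPEOF[field["type"]]
        # Type check: compare the stored storage class with the declared type.
        query = (
            f'SELECT rowid, typeof("{name}") FROM "{table}" '
            f'WHERE typeof("{name}") != ?'
        )
        for rowid, actual in conn.execute(query, (expected,)):
            violations.append((rowid, name, f"expected {expected}, got {actual}"))
        # Expected-values check, only for fields that declare an enum.
        if "enum" in field:
            marks = ",".join("?" * len(field["enum"]))
            query = (
                f'SELECT rowid, "{name}" FROM "{table}" '
                f'WHERE "{name}" NOT IN ({marks})'
            )
            for rowid, value in conn.execute(query, field["enum"]):
                violations.append((rowid, name, f"unexpected value {value!r}"))
    return violations


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
# SQLite stores the string in the INTEGER column without complaint.
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(1, "open"), ("not-a-number", "closed"), (3, "pending")],
)
```

Running `find_violations(conn, "tickets", SCHEMA)` flags the second row for its type and the third for its value, which is the kind of check the schema metadata makes possible and the database alone does not.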

Also on the horizon is a fuzzing tool I'm creating to help exercise the boundaries and variations of data that an ETL process can expect, and this wouldn't be possible without a data container format. So again, I think there are very good use cases for it that we haven't even tapped into yet.