Hacker News
by craig_peacock 3191 days ago
This is an overly complicated data container format for not much advantage. To be honest, everything you can do with it can be done as well or better with SQLite, an actual database system. Having to implement four different parsers and validation functions spanning a mix of CSV, XML, and JSON just to access what is essentially a CSV file is not feasible.
1 comment

I agree that SQLite is amazing. The problem I had with some of the datapackage implementations (CSVLint) is that they stored validation errors in memory, which is a deal breaker for data sets larger than a few hundred MB, and they didn't work well when cross-validating data between multiple files. That's why I created ETLyte (https://github.com/sorrell/etlyte), which reads data into a SQLite DB, writes errors to the DB, and streams output to file/stdout.
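
To make the "errors go to the DB, not into memory" idea concrete, here is a rough Python sketch. It is not ETLyte's actual code; the table names (data, errors), the function name load_with_errors, and the integer-only check are all assumptions made up for illustration.

```python
import csv
import io
import sqlite3


def load_with_errors(conn, csv_text, int_columns):
    """Load CSV rows into SQLite and log validation errors to a table.

    Hypothetical helper for illustration only; not part of ETLyte.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    cols = reader.fieldnames
    col_defs = ", ".join(f'"{c}" TEXT' for c in cols)
    conn.execute(f"CREATE TABLE data ({col_defs})")
    conn.execute(
        "CREATE TABLE errors (line INTEGER, column_name TEXT, message TEXT)"
    )
    placeholders = ", ".join("?" for _ in cols)
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        conn.execute(
            f"INSERT INTO data VALUES ({placeholders})",
            [row[c] for c in cols],
        )
        for c in int_columns:
            try:
                int(row[c])
            except (TypeError, ValueError):
                # The error is written straight to the DB, so nothing
                # accumulates in a Python list as the file grows.
                conn.execute(
                    "INSERT INTO errors VALUES (?, ?, ?)",
                    (line_no, c, f"not an integer: {row[c]!r}"),
                )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_with_errors(conn, "id,name\n1,alice\nx,bob\n3,carol\n", ["id"])
errors = conn.execute(
    "SELECT line, column_name, message FROM errors"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
```

Because both the data and the errors live in tables, cross-file checks can be written as ordinary SQL joins instead of being held in process memory.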

I disagree that there is "not much advantage" in the format, though. I use much of the "resources" area of the data container format and find it tremendously helpful for validating the expected datatypes (remember, SQLite does not strictly enforce column datatypes), defining expected values, and defining some of the "ETL" functionality in ETLyte, like derived columns.
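
A small Python sketch of why a schema outside the database helps. The first part shows SQLite accepting text in an INTEGER column of an ordinary (non-STRICT) table. The second part checks rows against a field list modeled loosely on a data package "resources" schema; the exact schema shape, the validate_row helper, and the field names are assumptions for illustration, not ETLyte's implementation.

```python
import sqlite3

# Part 1: an ordinary SQLite table stores text in an INTEGER column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER)")
conn.execute("INSERT INTO people VALUES (?)", ("not a number",))
stored = conn.execute("SELECT age, typeof(age) FROM people").fetchone()

# Part 2: a hypothetical schema describing expected types and values.
schema = {
    "fields": [
        {"name": "age", "type": "integer"},
        {
            "name": "status",
            "type": "string",
            "constraints": {"enum": ["active", "inactive"]},
        },
    ]
}


def validate_row(row, schema):
    """Return a list of error strings for one row (illustrative helper)."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        value = row.get(name)
        if field["type"] == "integer":
            try:
                int(value)
            except (TypeError, ValueError):
                errors.append(f"{name}: expected integer, got {value!r}")
        allowed = field.get("constraints", {}).get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} not in {allowed}")
    return errors


good = validate_row({"age": "41", "status": "active"}, schema)
bad = validate_row({"age": "forty", "status": "pending"}, schema)
```

Here the database happily keeps 'not a number' as text, while the schema check flags both the non-integer age and the unexpected status value.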

Also on the horizon is a fuzzing tool I'm creating to help exercise the boundaries and variations of data that an ETL process can expect, which wouldn't be possible without a data container format. So again, I think there are very good use cases for it that we haven't even tapped into yet.