|
|
|
|
|
by shadowwolf007
1769 days ago
|
|
Yeah - I used to lead a department that would process somewhere around 10TB of CSV formatted data per day. The edge cases are a hassle but they don't become less of a hassle from a business perspective by switching to json or really any other format. We tried an experiment of using more json and eventually gave it up because it wasn't saving any time at a holistic level because the "data schema" conversations massively dominated the entirety of the development and testing time. Obviously being able to jam out some json helped quite a bit initially, but then on the QA side we started to run in to problems with tooling not really being designed to handle massive json files. Basically, when something was invalid (such as the first time we encountered an invalid quote) it was not enjoyable to figure out where that was in a 15GB file. That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really really sucks. But, to solve that, we would output to a more columnar storage format like Parquet or something. This would let us fully encode and manage the data how we wanted while letting our clients continue working their processes. What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all. |
|