by shadowwolf007 1769 days ago

Yeah - I used to lead a department that would process somewhere around 10TB of CSV formatted data per day.

The edge cases are a hassle, but they don't become less of a hassle from a business perspective by switching to JSON or really any other format. We experimented with using more JSON and eventually gave it up: it wasn't saving any time overall, because the "data schema" conversations massively dominated the development and testing time.

Obviously being able to jam out some JSON helped quite a bit initially, but then on the QA side we started to run into problems with tooling not really being designed to handle massive JSON files. Basically, when something was invalid (such as the first time we encountered an invalid quote), it was not enjoyable to figure out where that was in a 15GB file.
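To illustrate the "where is it in a 15GB file" problem, here is a minimal sketch that streams a file and reports the first record that fails to parse. It assumes the data is newline-delimited JSON (one value per line), which the comment does not state; a single huge JSON document would need an incremental parser instead. `first_invalid_record` is a hypothetical helper name.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def first_invalid_record(stream: BinaryIO) -> Optional[Tuple[int, int, str]]:
    """Return (line_number, byte_offset, error) for the first bad line, or None.

    Assumption: newline-delimited JSON opened in binary mode, so the
    offset is a true byte position that can be passed to seek().
    """
    offset = 0  # byte offset of the start of the current line
    for line_number, raw in enumerate(stream, start=1):
        if raw.strip():  # ignore blank lines
            try:
                json.loads(raw)
            except ValueError as exc:
                # JSONDecodeError and UnicodeDecodeError are both ValueErrors
                return line_number, offset, str(exc)
        offset += len(raw)
    return None


if __name__ == "__main__":
    # In-memory stand-in for a large file; the second line has an unterminated string.
    sample = io.BytesIO(b'{"a": 1}\n{"a": "oops}\n{"a": 3}\n')
    print(first_invalid_record(sample))
```

Because only one line is held at a time, memory use stays flat regardless of file size, and the returned byte offset lets you jump straight to the bad record.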

That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really really sucks. But, to solve that, we would output to a more columnar storage format like Parquet or something. This would let us fully encode and manage the data how we wanted while letting our clients continue working their processes.

What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all.

3 comments

anigbrowl 1768 days ago

Really appreciate the insight from you and the GP here. I have been struggling with data format decisions around a personal project that will only be used by a few people, unsure whether I should try to make it bulletproof (but harder to maintain and modify) or just keep it simple (but primitive). It's helpful to see an experienced professional perspective showing that you can fall into a tooling rabbit hole at any scale.

breck 1768 days ago

> "data schema" conversations massively dominated the entirety of the development and testing time.

Agreed. JSON lets me know something is a number. That's great, but I still have to check for min/max, zero, etc. A string? That's great, but I've got to check it against a set of enums, and so forth. Basically, the "types" JSON gives you are about 20% of the work, and you're going to have to parse things into your own types anyway.
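As a rough illustration of that remaining work, here is a sketch of parsing a JSON record into a domain type with a range check and an enum check. The `Order` and `Status` types, the field names, and the 1 to 10,000 range are all made up for the example.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical set of allowed values
    ACTIVE = "active"
    CLOSED = "closed"


@dataclass(frozen=True)
class Order:  # hypothetical domain type
    quantity: int
    status: Status


def parse_order(text: str) -> Order:
    """Parse JSON text, then apply the checks JSON's own types cannot express."""
    raw = json.loads(text)
    quantity = raw["quantity"]
    # JSON only says "number"; bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValueError(f"quantity must be an integer, got {quantity!r}")
    if not 1 <= quantity <= 10_000:  # assumed business range
        raise ValueError(f"quantity out of range: {quantity}")
    try:
        status = Status(raw["status"])  # JSON only says "string"
    except ValueError:
        raise ValueError(f"unknown status: {raw['status']!r}") from None
    return Order(quantity, status)


if __name__ == "__main__":
    print(parse_order('{"quantity": 3, "status": "active"}'))
```

Both bad inputs below are perfectly valid JSON; only the domain checks reject them: `{"quantity": 0, "status": "active"}` and `{"quantity": 3, "status": "pending"}`.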

> What I would really like to see is a file format where the validity of the file could be established by only using the header.

Are you saying something like a checksum so not only is a schema provided but some method to verify that the data obeys the schema?
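One way to picture the checksum idea is a sketch like the following: the writer validates every row against a schema, then records the schema and a SHA-256 digest of the body in a one-line header. The layout, the `write_sealed` and `verify_sealed` names, and the tiny type vocabulary are all assumptions for illustration, not an existing format. The reader still has to hash every byte, but it does not have to parse or type-check any value, and the guarantee only holds if the writer is trusted.

```python
import hashlib
import json

# Hypothetical type vocabulary: type name -> callable that raises ValueError on bad input.
CASTS = {"int": int, "str": str}


def write_sealed(rows, schema) -> bytes:
    """Validate rows against schema, then emit a header line plus a CSV-like body.

    Assumption: values contain no commas or newlines, so no quoting is needed.
    """
    lines = []
    for row in rows:
        if len(row) != len(schema):
            raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
        for value, type_name in zip(row, schema.values()):
            CASTS[type_name](value)  # raises ValueError if the value does not fit
        lines.append(",".join(row))
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = json.dumps({"schema": schema, "sha256": hashlib.sha256(body).hexdigest()})
    return header.encode("utf-8") + b"\n" + body


def verify_sealed(blob: bytes) -> dict:
    """Return the schema if the body matches the digest stored in the header."""
    header_line, _, body = blob.partition(b"\n")
    header = json.loads(header_line)
    if hashlib.sha256(body).hexdigest() != header["sha256"]:
        raise ValueError("body does not match the digest in the header")
    return header["schema"]


if __name__ == "__main__":
    blob = write_sealed([["1", "ann"], ["2", "bob"]], {"id": "int", "name": "str"})
    print(verify_sealed(blob))
```

Note that a failed check can only report that the body changed, not which value is wrong or why.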

If you're talking about just some stronger shared ontology, I think that's a direction things will go. I call this concept "Type the world" or "World Wide Types". I'm starting to think something like GPT-N will be the primary author, rather than a committee of humans like Schema.org.

shadowwolf007 1768 days ago

Honestly with the schema thing I'd probably be fine with either/or!

A checksum would be crude and user-hostile, only being able to say "you did it wrong" but not really good at telling you what it means to do it right.

If I understand the concepts correctly then it seems like a shared ontology could potentially solve the problem in a non-hostile way.

Plus, it makes me happy because I feel like types are a real-world problem, so it is always nice if the type system could enforce that real-world-ness and all the messiness that comes along for the ride.

radus 1768 days ago

Would DuckDB (https://duckdb.org/) work as your file format with enforced column types?

shadowwolf007 1768 days ago

We looked at it, but because it runs in-process, it would have forced us to put VMs in places we just weren't super comfortable with.

More a byproduct of decisions made 5 - 7 years ago when the company was in raw startup mode versus a more mature roadmap.
