by dietr1ch
683 days ago
I think we just need someone to get fed up and tackle the list of well-known problems with CSV. What we need is:
- A standard (yeah, link xkcd 927; it's mentioned often enough that I can recall its ID), to be announced **after** the rest of the pieces are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to do simple joins, ideally automatically based on the metadata, since most of the time there's a single reasonable way to join tables. This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
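To make the delimiter edge case from the tooling point concrete, here is a minimal sketch in Python. The sample data is made up for illustration; the naive split stands in for what a plain `cut -d,` style tool would do, and the second pass uses the standard library `csv` module, which understands quoting.

```python
import csv
import io

# Hypothetical sample: the "name" field contains a comma inside quotes.
raw = 'id,name,city\n1,"Doe, Jane",Berlin\n'

# Naive approach: split every line on commas, ignoring quoting rules.
naive_rows = [line.split(",") for line in raw.splitlines()]

# Quote-aware approach: let the csv module handle quoted fields.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows[1]))   # the quoted comma splits one field into two
print(parsed_rows[1])       # the quoted field stays in one piece
```

The naive version sees four fields in the data row where the header has three, which is the kind of breakage that is mostly tolerable until a quoted comma or embedded newline shows up.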
|
The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them
There's no need for new stuff. It would be redundant, as several formats already do these things, and adding more isn't helpful. The problem is that most of the stuff that supports CSV tends to support none of them, and retrofitting a lot of ancient systems with e.g. parquet support or whatever is mission impossible. CSV's principal feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.