Hacker News
by dietr1ch 683 days ago
I think that we just need someone to get fed up and simply tackle the list of well-known problems of CSV.

What we need is,

  - A standard (yeah, link xkcd 927, it's mentioned enough that I can recall its ID) to be announced **after** the rest of things are ready.

  - Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.

  - Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).

The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the time there's a single reasonable way to join tables.
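
To make the "automatic joins from metadata" idea a bit more concrete, here is a rough sketch in plain Python. Everything in it is hypothetical: the `meta` layout, the `keys` field, and the helper names `infer_join_column` and `auto_join` are made up for illustration and don't correspond to any existing format or library. It only joins when exactly one column is shared by name and type and is declared as a key on at least one side.

```python
# Hypothetical metadata layout: each table declares its column types
# and which columns act as keys. This is an assumption, not a real format.
users = {
    "meta": {"columns": {"user_id": int, "name": str}, "keys": ["user_id"]},
    "rows": [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}],
}
orders = {
    "meta": {"columns": {"order_id": int, "user_id": int}, "keys": ["order_id"]},
    "rows": [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1}],
}


def infer_join_column(left, right):
    """Pick the join column if there is exactly one reasonable candidate."""
    left_cols = left["meta"]["columns"]
    right_cols = right["meta"]["columns"]
    declared_keys = set(left["meta"]["keys"]) | set(right["meta"]["keys"])
    # Candidate: same name and same type on both sides, and a declared key somewhere.
    shared = [
        name for name, col_type in left_cols.items()
        if right_cols.get(name) is col_type and name in declared_keys
    ]
    if len(shared) != 1:
        raise ValueError(f"ambiguous or missing join column: {shared}")
    return shared[0]


def auto_join(left, right):
    """Inner join on the inferred column, using a simple hash index."""
    col = infer_join_column(left, right)
    index = {}
    for row in right["rows"]:
        index.setdefault(row[col], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left["rows"]
        for right_row in index.get(left_row[col], [])
    ]


if __name__ == "__main__":
    for joined_row in auto_join(users, orders):
        print(joined_row)
```

Raising on ambiguity rather than guessing seems like the safer default here, since a wrong automatic join is much harder to notice than a refused one.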

This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.

3 comments

Most major languages have decent libraries, frameworks and tools for dealing with CSV. Those tend to have lots of tests for all the well-known issues and edge cases. Especially in the Python world, which is used for a lot of data processing, tooling is not really an issue. But most other languages also have decent frameworks. Most of that stuff covers the few standards that exist for this and the well-known variants of the format that are out there (quite a few), and can deal with the quirks of those.

The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
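
As a small illustration of the "hack something together" failure mode, here is a sketch comparing a hand-rolled comma split with Python's standard `csv` module. The sample row and the helper names `naive_parse` and `library_parse` are made up for the example.

```python
import csv
import io

# Made-up sample row: one quoted field contains a comma,
# another contains doubled (escaped) quotes.
line = '1,"Smith, John","said ""hi"""\n'


def naive_parse(text):
    # Hand-rolled approach: split on every comma, ignoring quoting rules.
    return text.rstrip("\n").split(",")


def library_parse(text):
    # The stdlib csv reader understands quoted fields and doubled quotes.
    return next(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    print(naive_parse(line))    # 4 fragments, with stray quote characters
    print(library_parse(line))  # 3 fields: ['1', 'Smith, John', 'said "hi"']
```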

> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them

There's no need for new stuff. It would be redundant, as there are several things already that do these things. Adding more isn't helpful. The problem is that most of the stuff that supports CSV tends to support none of those things, and fixing a lot of ancient systems to retrofit them with e.g. Parquet support or whatever is mission impossible. CSV's principal feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.

> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata since most of the times there's a single reasonable way to join tables.

Parquet fits the bill here. It's not perfect (there is no perfect file format), but it's a practical compromise as of today, at least for new systems where a columnar format is appropriate. There are some columnar formats that are better in some aspects (like ORC and some proprietary formats) but they're not as widely supported.

It's not that CSV/TSV is bad in every situation, but more that CSV/TSV is overused for things it shouldn't be used for. (CSV is good as a tabular format for simple applications, very bad as the storage format for data lakes or anything you want to query, questionable as a data exchange format, okay as a semi-structured format for structurally simple data -- many open data platforms offer it as a download format and it generally works).

To get a sense of how much variation a CSV reader needs to handle, we can take a look at the number of arguments there are in Pandas' read_csv. And it still fails on some CSVs! (I've had to preprocess CSVs before pd.read_csv would work)

https://pandas.pydata.org/pandas-docs/stable/reference/api/p...

CSV is not king, but it is popular. But popularity doesn't mean it's good for every use case. Optimizing for human readability and easy generation means trading off other very important characteristics (type safety, legibility across different tooling, random access performance, reliability/consistency).
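
A quick sketch of the type-safety point, using only Python's standard `csv` module. The sample values are made up; the point is just that every value comes back as a string after a write/read round trip, so the reader has to guess the types again.

```python
import csv
import io

# Made-up row with an int, a float, a missing value, a bool,
# and an identifier with leading zeros that must stay a string.
rows = [[7, 3.5, None, True, "00123"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)

buf.seek(0)
round_tripped = list(csv.reader(buf))

if __name__ == "__main__":
    print(round_tripped)
    # The original types are gone: None became an empty string and nothing
    # distinguishes the number 7 from the text "00123" any more.
    print([type(value).__name__ for value in round_tripped[0]])
```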

You can't do anything about legacy systems, but when designing a new system, you should really ask yourself: is CSV really the right choice?

(With DuckDB, the answer for me is increasingly no)

> Libraries to work with it in major languages. One in Rust

burntsushi is nine years ahead of you: https://crates.io/crates/csv

Yeah, I used it about 7-8 years ago. I liked the idea of chaining things, but it's very clear that csv has not been holding up well in the past decades.

Also, what I have in mind for file sharding needs maybe a standard on top of a record/column file format. The successor to CSV should be easy to process in parallel.
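
To show why plain CSV is awkward to process in parallel, here's a minimal sketch (made-up sample data, stdlib only). A quoted field can contain a newline, so a worker that starts reading at an arbitrary line break can't know whether it is at a record boundary without scanning from the start of the file.

```python
import csv
import io

# Made-up data: the second record has a newline inside a quoted field.
data = 'id,note\n1,"line one\nline two"\n2,plain\n'

# Naive sharding: treat every newline as a record boundary.
naive_chunks = data.splitlines()

# What the CSV quoting rules actually say the records are.
records = list(csv.reader(io.StringIO(data)))

if __name__ == "__main__":
    print(len(naive_chunks), naive_chunks)  # one record is cut in half
    print(len(records), records)
```

A format with explicit shard or row-group boundaries in its metadata avoids this, which is roughly what I mean by needing a standard on top of the record/column format.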