by SAI_Peregrinus 815 days ago
The main issue is that "CSV" isn't one format with a single schema. It's one format with thousands of schemas and no way to communicate them. Every program picks its own schema for the CSVs it produces, and some even change the schema depending on various factors (e.g. the presence or absence of a header row).
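To make the header-row point concrete, here is a minimal Python sketch (the sample data and variable names are made up for illustration). The same text yields a different number of records depending on whether the reader assumes a header, and nothing in the file settles which assumption is right:

```python
import csv
import io

# Hypothetical sample data; nothing here marks the first line as a header.
data = "id,score\n1,90\n2,85\n"

# Reader that assumes there is no header: three records, all plain lists.
rows_plain = list(csv.reader(io.StringIO(data)))

# Reader that assumes the first line is a header: two records keyed by name.
rows_keyed = list(csv.DictReader(io.StringIO(data)))

print(len(rows_plain), rows_plain[0])
print(len(rows_keyed), rows_keyed[0])
```

Both calls succeed without complaint, so a wrong guess shows up only later as an off-by-one record or a bogus first row.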

RFC 4180 provides a (mostly) unambiguous format for writing CSVs, but because it discards the (implied) schema, it's useless for reading CSVs that come from other programs. RFC 4180 fields have only one type: text string in US-ASCII encoding. There are no dates, no decimal separators, no letters outside the US-ASCII alphabet: you get nothing! It leaves the option for the MIME type to specify a different text encoding, but that's not part of the resulting file, so it's only useful when downloading from the internet.
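A rough Python sketch of why the missing charset matters once the file is on disk (the sample row is invented): the same bytes decode cleanly under two different assumed encodings, and only one result is what the writer meant.

```python
import csv
import io

# Hypothetical file contents, written by a program that used UTF-8.
raw = "name,city\nZoë,Zürich\n".encode("utf-8")

# Two readers guessing different charsets; neither guess raises an error.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

rows_utf8 = list(csv.reader(io.StringIO(as_utf8)))
rows_latin1 = list(csv.reader(io.StringIO(as_latin1)))

print(rows_utf8[1])
print(rows_latin1[1])
```

The row and column counts match in both cases, so the parse looks fine structurally and the mojibake goes unnoticed unless someone inspects the values.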

1 comments

> RFC 4180 provides a (mostly) unambiguous format for writing CSVs,

What are the ambiguities in RFC 4180?

It allows non-ASCII text but does not provide any way to indicate charset within the file, instead requiring it out-of-band. Once the file is saved, the text encoding becomes ambiguous. Likewise for the presence or absence of a header row.

Likewise for whether double quotes (`"`) are allowed in fields (rule 5). This one gets even worse, since the following rule (6) uses double quotes to escape line breaks and commas, but double quotes may not be allowed at all, so commas in fields may not be escapable.
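As an illustration of the quoting problem, here is a small sketch using Python's standard `csv` module (the `write_row` helper and the sample row are hypothetical). With quoting enabled, the comma and the embedded quotes survive; with quoting turned off, the writer has no way to represent the field and refuses:

```python
import csv
import io

def write_row(row, **fmt):
    """Serialize one row with csv.writer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerow(row)
    return buf.getvalue()

row = ["Smith, Jane", 'said "hi"']

# Default dialect: fields are enclosed in quotes, embedded quotes are doubled.
quoted = write_row(row)
# quoted == '"Smith, Jane","said ""hi"""\r\n'

# A producer that never uses double quotes cannot express the comma at all.
unquoted_error = None
try:
    write_row(row, quoting=csv.QUOTE_NONE)
except csv.Error as exc:
    unquoted_error = str(exc)

print(quoted)
print(unquoted_error)
```

Python at least fails loudly here. A program that writes the unquoted field anyway produces a line with three columns instead of two.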

RFC 4180 supports only text, not numbers, dates, or any other kind of data, and provides no way to indicate any data type other than text.
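A quick sketch of what "text only" means in practice, again in Python with invented sample values: every field comes back as a string, and the reader has to guess the number and date conventions. Both guesses below parse without error and disagree.

```python
import csv
import io
from datetime import datetime

# Hypothetical row: is "1,234" a thousands separator or a decimal comma?
# Is 01/02/2024 the 2nd of January or the 1st of February?
data = 'amount,date\n"1,234",01/02/2024\n'
row = next(csv.DictReader(io.StringIO(data)))

# The parser hands back strings only; any typing is the reader's guess.
all_text = all(isinstance(value, str) for value in row.values())

as_thousands = float(row["amount"].replace(",", ""))   # comma as grouping
as_decimal = float(row["amount"].replace(",", "."))    # comma as decimal mark

month_first = datetime.strptime(row["date"], "%m/%d/%Y").date()
day_first = datetime.strptime(row["date"], "%d/%m/%Y").date()

print(all_text, as_thousands, as_decimal, month_first, day_first)
```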