by usrbinbash 667 days ago
> CSVs are kinda bad.

Not really.

What's bad is when people keep insisting on coming up with new and amazing CSV dialects.

https://www.ietf.org/rfc/rfc4180.txt is very clear about what CSV files are supposed to look like, and the fact that people keep ignoring this, for whatever reason, is not the format's problem.

And no, "using another format" is not a solution to this. Because: I can just invent a new DSV dialect. Or a JSON dialect. Or a dialect where the field separator is "0xFF00FF00" and the row separator is the string `DECAFCOFFEE` encoded in EBCDIC, all other characters have to be UTF-32, except for a, b and f, which also need to be EBCDIC encoded.

> For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!

Wrong. I do that with csv files all the time. In fact I even have an amazing vim plugin just for them [0]. That's pretty much the point of having a plaintext tabular data storage format: That I can view and edit it using standard text wrangling utilities.

---

There is a much simpler solution to this problem: Don't accept broken CSV. If people keep ignoring standards, that's their problem.

[0]: https://github.com/mechatroner/rainbow_csv


"Broken" is a sliding scale, and it's unfeasible to refuse engaging at all times.

If you are a multi-billion dollar company creating a new integration, you can demand that your small supplier provide an RFC-4180 compliant file, and even refuse to process it if its schema or encoding is not conformant.

If you are the small supplier of a multi-billion dollar company, you will absolutely process whatever it is that they send you. If it changes, you will even adapt your processes around it.

TFA proposes a nice format that is efficient to parse and in some ways better than CSV, in other ways not. Use it if you can and it makes sense.

I agree up to a point. It is a kind of tug-o-war, and yes, the weight of each side plays an important role there.

Nevertheless, even in projects where my services are talking to something that's bigger, I will, at the very least, ask "Why can't it be RFC compliant? Is there a reason?". And without blowing my own horn overly much, quite a few systems larger than mine have changed because someone asked that question.

> https://www.ietf.org/rfc/rfc4180.txt is very clear about what CSV files are supposed to look like

Mm, not really. By its own admission, it is descriptive, not prescriptive:

> This section documents the format that seems to be followed by most implementations

And it came out in 2005, by which date CSVs had already been in use for some twenty or thirty years.

It doesn't matter when it came out, it doesn't matter that it is descriptive. It is the standard, period.

Yes, CSV is much, much older. In fact it predates personal computers. And it went through changes. Again: None of that matters. We have a standard, we should use the standard, and systems should demand the standard.

Standards are meant to ensure minimal-friction interoperability. If systems don't enforce standards, then there is no point in having a standard in the first place.

Yes, but you could argue that web browsers shouldn't accept broken HTML either. But they do, and that's why there is so much broken HTML out there in the wild. Same with broken CSV -- basically people's measure is "if Excel can read it correctly, it's fine" even if not every CSV library in every programming language can.

> This memo provides information for the Internet community. It does not specify an Internet standard of any kind.

Note the qualifier: “not an Internet standard” (my emphasis).

And again: None of that matters. I am not talking about formalities here, I am talking about technical realities.

Whether it is formally called a standard or not doesn't change the fact that this is the document everyone points at when determining what CSV is and is supposed to look like. So it is a de facto standard. Call it a "quasi standard" if that makes you happy.

Oh no; I agree with you completely. I just wanted to point out that the document does not disclaim being a “standard”, it just says that it is not an “Internet standard”.

> Don't accept broken CSV. If people keep ignoring standards, that's their problem.

From the very memo you link to (RFC 4180):

> Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files.

Oh, I am nothing but liberal when it comes to CSV: Clients get the liberty to either have their requests processed, or get a 400 BAD REQUEST.

And yes, I am aware that the standard says this. My counter question to that is: How much client-liberty do I have to accept? Where do I draw the line? How much is too much liberty?

And the answer is: there is no answer. Wherever any system draws that line, it's an arbitrary decision, except for one, which ensures the least surprise and maximum interoperability (aka the point of a standard): to be "conservative", and simply demand the standard.
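For illustration, here is roughly what "simply demand the standard" could look like in code. This is a minimal sketch, not anyone's production parser: `CsvFormatError`, `parse_strict` and `status_for` are hypothetical names, and the strictness choices (comma delimiter, CRLF line breaks only, quotes only around whole fields, `""` as the escaped quote, equal field counts per record) are one assumed reading of RFC 4180. It does not check the RFC's character range for unquoted text.

```python
class CsvFormatError(ValueError):
    """Raised when the input deviates from the strict reading assumed here."""


def _read_field(text: str, i: int) -> tuple[str, int]:
    """Read one field starting at offset i; return (value, next offset)."""
    n = len(text)
    if i < n and text[i] == '"':
        i += 1
        buf = []
        while i < n:
            if text[i] == '"':
                if text.startswith('""', i):  # escaped quote inside the field
                    buf.append('"')
                    i += 2
                    continue
                return "".join(buf), i + 1  # closing quote
            buf.append(text[i])
            i += 1
        raise CsvFormatError("unterminated quoted field")
    start = i
    while i < n and text[i] not in ',\r\n"':
        i += 1
    if i < n and text[i] == '"':
        raise CsvFormatError(f"quote inside unquoted field at offset {i}")
    return text[start:i], i


def parse_strict(text: str) -> list[list[str]]:
    """Parse text or raise CsvFormatError; no guessing, no repairs."""
    if not text:
        return []
    records, row, i, n = [], [], 0, len(text)
    while True:
        field, i = _read_field(text, i)
        row.append(field)
        if i == n:  # last record without a trailing line break
            records.append(row)
            break
        if text[i] == ",":
            i += 1
            continue
        if text.startswith("\r\n", i):
            i += 2
            records.append(row)
            row = []
            if i == n:  # optional line break after the last record
                break
            continue
        raise CsvFormatError(f"unexpected character {text[i]!r} at offset {i}")
    width = len(records[0])
    for lineno, rec in enumerate(records, start=1):
        if len(rec) != width:
            raise CsvFormatError(f"record {lineno} has {len(rec)} fields, expected {width}")
    return records


def status_for(body: str) -> int:
    """Map a request body to the HTTP status a strict endpoint would send."""
    try:
        parse_strict(body)
    except CsvFormatError:
        return 400
    return 200
```

With this sketch, a body using bare `\n` line breaks or a stray quote in an unquoted field gets a 400, and there is no line-drawing to argue about beyond the assumptions listed above.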

I think the suggestion reflected a deep understanding that transitioning from decades of wild-west to standardized in the smooth fashion most likely to succeed would require that strategy.

If you don’t accept whatever some org’s data is encoded with, they won’t consider it a win for standards, or swap out whatever is producing that data for something more compliant. They’ll consider it a bug, and probably use some other more flexible processor.

On the other hand, if you can be flexible enough to allow quirks on import while not perpetuating them on export, eventually you and other software built with the same philosophy standardize the field.
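A rough sketch of that "quirks in, standard out" idea, using only Python's standard `csv` and `io` modules. The function name `normalize` and the choice to skip blank lines are assumptions made for this example; the per-source delimiter is supplied by the caller rather than guessed.

```python
import csv
import io


def normalize(raw: str, delimiter: str = ",") -> str:
    """Read a source-specific dialect, write RFC 4180 style output.

    Import side: accepts the caller-supplied delimiter and whatever line
    endings the csv module's reader understands.
    Export side: always commas, double-quote escaping and CRLF.
    """
    reader = csv.reader(io.StringIO(raw, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        delimiter=",",
        quotechar='"',
        doublequote=True,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\r\n",
    )
    for row in reader:
        if row:  # assumption: blank lines carry no data and are dropped
            writer.writerow(row)
    return out.getvalue()


# A pipe-delimited, LF-terminated file comes out comma-delimited with CRLF.
print(repr(normalize('a|b\n"x, y"|2\n', delimiter="|")))
```

The quirk handling stays on the read side only, so nothing downstream ever sees the source's dialect.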

I do think there’s a point where things are standardized enough that you can safely stop doing that—when all the extra quirk code is so rarely used as to be irrelevant—but I’m unsure if we’ve reached it yet. It would be something to actually analyze, though, rather than just a philosophical decision.

> On the other hand, if you can be flexible enough to allow quirks on import while not perpetuating them on export, eventually you and other software built with the same philosophy standardize the field.

How? The only thing I can see happening is perpetuation of sloppy use of standards. "Why, why should I change my |-delimited CSV dialect that requires a double-semicolon at the end of each row, which is arbitrarily denoted by either \n or \r or \n\r, when all those programmers will accommodate me, no matter how little sense it makes to do so?"

> I do think there’s a point where things are standardized enough that you can safely stop doing that

I agree. And that point was when someone sat down and penned RFC-4180.

Everything after that point has to justify why it isn't RFC compliant, not the other way around.

> In fact I even have an amazing vim plugin just for them

So this is gold. Editing xSV files has been an ongoing pain, and this plugin is just amazingly awesome. Thanks for the link to it.

My pleasure :-)

You mean to say that vim can't handle simple character substitution? /s

No, it isn't, in the real world. It's very much your problem if you're the team consuming these files. Try to go tell the head of accounting they need to make all their data rfc4180 compliant, see how that goes.

> Try to go tell the head of accounting they need to make all their data rfc4180 compliant, see how that goes.

Fun fact: I did. And not just for accounting systems, but all sorts of data ingestion pipelines. Did it work every time? No. Did it work in many cases? Yes. Is that better? Absolutely.

Here is the thing: If I accept broken CSV, where do I stop? What's next? Next thing my webservice backends have to accept broken HTTP? My JSON-RPC backends have to accept JSON with /* */ style block comments? My ODBC load-balancer has to accept natural language instead of SQL statements? (I mean, it's the age of the LLM, I could make that possible.)

I draw the line where the source keeps changing how it's broken.

If things are broken, but in a predictable, standard-for-that-source way... ugh, but at least it's their standard, and if some tweak gets the common tools working for that one standard, then everyone can move on and be happy.