by firebacon 2060 days ago
There is https://tools.ietf.org/html/rfc4180

Most CSV files do not follow this standard, of course. But you could normalize all CSV files to RFC4180 (or any other consistent format) as the first step of your processing pipeline.
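As a rough sketch of what that normalization step might look like, here is one way to do it with Python's standard `csv` module. The function name `normalize_csv` and the assumed input dialect (semicolon-delimited, single-quoted) are made up for illustration; a real pipeline would have to know or detect each supplier's dialect.

```python
import csv
import io


def normalize_csv(text, delimiter=";", quotechar="'"):
    """Re-encode delimited text in RFC 4180 style.

    The input dialect (delimiter, quotechar) is an assumption made for
    this sketch. The output uses commas, double quotes escaped by
    doubling, and CRLF line endings.
    """
    # newline="" stops StringIO from translating line endings, so the
    # csv module sees (and emits) them unchanged.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        delimiter=",",
        quotechar='"',
        doublequote=True,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\r\n",
    )
    writer.writerows(reader)
    return out.getvalue()
```

Everything downstream then only has to understand one dialect, whatever the suppliers send.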

3 comments

The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago. If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.
> The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago.

I worked on a team that used CSV somewhat extensively. The data we generated was RFC-compliant. It's pretty trivial to produce RFC-compliant CSVs, too; most languages have a library for it, and ours was in the standard library.

We also had a ("terrible", as we joked) idea to create a subset of CSV that would contain typing information in a required header row. (We never did it, and it is a bad idea.)

> If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.

…but this is absolutely true. We also consumed CSVs from external sources and contractors, and this was an absolute drain on our productivity. I've also worked with engineers of this caliber, and changing CSV wouldn't change the terrible output. I've seen folks approach email and HTTP with a cavalier "oh, it's a trivial text format, I don't need a library!" attitude, and inevitably get it wrong. Pointing out the flaws in their implementation, and that a library would fulfill their use case just fine, is just met with more hacks (not fixes) to try to further munge the output into shape. It is decidedly not software engineering. I've seen this even with JSON.

But yeah, even with RFC-compliant CSV, you shouldn't be parsing it with awk. It is the wrong tool.
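To make that concrete, here is a small sketch of why plain field splitting (which is roughly what a comma field separator in awk gives you) breaks on perfectly valid RFC 4180 input, while a real CSV parser handles it. The sample data and the helper names are invented for illustration.

```python
import csv
import io


def naive_split(line):
    """Split on every comma, ignoring quoting entirely."""
    return line.rstrip("\r\n").split(",")


def parse_csv(text):
    """Parse with the standard library's quote-aware CSV reader."""
    return list(csv.reader(io.StringIO(text, newline="")))


# A valid record: a quoted comma and an escaped (doubled) quote.
line = 'widget,"Smith, Jane","said ""hi"""\r\n'

# A valid file: the second record has a line break inside a quoted field.
multiline = 'id,note\r\n1,"line one\r\nline two"\r\n'

broken = naive_split(line)       # four mangled fields
correct = parse_csv(line)[0]     # three clean fields
rows = parse_csv(multiline)      # two records, not three lines
```

The naive version splits `Smith, Jane` in two and leaves the quote characters in the data; a line-oriented tool would also treat the multi-line record as two separate rows.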

I generally went for either turning them into tab-separated files or using the ASCII codes for record separator and its brethren, depending on the job. I never wanted to touch CSV again after parsing it once.
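For anyone who hasn't seen the ASCII-separator approach, a minimal sketch: parse the CSV once with a real parser, then re-emit it using the unit separator (0x1F) between fields and the record separator (0x1E) between records. The function names are hypothetical, and this version simply refuses input that already contains those bytes rather than trying to escape them.

```python
import csv
import io

UNIT_SEP = "\x1f"    # ASCII 31: separates fields within a record
RECORD_SEP = "\x1e"  # ASCII 30: separates records


def csv_to_ascii_delimited(text):
    """Convert CSV text to unit/record-separator-delimited text."""
    records = []
    for row in csv.reader(io.StringIO(text, newline="")):
        for field in row:
            # No quoting exists in this format, so the separators
            # must never appear inside the data itself.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator byte")
        records.append(UNIT_SEP.join(row))
    return RECORD_SEP.join(records)


def parse_ascii_delimited(data):
    """Read the converted text back with plain string splitting."""
    if not data:
        return []
    return [record.split(UNIT_SEP) for record in data.split(RECORD_SEP)]
```

After the conversion, commas, quotes, and line breaks inside fields are just ordinary data, so splitting on the two separator bytes is enough and no quoting rules are needed.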
But if you’re writing code to normalize before sending to awk, why not just process in the normalization program instead of using awk’s bizarre syntax?
Depends on what you're using to normalize, I suppose. Maybe that's Awk too! Maybe Awk is easier for exploration once the data is already clean, but you have to write the parsing layer in Java or something, and that's not conducive to one-liner exploration.