Hacker News new | ask | show | jobs
by aquafox 112 days ago
I really don't understand why people think it's a good idea to use csv. In english settings, the comma can be used as 1000-delimiter in large numbers, e.g. 1,000,000 for on million, in German, the comma is used as decimal place, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can be used free text fields. Given all that, it is just logical to use tsv instead!
9 comments

I learned to program at 33 or so (in bioinformatics), my first real lesson a couple of days in: "Never ever use csv". I've never used pd.read_csv() without sep="\t". Idk where csv came from, and who thought it was a good idea. It must have been pre-spreadsheet because a tab will put you in the next cell so tabs can simply never be entered into any table by our biologist colleagues.

I guess it's also why all our fancy (as in tsv++?) file types (like GTF and BED) are all tab (or spaces) based. Those fields often have commas in cells for nested lists etc.

I wish sep="\t" was default and one would have to do pd.read/to_tsv(sep=",") for csv. It would have saved me hours and hours of work and idk cross the 79 chars much less often ;)

Tabs can absolutely be entered into cells in multiple ways but the easiest is just copy paste.

And if it’s tab delimited usually people call them tsvs.

Funny story: I once bought and started up Galactic Civilizations 3.

It looked horrible, the textures just wouldn't load no matter what I tried. Finally, on a forum, some other user, presumably also from Europe, noted that you have to use decimal point as a decimal separator (my locale uses a comma). And that solved the problem.

CSV can handle commas in fields just fine (quotes are required in that case). The root problem here is not the format, it's a bug in the CSV exporter used.

https://news.ycombinator.com/item?id=47229064

Clearly this is the issue. This article was 2000 words of trying to work around the actual problem
It's one of those things where people think, it's there, and it works.

The whole business of software engineering exists in the gap between "it works today on this input" and "it will also work tomorrow and the day after and after we've scaled 10x and rewrote the serialization abstraction and..."

See also: "Glorp 5.7 Turbo one-shot this for me and it works!"

I've embedded large JSON blobs in CSVs. The format is fine and quite robust, just never open it in Excel unless you are prepared for your data to be silently broken, but that's Excel being abhorrent, not CSV: Libreoffice and Google Sheets don't do that.

The JSON in CSVs does piss off the Jetbrains table viewer sometimes though, it will randomly decide to stop parsing a 50k line CSV at halfway through the JSON of line 300ish even though that JSON is no different from other JSON it parsed just fine.

But python reads and writes them fine, as does whatever SQL engine I'm touching, as does other tools.

Yes, but tabs also can appear in text fields. If you are free to pick not csv, then perhaps consider feather or parquet?
If you snapped your fingers and removed CSVs from the world your lights would go out within the hour and you'd starve within the week. Trillions of dollars in business are done every day with separated values files and excel computations. The human relationships solve the data issues.
> I really don't understand why people think it's a good idea to use csv.

Because it's easy to understand. Non-technical people understand it. There is tremendous value in that, and that it's underspecified with ugly edge cases doesn't change that.

And you get the under reporting of COVID information in the UK as they passed around CSV files with too many rows for the tools they used.

An interchnage format needs to include information showing that you have all the data - e.g. a hash or the number of rows - or JSON/XML/s-expressions having closing symbols to match the start.

JSON, just use JSON. Or XML, if you don't like JSON.
JSON brings its own set of problems. for example, look at the python generated JSON below.

    >  >>> json.dumps({ "X" : 1 << 66 })
    > '{"X": 73786976294838206464}'
What's the parsing result in javascript ? What's the parsing result in Java ?
What's the difference to CSV?

  number,73786976294838206464
For CSV, I don't know how this comes out. It depends on the library/programming language. It might be 73786976294838210000 or it might throw an exception, or whatever. I'm just saying JSON will not solve your problems neither.
So it always depends on the implementation.

If you need something unambiguously specified, then XML with XSD is still a valid option. All number types are specified exactly, and you can use extensions for custom number types.

what's wrong with protobuf & friends ?