Hacker News
by aronhegedus 812 days ago
My takeaway is that CSV has some undefined behaviours, and it takes up space.

I like that everyone knows about .csv files, and it's also completely human readable.

So for files under 100 MB I would still use CSV.


If both parties implement RFC 4180 and use a consistent character set encoding then I don't think there are actually any undefined behaviors. But in practice a lot of implementations are simply broken, including those from major tech companies that ought to know better.
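To make the "both parties implement RFC 4180" point concrete, here is a minimal sketch using Python's standard `csv` module, whose default "excel" dialect writes CRLF line endings and doubles embedded quotes in the way the RFC describes. The sample row is made up for illustration.

```python
import csv
import io

# Hypothetical sample row: a plain field, a comma, a double quote, a line break.
row = ["plain", "has,comma", 'has "quote"', "has\nnewline"]

# Write with the default "excel" dialect (CRLF line ends, doubled quotes).
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
encoded = buf.getvalue()

# Read it back; newline="" keeps the embedded line break intact.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))

print(repr(encoded))
print(decoded == row)
```

When writer and reader agree on the quoting rules like this, the tricky fields survive the round trip unchanged.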
I don't think RFC 4180 differentiates between an empty string and a null value. As long as you add a check that all string columns are free of empty values before writing you should be good.

I think in polars it's

    # True when no string column contains an empty string
    df.filter(pl.any_horizontal(pl.col(pl.Utf8).str.len_bytes() == 0)).is_empty()
although there's probably a better way to write this.
Well, I would consider the distinction between an empty string and null as simply being out of scope for CSV rather than undefined behavior. It was never intended as a complete database dump format.
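A small sketch of the empty-string-versus-null point, again assuming Python's standard `csv` module with default settings (the column names are invented): `None` and `""` are written as the same empty field, so the reader cannot tell them apart.

```python
import csv
import io

buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerow(["id", "name"])  # hypothetical column names
writer.writerow([1, ""])         # empty string
writer.writerow([2, None])       # null / missing value

# Both data rows come back with an empty string in the second field.
rows = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))
print(rows)
```

Any distinction between the two has to be carried by a convention outside the format, such as a sentinel value both sides agree on.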
And only as long as the application doesn't try to convert the cells into non-string data types like numbers, dates, etc.
Converting strings into other data types is out of scope for CSV, not really undefined behavior. The type conversions happen at a later stage of the import process.
It's out of scope for the RFC, but it could still be undefined behavior for the import/export process.
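As a rough illustration of where type conversion sits, this sketch (standard `csv` module, made-up column names and values) shows that the parser hands back strings only, and that any conversion is a separate, application-defined step that can change the data.

```python
import csv
import io
from datetime import date

# Hypothetical input: a ZIP code with a leading zero and an ISO date.
raw = "zip,joined\r\n02134,2021-03-04\r\n"

# Stage 1: CSV parsing. Every cell is a plain string.
record = next(csv.DictReader(io.StringIO(raw, newline="")))

# Stage 2: conversion chosen by the importer, not by the CSV format.
zip_as_int = int(record["zip"])                 # leading zero is lost
joined = date.fromisoformat(record["joined"])   # only works for ISO dates

print(record, zip_as_int, joined)
```

Whether `02134` should stay a string or become a number is decided in the second stage, which is why two importers reading the same file can disagree.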