| HN Mirror


by aronhegedus 812 days ago

My takeaway is that CSV has some undefined behaviours, and it takes up space.

I like that everyone knows about .csv files, and it's also completely human readable.

So for files under 100 MB I would still use CSV.

1 comments

nradov 812 days ago

If both parties implement RFC 4180 and use a consistent character set encoding then I don't think there are actually any undefined behaviors. But in practice a lot of implementations are simply broken, including those from major tech companies that ought to know better.
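
As a rough illustration of the "both parties follow the same rules" case, here is a minimal sketch using Python's standard `csv` module, whose default `excel` dialect writes CRLF line endings, quotes only fields that need it, and escapes a quote by doubling it. The sample rows are made up for the example, and this only shows one writer and one reader agreeing with each other, not full RFC 4180 conformance.

```python
import csv
import io

# Hypothetical sample data with the awkward cases: commas, quotes, line breaks.
rows = [
    ["id", "comment"],
    ["1", 'says "hi", then leaves'],
    ["2", "line one\nline two"],
]

# newline="" stops StringIO from translating the line endings the writer emits.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
encoded = buf.getvalue()

# Read the text back with the same dialect and compare with the input.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))
round_trips = decoded == rows
print(round_trips)
```

The fields containing a comma, a quote, or a line break come back unchanged because writer and reader share the same quoting rules; the breakage described above tends to appear when one side does not.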


senknvd 812 days ago

I don't think RFC 4180 differentiates between an empty string and a null value. As long as you add a check that all string columns are free of empty values before writing you should be good.

I think in polars it's

    df.filter(pl.any_horizontal(pl.col(pl.Utf8).str.len_bytes() == 0)).is_empty()

although there's probably a better way to write this.
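
To see why the check matters, here is a small sketch with Python's standard `csv` module (not polars): a missing value and an empty string are written identically, so the reader cannot tell them apart. The column names and values are invented for the example.

```python
import csv
import io

# Hypothetical data: one empty string and one missing value (None).
rows = [["name", "nickname"], ["Ada", ""], ["Bob", None]]

buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)  # the writer emits None as an empty field
text = buf.getvalue()

# Both data rows end in an empty field, so the distinction is gone on read.
read_back = list(csv.reader(io.StringIO(text, newline="")))
print(repr(text))
print(read_back)
```

After the round trip both nicknames are `''`, which is why rejecting empty strings before writing (or agreeing on a sentinel outside the CSV format itself) is one way to keep nulls unambiguous.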


nradov 812 days ago

Well I would consider differentiation between empty string versus null as simply being out of scope for CSV rather than undefined behavior. It was never intended as a complete database dump format.


thayne 812 days ago

And as long as the application doesn't try to convert the cells into non-string data types like numbers, dates, etc.


nradov 812 days ago

Converting strings into other data types is out of scope for CSV, not really undefined behavior. The type conversions happen at a later stage of the import process.


thayne 812 days ago

It's out of scope for the RFC, but it could still be undefined behavior for the import/export process.
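
A minimal sketch of the two stages being discussed, using only the Python standard library. The column names, the sample row, and the `SCHEMA` mapping are all hypothetical; the point is only that parsing yields strings, and whatever happens afterwards is the importer's decision.

```python
import csv
import io
from datetime import date

# Hypothetical input: a ZIP code with a leading zero, an ISO date, a number.
raw = "zip,joined,amount\r\n02139,2021-03-04,1e3\r\n"

# Stage 1: CSV parsing. Every cell comes back as a string.
records = list(csv.DictReader(io.StringIO(raw, newline="")))

# Stage 2: explicit per-column conversion agreed on by both sides (assumed schema).
SCHEMA = {"zip": str, "joined": date.fromisoformat, "amount": float}
typed = [{col: SCHEMA[col](val) for col, val in rec.items()} for rec in records]

# What a type-guessing importer might do instead: treat the ZIP code as a number.
guessed_zip = int(records[0]["zip"])

print(typed[0])
print(guessed_zip)
```

With the explicit schema the ZIP code stays `"02139"`; the guessed conversion silently drops the leading zero. Nothing in the CSV text says which of the two is correct, so it has to be settled somewhere else in the import/export process.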
