by hermitcrab
1206 days ago
If a CSV has quoting (e.g. because the data contains comma or quote chars) aren't you effectively forced to parse it in a single thread? See also:
'Why isn’t there a decent file format for tabular data?'
https://news.ycombinator.com/item?id=31220841
|
It is true there will be exceptions, such as when you know you want to read only the second half of the file. In that case CSV with quoting does not give you a direct way to find that halfway point without parsing the first half.
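To make that halfway-point problem concrete, here is a minimal sketch in Python. It is not how xsv or zsv do it; `row_starts` is a hypothetical helper and the sample data is made up. It assumes RFC 4180-style quoting, where a quote inside a field is written as a doubled quote, so a simple toggle keeps the state right.

```python
def row_starts(data: bytes) -> list[int]:
    """Byte offsets at which each record begins (hypothetical helper).

    Tracks whether we are inside a quoted field; a newline only ends a
    record when we are outside quotes. A doubled quote ("") toggles the
    state twice, so it leaves the state unchanged.
    """
    starts = [0]
    in_quotes = False
    for i, b in enumerate(data):
        if b == 0x22:  # double-quote character
            in_quotes = not in_quotes
        elif b == 0x0A and not in_quotes:  # newline outside quotes
            if i + 1 < len(data):
                starts.append(i + 1)
    return starts


# Made-up sample: the second record has a newline inside a quoted field.
data = b'id,note\n1,"line one\nline two"\n2,plain\n'

starts = row_starts(data)

# Naive approach: jump to the middle and take the next newline as a row start.
naive = data.find(b"\n", len(data) // 2) + 1
lands_on_real_row = naive in starts
```

Here the naive jump lands inside the quoted field rather than on a record boundary, which is why the serial scan is needed to find safe split points. Once you have them, the rows after any split point can be handed to separate workers.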
I suppose whether this is worth the other pros/cons will be situation-dependent. For my use cases, which are daily, CSV parsing speed with something like xsv or zsv has never, by itself, been a material performance concern.
Where I think the CSV parsing downside is much greater than the fact that it must be serial (which, as described above, does not prevent parallelized processing) is in type conversion, not just of numbers but in particular of dates: it can be expensive to convert the text "March 6, 2023" to a date variable. However, if you have control over the format, you could just as easily have printed that as an integer such as 44991, which reduces the problem to one of integer conversion. That is still always going to be slower than a binary format, but isn't so bad performance-wise.
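A rough Python sketch of the two conversions, for illustration only. I am assuming 44991 is a spreadsheet-style serial day count with day 0 at 1899-12-30 (the convention under which 44991 comes out as March 6, 2023); the function names and the format string are my own, not from any particular tool.

```python
from datetime import date, datetime, timedelta

# Assumption: serial day numbers count days from 1899-12-30.
SERIAL_EPOCH = date(1899, 12, 30)


def from_text(field: str) -> date:
    """Parse a formatted date such as 'March 6, 2023'.

    strptime has to interpret the format string and match a month name,
    which is considerably more work per field than reading digits.
    %B depends on the locale; this assumes English month names.
    """
    return datetime.strptime(field, "%B %d, %Y").date()


def from_serial(field: str) -> date:
    """Convert a serial day count such as '44991': one int() plus an add."""
    return SERIAL_EPOCH + timedelta(days=int(field))


text_result = from_text("March 6, 2023")
serial_result = from_serial("44991")
same_day = text_result == serial_result
```

If you want numbers for your own data, `timeit` on both functions over a few hundred thousand fields should show the gap; I would expect the integer path to be much cheaper, but measure before relying on that.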