|
|
|
|
|
by akbo
2485 days ago
|
|
These tools probably do a good job at processing CSV/TSV/DSV (haven't tried them). However, I would love if we could just stop using delimiter-separated value files alltogether. Why? Because the file format is an underspecified, unsafe mess. When you get this kind of a file you have to manually specify its schema when reading it because the file doesn't contain it. Also, due to its underspecification, there are many unsafe implementations that produce broken files that cannot be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like AVRO, Parquet or ORC. As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv... |
|
We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.
Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.