| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by itohihiyt 616 days ago
	What's wrong with CSV? I love me a CSV file.

2 comments

adammarples 616 days ago

Having the control characters mixed in with the data, and having no defined encoding.

link

mschuster91 616 days ago

Out of the random shit I had to deal with CSV file wrangling in the last two years:

- no defined encoding, so it may be anything from US-ASCII over ISO-8859-x to UTF-8 and it's always a guesswork what it actually is

- no definition of the separator sign, usually it can be a literal comma, a semicolon, or tabs

- escaping of the separator sign in column values is always fun

- escaping of newlines is even MORE fun

- line endings are not specified

Every piece of software there is has their own expectations on what specific kind of format it expects, and to make it worse, using "Microsoft Excel" as a target/source isn't foolproof either because it behaves differently between the Mac and the Windows version!

JSON in contrast is clearly defined in all of these issues and has battle-tested implementations in every programming language there is.

link

seanhunter 616 days ago

Most of these issues while real don't actually arise in this case, because we're not trying to ETL some random file, we are the ones talking to the database so we get to choose exactly how the data gets formatted on extract.

For example, here's your list fully handled in postgres: 1. SET CLIENT_ENCODING TO 'value'; (eg 'UTF8') 2. COPY ... with FORMAT CSV DELIMITER 'delimiter_character' QUOTE 'quote_character'

Now the output format is fully specified and everything just works fine (including for input into excel)

   The values in each record are separated by the DELIMITER character. If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character.

https://www.postgresql.org/docs/current/sql-copy.html

link

mschuster91 616 days ago

Yeah, the problem is not every application can read such files. Particularly when newlines are involved, it's a hit and miss.

link

seanhunter 616 days ago

Well not every application can read any file and either way "sq" isn't going going to make any difference. Like I said they are real issues but don't arise in this case.

link