| HN Mirror



	by tfvlrue 442 days ago
	I remember that, in the past, Excel did not properly handle UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions; I haven't tried in a while.


qw 442 days ago

It's related to how older versions of Windows/Office handled Unicode in general.

From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while).

You need to write a UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF) if you want to make sure it's recognized as UTF-8.
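
For illustration, a minimal Python sketch of writing those three bytes in front of CSV data. The helper name `csv_bytes_with_bom` and the sample rows are made up for this example; the `utf-8-sig` codec is Python's standard name for "UTF-8 with a leading BOM". Whether a given Excel version then picks the right encoding is not something this snippet can guarantee.

```python
import codecs
import csv
import io


def csv_bytes_with_bom(rows):
    """Serialize rows as CSV and return bytes that start with EF BB BF."""
    buf = io.StringIO(newline="")  # newline="" keeps the csv module's \r\n untouched
    csv.writer(buf).writerows(rows)
    # Encoding with "utf-8-sig" prepends the UTF-8 BOM to the output.
    return buf.getvalue().encode("utf-8-sig")


# Hypothetical sample data with non-ASCII characters.
data = csv_bytes_with_bom([["name", "city"], ["Zoë", "Kraków"]])
starts_with_bom = data.startswith(codecs.BOM_UTF8)
```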


darthwalsh 442 days ago

Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those bytes as content; maybe ours did too in 2015.
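
A small Python sketch of that failure mode, using a made-up header row: decoding as plain `utf-8` keeps the BOM as a U+FEFF character glued to the first field, while the `utf-8-sig` codec strips it when present.

```python
# Hypothetical CSV header bytes, prefixed with the UTF-8 BOM.
raw = b"\xef\xbb\xbfname,city\r\n"

as_utf8 = raw.decode("utf-8")      # BOM survives as U+FEFF at the start of the text
as_sig = raw.decode("utf-8-sig")   # BOM is dropped if it is there

first_header_plain = as_utf8.split(",")[0]
first_header_sig = as_sig.split(",")[0]

# A lookup such as row["name"] would miss the first column in the plain case,
# because the key is actually "\ufeffname".
```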

I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead, e.g. 64 kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.
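
A rough Python sketch of that idea, and explicitly not Power Query's actual logic: look at the first 64 kB, check for a BOM, then check whether the window is valid UTF-8, and otherwise fall back. The function name `sniff_encoding`, the `cp1252` fallback, and the `user_encoding` override are all assumptions for this example.

```python
import codecs
from typing import Optional

SNIFF_BYTES = 64 * 1024  # how far to scan ahead


def sniff_encoding(data: bytes,
                   user_encoding: Optional[str] = None,
                   fallback: str = "cp1252") -> str:
    """Guess an encoding name from the start of data; an explicit choice wins."""
    if user_encoding:
        return user_encoding
    head = data[:SNIFF_BYTES]
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # If the window cuts the input short, a multi-byte sequence may be split
    # at the edge, so only treat the window as final when it is the whole input.
    final = len(data) <= SNIFF_BYTES
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=final)
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

This can only ever be a guess: pure ASCII input is valid in many encodings, and bytes past the window are never checked, which is why the explicit override comes first.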


zzo38computer 441 days ago

A UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding; if the data includes something such as a UTF-8 BOM, those programs are forced to consider the character encoding even though it shouldn't be necessary.
