Hacker News new | ask | show | jobs
by tfvlrue 442 days ago
In the past I remember that Excel not properly handling UTF-8 encoded text in a CSV. It would treat it as raw ASCII (or possibly code page 1252). So if you opened and saved a CSV, it would corrupt any Unicode text in the file. It's possible this has been fixed in newer versions, I haven't tried in a while.
1 comments

It's related to how older versions of Windows/Office handled Unicode in general.

From what I have heard, it's still an issue with Excel, although I assume that Windows may handle plain text better these days (I haven't used it in a while)

You need to write an UTF-8 BOM at the beginning (0xEF, 0xBB, 0xBF), if you want to make sure it's recognized as UTF-8.

Ugh, UTF-8 BOM. Many apps can handle UTF-8 but will try to return those bytes as content; maybe ours in 2015 too

I was on the Power Query team when we were improving the encoding sniffing. An app can scan ahead i.e. 64kB, but ultimately the user needs to just say what the encoding is. All the Power Query data import dialogs should let you specify the encoding.

UTF-8 BOM is probably not a good idea for anything other than (maybe) plain text documents. For data, many (although not all) programs should not need to care about character encoding, and if they include something such as UTF-8 BOM then it will become necessary to consider the character encoding even though it shouldn't be necessary.