by wongarsu 49 days ago
Until you hit a CSV exported by Excel

I should have said "a text file with no byte-order mark". I would hope that Excel's CSV export, if it's writing UTF-16, writes a byte-order mark first (though I don't have any Excel-exported CSVs lying around right now to check). The byte-order mark is necessary for UTF-16, since it has big-endian and little-endian variants, but unnecessary (and actually harmful in a few situations) for UTF-8.

So naturally, if you assume something is UTF-8 but the first two bytes you encounter are FF FE or FE FF (the bytes FF and FE never occur in valid UTF-8), then instead of throwing an error saying "Hey, that's illegal UTF-8, buddy!" you should just reparse as UTF-16 (and you now know the correct byte order to use). In fact, you should read four bytes just to make sure you're not seeing FF FE 00 00, because that would indicate a UTF-32LE document. (Which points to an ambiguity in UTF-16 that UTF-8 doesn't have: a UTF-16LE document whose first character after the BOM is U+0000 starts with those same four bytes, so it is likely to be misinterpreted as UTF-32LE.)
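To make the fallback concrete, here is a minimal Python sketch of that sniffing order. The names `sniff_encoding` and `decode_text` are made up for illustration; only the `codecs.BOM_*` constants come from the standard library. It assumes UTF-8 whenever no BOM is present, and it is not meant as a full charset detector.

```python
import codecs
from typing import Tuple

# Longest BOMs first, so FF FE 00 00 is matched as UTF-32LE
# before the shorter FF FE prefix can match as UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_encoding(data: bytes) -> Tuple[str, int]:
    """Return (encoding, BOM length); fall back to UTF-8 with no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_text(data: bytes) -> str:
    """Decode bytes, skipping the BOM if one was found."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)
```

Note that this inherits the ambiguity mentioned above: a UTF-16LE file whose first character after the BOM is U+0000 will be sniffed as UTF-32LE, and there is no way to tell the two apart from the first four bytes alone.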

Before I go off on too much of a rabbit trail, I have two points I want to make:

1. Since UTF-8 should be the default assumption for any sensible software, a byte-order mark is not needed for UTF-8, but any other Unicode encoding should use one. (And in fact UTF-16 and UTF-32 need a BOM, because both have LE and BE variants.)

2. Excel needs to fix its stupid CSV import/export defaults.

Another classic from Microsoft: the Language Server Protocol counts character offsets in UTF-16 code units by default. We could be paying that price for the rest of time.
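For a sense of what that price looks like in practice: as far as I know, LSP positions count UTF-16 code units by default, so a tool that indexes strings by code point (as Python does) has to convert every column. A rough sketch, where `utf16_offset` is a hypothetical helper and not part of any LSP library:

```python
def utf16_offset(line: str, index: int) -> int:
    """Convert a code-point index into `line` to a UTF-16 code-unit offset."""
    # Each UTF-16 code unit is two bytes; characters outside the BMP
    # are encoded as a surrogate pair, i.e. two code units.
    return len(line[:index].encode("utf-16-le")) // 2


line = "a\U0001F600b"  # 'a', an emoji outside the BMP, 'b'
print(utf16_offset(line, 1))  # 1: only 'a' precedes the emoji
print(utf16_offset(line, 2))  # 3: the emoji takes two UTF-16 code units
```

Re-encoding the prefix on every lookup is slow on long lines; it is only meant to show where the two ways of counting diverge.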