by zyedidia 979 days ago
Does anyone know why LSP uses UTF-16 for encoding columns? It seems like everyone agrees it is a bad choice, so I'm curious about the original reasoning. Are there any benefits at all to using UTF-16, or was it something to do with Microsoft legacy code?
4 comments

The JavaScript VM, Java VM, .NET VM, and several other runtimes (including effectively the entire Windows API) have their fundamental definition of strings be based on UTF-16 baked in.

I believe the original producers and consumers of LSP were written in languages that had string lengths based on UTF-16, so it was the literal easiest way to do it, even though UTF-16 is probably objectively the most painful thing to compute if your string system isn't UTF-16.
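To make the pain concrete, here is a rough Python sketch of what a server has to do when its own strings are not UTF-16. The helper names `utf16_column` and `utf16_column_from_utf8` are made up for illustration and are not part of any LSP library; the sketch assumes the offset falls on a character boundary.

```python
def utf16_column(line: str, codepoint_index: int) -> int:
    """Convert a code point index within a line to a UTF-16 code unit column."""
    prefix = line[:codepoint_index]
    # Each UTF-16 code unit is 2 bytes. Characters outside the Basic
    # Multilingual Plane (most emoji, for example) take two code units.
    return len(prefix.encode("utf-16-le")) // 2


def utf16_column_from_utf8(line_bytes: bytes, byte_offset: int) -> int:
    """Same conversion, starting from a UTF-8 byte offset.

    Assumes byte_offset lands on a character boundary; otherwise the
    decode step raises UnicodeDecodeError.
    """
    prefix = line_bytes[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


line = "a😀b"
# 'b' is at code point index 2 and at UTF-8 byte offset 5 (1 + 4),
# but at UTF-16 column 3 (1 + 2).
print(utf16_column(line, 2))
print(utf16_column_from_utf8(line.encode("utf-8"), 5))
```

Either way you end up re-encoding (or at least re-scanning) the prefix of the line for every position you send or receive, which is the cost people complain about.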

LSP eventually got a solution where you can request something other than UTF-16 offset calculations, but I don't remember the details of what that solution is.
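As far as I can tell, this landed in LSP 3.17: the client lists the encodings it supports under `general.positionEncodings` in its capabilities (in order of preference), and the server answers with a single `positionEncoding` in its own capabilities. If either side omits the field, UTF-16 is assumed. A minimal sketch of the server-side choice, where the function name `choose_position_encoding` and the plain-dict representation are my own assumptions rather than any real library's API:

```python
DEFAULT_ENCODING = "utf-16"  # what applies when nothing is negotiated


def choose_position_encoding(client_capabilities: dict, server_supported: set) -> str:
    """Pick the first client-preferred encoding that the server also supports."""
    general = client_capabilities.get("general", {})
    # An absent or empty list is treated as offering only UTF-16.
    offered = general.get("positionEncodings") or [DEFAULT_ENCODING]
    for encoding in offered:
        if encoding in server_supported:
            return encoding
    # No overlap: fall back to UTF-16 for backwards compatibility.
    return DEFAULT_ENCODING


client = {"general": {"positionEncodings": ["utf-8", "utf-16"]}}
print(choose_position_encoding(client, {"utf-8", "utf-16"}))
print(choose_position_encoding({}, {"utf-8", "utf-16"}))
```

The upshot is that a UTF-8-native server still has to keep a UTF-16 code path around for clients that never opt in.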

There was a lengthy discussion on this [1]. UTF-16 was used because it was convenient: it's what Microsoft APIs and JavaScript already use (the latter being the language VS Code is written in).

[1] https://github.com/microsoft/language-server-protocol/issues...

That thread was infuriating. Since when does an encoding format have an evangelical task force? I'm all for UTF8 everywhere but wow some of the replies were super cringe.

Even when the proposal of "UTF-16 default, UTF-8 optional" was made to keep backwards compatibility, it was not enough. It has to be UTF8 because it's superior technically, as if that's the only consideration! I agree they should've just picked one, but I still don't think the maintainers needed a refresher on what UTF-8 is every three comments.

Count me among the UTF-8 everywhere absolutists. There are two ways to encode text: UTF-8, and a worse choice.

But I wouldn’t be annoying about it. I’d just tut tut from afar. (Though if the decision is still up in the air, I’d argue as passionately as any preacher to persuade our fellow devs to adopt our lord and savior UTF-8 into their hearts and minds.)

Yeah I would absolutely take utf8 everywhere. I hate dealing with anything else.

But I think the worst part was that the maintainer was clear that he/she wasn't debating this on a technical level. Like, they weren't trying to decide which encoding was better. From what I understand it was more about how best to deal with the (at the time) current design choices without breaking the current implementations, and feedback from actual implementers.

I'm inclined to agree that some manner of backwards compatibility is important. A middle ground with a path towards exclusive UTF-8 use seems like a fine compromise. However, three things come to mind:

* LSP is being used outside of VSCode, and while UTF-16 may be helpful in that case, it's a hindrance for others.

* Institutional knowledge of UTF-16 ain't great at Microsoft either. Github broke rendering of multibyte characters, and it took a random GH user explaining to the devs how multibyte characters and strings interact in JavaScript before that got fixed.

* [insert lots of handwaving about the downsides of electron]

For earlier archaeology see [19]. It seems to me people had started coding extensions in VS Code without giving any real thought to the question, so the default choice inherited from the language was UTF-16.

[19]: https://github.com/microsoft/language-server-protocol/issues...

Because JavaScript uses UTF-16 for everything, and LSP is a TypeScript-first protocol.
Sadly there is some precedent for that: JavaScript source maps use the same definition for columns.