| HN Mirror

	by beart 103 days ago
	I agree with your first point. I've seen this same issue crop up in several other ORMs. As to your second point, VARCHAR uses N + 2 bytes whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store unicode values.
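
A rough sketch of the arithmetic above, in Python. It assumes a single-byte code page for VARCHAR (cp1252 is picked purely for illustration), UTF-16 for NVARCHAR, and the 2-byte overhead quoted in the comment. The function names are made up, and real SQL Server row storage has more moving parts than this.

```python
def varchar_bytes(value: str, overhead: int = 2) -> int:
    """Approximate VARCHAR storage: N bytes plus overhead.

    Assumes a single-byte code page (cp1252, chosen for illustration).
    """
    return len(value.encode("cp1252")) + overhead


def nvarchar_bytes(value: str, overhead: int = 2) -> int:
    """Approximate NVARCHAR storage: N*2 bytes (UTF-16) plus overhead."""
    return len(value.encode("utf-16-le")) + overhead


for sample in ("A", "SHIPPED", "Backordered"):
    print(sample, varchar_bytes(sample), nvarchar_bytes(sample))
```

For a 7-character value such as "SHIPPED" this gives 9 bytes versus 16 bytes, which is the N + 2 versus N*2 + 2 gap described above.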

3 comments

wvenable 103 days ago

> The vast majority of character fields in databases I've worked with do not need to store unicode values.

This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.

SigmundA 103 days ago

The vast majority of text fields I see are coded values that are perfectly fine using ASCII, but I deal mostly with English-language systems.

Text fields that users can type into directly, especially multiline ones, tend to need Unicode, but they are far fewer.

simonask 103 days ago

English has plenty of Unicode — claiming otherwise is such a cliché…

Unicode is a requirement everywhere human language is used, from Earth to the Boöotes Void.

SigmundA 103 days ago

I am talking about coded values, like Status = 'A', 'B' or 'C'

Taking double the space for this stuff is a waste of resources, and nobody usually cares about extended characters here, in English-language systems at least; they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches, which can have Unicode.

wvenable 103 days ago

It's way better to just use a DBMS that supports enums. I know SQL Server isn't one of those, but I still don't store my coded values as strings.

SigmundA 103 days ago

How do you store them? Also enums are not user configurable normally. It would be a good feature to have them, but they don't work well in many cases.

Typical code tables with code, description and anything else needed for that value which the user can configure in the app.

Sure, you can use integers instead of codes, but now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also, ints are not variable length, so you're wasting space for short codes and you have to know ahead of time if it's only going to be 1, 2, 4 or 8 bytes.

andy81 103 days ago

The way to do enums in SQL (generally, not just MSSQL) is another table. It's better that they don't offer several ways to do the same thing.
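
A minimal sketch of the "another table" approach, using Python's built-in sqlite3 module as a stand-in so it runs anywhere. SQLite is not SQL Server, and the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE order_status (
        code        TEXT PRIMARY KEY,   -- short ASCII code, e.g. 'A'
        description TEXT NOT NULL       -- user-facing text, may be any Unicode
    );
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        status TEXT NOT NULL REFERENCES order_status(code)
    );
    INSERT INTO order_status VALUES
        ('A', 'Active'), ('B', 'Backordered'), ('C', 'Cancelled');
    INSERT INTO orders (id, status) VALUES (1, 'A'), (2, 'C');
    """
)

# Ad-hoc queries still show the readable code; the join adds the description.
rows = conn.execute(
    """
    SELECT o.id, o.status, s.description
    FROM orders AS o
    JOIN order_status AS s ON s.code = o.status
    ORDER BY o.id
    """
).fetchall()
print(rows)
```

Because the lookup table is ordinary data, an application can let users add or rename entries, which a built-in enum type usually does not allow.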

kstrauser 103 days ago

Those are all single byte characters in UTF-8.

SigmundA 103 days ago

We are talking nvarchar here. Yes, UTF-8 solves this issue completely, and MSSQL supports it nowadays with varchar.
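
To make the byte counts concrete, here is a small Python sketch comparing the encoded size of the status codes in UTF-8 and UTF-16. It only measures raw encoded text, not how any particular database lays out a row, and the extra "é" sample is added here purely for contrast.

```python
codes = ["A", "B", "C", "é"]

sizes = {}
for code in codes:
    utf8_len = len(code.encode("utf-8"))
    utf16_len = len(code.encode("utf-16-le"))  # "-le" avoids the 2-byte BOM prefix
    sizes[code] = (utf8_len, utf16_len)
    print(f"{code!r}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```

The ASCII codes take one byte each in UTF-8 and two in UTF-16, while "é" takes two bytes in both.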

croes 103 days ago

But nvarchar is UTF-16

simonask 103 days ago

No. Look closer.

Slothrop99 103 days ago

Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.

Not that I disagree: Win32/C#/Java/etc. have 16-bit characters, so your entire system is already 'paying the price'; it's weird to get frugal here.
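
One way to check whether a given string would survive in a CP1252 column is to try encoding it, sketched here in Python. The helper name and the sample strings are illustrative only.

```python
def fits_cp1252(text: str) -> bool:
    """Return True if every character in text has a CP1252 encoding."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


for sample in ("café", "naïve", "€", "καλημέρα"):
    print(sample, fits_cp1252(sample))
```

Accented Latin letters and the euro sign fit, while Greek text does not.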

simonask 103 days ago

My comment contains two glyphs that are not in CP1252.

NegativeLatency 103 days ago

Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji

SigmundA 103 days ago

Specifically not talking about a name field

zabzonk 103 days ago

> Unicode is a requirement everywhere human language is used

Strange then how it was not a requirement for many, many years.

Macha 102 days ago

Oh, it was. It was fun being unable to type a euro sign or the name Seán without it being garbled. Neither were matched quotation marks, and arguably computer limitations killed off naïve and café too.

Don’t confuse people groaning and putting up with limitations as justifying those limitations.

pjmlp 102 days ago

In Portugal it always was, that is why we got to use eh for é, ah for á, he for è, c, for ç and many other tricks.

Shared by other European languages, like oe for ö in German, kalimera for καλημέρα, and so on all around the world in non-English speaking countries during the early days of computing.

procaryote 103 days ago

It was a mess back then though. Unicode fixed that.

zabzonk 103 days ago

I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.

paulddraper 101 days ago

Or rather, computers had inadequate support.

psidebot 103 days ago

Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…

SigmundA 103 days ago

To complicate matters SQL Server can do Nvarchar compression, but they should have just done UTF-8 long ago:

https://learn.microsoft.com/en-us/sql/relational-databases/d...

Also UTF-8 is actually just a varchar collation so you don't use nvarchar with that, lol?

_3u10 103 days ago

Generally if it stores user input it needs to support Unicode. That said UTF-8 is probably a way better choice than UTF-16/UCS-2

SigmundA 103 days ago

UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

I have avoided it and have not followed if the issues are fully resolved, I would hope they are.

kstrauser 103 days ago

> UTF-8 is a relatively new thing in MSSQL and had lots of issues initially, I agree it's better and should have been implemented in the product long ago.

Their insistence on making the rest of the world go along with their obsolete pet scheme would be annoying if I ever had to use their stuff for anything ever. UTF-8 was conceived in 1992, and here we are in 2026 with a reasonably popular database still considering it the new thing.

da_chicken 103 days ago

I would be more critical of Microsoft choosing to support UCS-2/UTF-16 if Microsoft hadn't completed their implementation of Unicode support in the 90s and then been pretty consistent with it.

Meanwhile, Linux had a years-long blowout in the early 2000s over switching to UTF-8 from Latin-1. And you can still encounter Linux programs that choke on UTF-8 text files or multi-byte characters 30 years later (`tr` being the one I can think of offhand). AFAIK, a shebang is still incompatible with a UTF-8 byte order mark. Yes, the UTF-8 BOM is both optional and unnecessary, but it's also explicitly allowed by the spec.
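
A small Python sketch of why a BOM and a shebang clash, assuming the loader only recognises a script when the file begins with the two bytes `#!`. The script content is a made-up example.

```python
import codecs

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")
with_bom = script.encode("utf-8-sig")  # "utf-8-sig" prepends the UTF-8 BOM

print(plain[:2])     # the file starts with the shebang bytes
print(with_bom[:5])  # three BOM bytes now sit in front of "#!"
print(with_bom.startswith(codecs.BOM_UTF8))
```

With the BOM in place the first two bytes are no longer `#!`, so a check on the start of the file fails even though the text looks identical in an editor.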

downsplat 103 days ago

It's not really a Linux vs MS thing though. When Unicode first came out, it was 16-bit, so all the early adopters went with that. That includes Java, Windows, JavaScript, the ICU libraries, LibreOffice and its predecessors, .NET, the C language (remember wchar_t?), and probably a few more.

UTF-8 turned out to be the better approach, and it's slowly taking over, but it was not only Linux/Unix that pushed it ahead; the entire networking world did, especially HTTP. Props also to early Perl for jumping straight to UTF-8.

Still... UTF-8's superiority was clear enough by 2005 or so; MS could and should have seen it by then instead of waiting until 2019 to add UTF-8 collations to its database. Funny to see SQL Server falling behind good old MySQL on such a basic feature.

wvenable 102 days ago

Database systems are inherently conservative -- once you add something you have to support it forever. Microsoft went hog wild on XML in the database and I haven't seen it used in over a decade now.

recursive 103 days ago

In 92 it was a conference talk. In 98 it was adopted by the IETF. Point probably stands though.

swasheck 103 days ago

the data types were introduced with SQL Server 7 (1998) so i’m not sure it’s accurate to state that it’s considered as the new thing.

SigmundA 103 days ago

UTF-8 was introduced in SQL Server 2019:

https://learn.microsoft.com/en-us/sql/sql-server/what-s-new-...

swasheck 98 days ago

thanks. now i see the point that the poster was making.

Dwedit 103 days ago

The one place UTF-16 massively wins is text that would be two bytes as UTF-16, but three bytes as UTF-8. That's mainly Chinese, Japanese, Korean, etc...
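
A quick Python sketch of that size difference. The sample strings are arbitrary greetings chosen for illustration, and the counts cover raw encoded text only.

```python
samples = {
    "English": "hello",
    "Chinese": "你好世界",
    "Japanese": "こんにちは",
    "Korean": "안녕하세요",
}

sizes = {}
for language, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # little-endian, no BOM
    sizes[language] = (utf8_len, utf16_len)
    print(f"{language}: {len(text)} chars, UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

The CJK samples cost three bytes per character in UTF-8 and two in UTF-16, while the English sample goes the other way: one byte per character in UTF-8 and two in UTF-16.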
