Hacker News new | ask | show | jobs
by dspillett 105 days ago
> I'm not sure why anyone would choose varchar for a column in 2026

The same string takes roughly half the storage space, meaning more rows per page and therefore a smaller working set needed in memory for the same queries and less IO. Also, any indexes on those columns will also be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹, otherwise use N[VAR]CHARs.

Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.

Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages every access because they remain compressed in memory as well as on-disk.

--------

[0] Codes with fixed ranges, etc.

[1] Some would say that the other way around, and “use NVARCHAR if you think there might be any non-ASCIII characters”, but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.

1 comments

utf16 is more efficient if you have non-english text, utf8 wastes space with long escape sequences. but the real reason to always use nvarchar is that it remains sargeable when varchar parameters are implicitly cast to nvarchar.
UTF-16 is maybe better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one (thus 2 bytes) UTF-16 code unit. This is extremely rare for general text and so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.
The old defense of 16-bit chars, popping up in 2026 still! Utf8 is efficient enough for all general purpose uses.

If you're storing gigabytes of non-latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today to not worry and use utf8 for everything.j

it certainly isn't the best choice for sql server, see: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-i...
What do you mean with non-english text? I don't think "Ä" will be more efficient in utf16 than in utf8. Or do you mean utf16 wins in cases of non-latin scripts with variable width? I always had the impression that utf8 wins on the vast majority of symbols, and that in case of very complex variable width char sets it depends on the wideness if utf16 can accommodate it. On a tangent, I wonder if emoji's would fit that bill too..
Japanese, Chinese, Korean and Indic scripts are mostly 2 bytes per character on UTF-16 and mostly 3 bytes per character in UTF-8.
Really, as an East Asian language user the rest of the comments here make me want to scream.
I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

there are over a million codepoints in unicode, thousands for latin and other language agnostic symbols emojis etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications hence it being the internal format of strings in C# and sql server and such.

the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias - sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most peoples

hn often makes me want to scream
The non sargeablilty is an optimizer deficiency IMO. It could attempt to cast just like this article is doing manually in code, if that success use index, if it fails scan and cast a million times the other way in a scan.
implicit casts should only widen to avoid quiet information loss, if the optimizer behaved as you suggest the query could return incorrect results and potentially more than expected, with even worse consequences
It should not return incorrect results, if the nvarchar only contains ascii it will cast perfectly, if it doesn't then do the slow scan path, it's a simple check and the same work its doing for every row in the current behavior except one time and more restricted. Can you give me an example of an incorrect result here?

I am not talking about the default cast behavior from nvarchar to varchar, but a specific narrow check the optimizer can use to make decision in the plan of ascii or not with no information loss because it will do the same thing as before if it does not pass the one time parameter check.

By far the most common cause of this situation is using ascii only in a nvarchar because like say in this example the client language is using an nvarchar equivalent for all strings, which is pretty much universal now days and that is the default conversion when using a sql client library, one must remember to explicitly cast rather than the db doing it for you which is the expected behavior and the source of much confusion.

This would be purely an optimization fast path check otherwise fall back to the current slow path, correct results always with much faster results if only ascii is present in the string.