| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wvenable 146 days ago
	> The vast majority of character fields in databases I've worked with do not need to store unicode values. This has not been my experience at all. Exactly the opposite, in fact. ASCII is dead.

1 comments

SigmundA 146 days ago

Vast majority of text fields I see are coded values that are perfectly fine using ascii, but I deal mostly with English language systems.

Text fields that users can type into directly especially multiline tend to need unicode but they are far fewer.

link

simonask 146 days ago

English has plenty of Unicode — claiming otherwise is such a cliché…

Unicode is a requirement everywhere human language is used, from Earth to the Boöotes Void.

link

SigmundA 146 days ago

I am talking about coded values, like Status = 'A', 'B' or 'C'

Taking double the space for this stuff is a waste of resources and nobody usually cares about extended characters here in English language systems at least they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches which can have unicode.

link

wvenable 146 days ago

It's way better to just use a DBMS that supports enums. I know SQL server isn't one of those but I still don't store my coded values as strings.

link

SigmundA 146 days ago

How do you store them? Also enums are not user configurable normally. It would be a good feature to have them, but they don't work well in many cases.

Typical code tables with code, description and anything else needed for that value which the user can configure in the app.

Sure you can use integers instead of codes, now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also ints are not variable length so your wasting space for short codes and you have to know ahead time if its only going to be 1,2,4 or 8 bytes.

link

wvenable 146 days ago

Enums are for non user-configurable values.

For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.

Ints are by far more the efficient way to store and query these values -- the length of the string is stored as an int and variable length values really complicate storage and access. If you think strings save space or time that is not right.

link

andy81 146 days ago

The way to do enums in SQL (generally, not just MSSQL) is another table. It's better that they don't offer several ways to do the same thing.

link

SigmundA 146 days ago

Mostly agree separate tables can have multiple attributes besides a text description and can be exposed for modification to the application easily so users or administrators can add and modify codes.

A common extra attribute for a coded value is something for deprecation / soft delete, so that it can be marked as no longer valid for future data but existing data can remain with that code, also date ranges its valid for etc, also parent child code relationships.

Enums would be a good feature but they have a much more limited use case for static values you know ahead of time that will have no other attributes and values cannot be removed even if never used or old data migrated to new values.

Common real world codes like US postal state can take advantage of there being agreed upon codes such as 'NY' and 'New York'.

link

sgarland 146 days ago

While I generally would prefer lookup tables, it's much easier to sell dev teams on "it looks and acts like a string - you don't have to change anything."

link

kstrauser 146 days ago

Those are all single byte characters in UTF-8.

link

SigmundA 146 days ago

We are talking nvarchar here, yes UTF-8 solves this issue completely and MSSQL supports it now days with varchar.

link

croes 146 days ago

But nvarchar is UTF-16

link

simonask 146 days ago

No. Look closer.

link

Slothrop99 146 days ago

Just to be pedantic, those characters are in 'ANSI'/CP1252 and would be fine in a varchar on many systems.

Not that I disagree  Win32/C#/Java/etc have 16-bit characters, your entire system is already 'paying the price', so weird to get frugal here.

link

simonask 146 days ago

My comment contains two glyphs that are not in CP1252.

link

NegativeLatency 146 days ago

Also less awkward to make it right the first time, instead of explaining why someone can’t type their name or an emoji

link

SigmundA 146 days ago

Specifically not talking about a name field

link

zabzonk 146 days ago

> Unicode is a requirement everywhere human language is used

Strange then how it was not a requirement for many, many years.

link

Macha 145 days ago

Oh, it was. It was fun being unable to type a euro sign or the name Seán without it being garbled. Neither were matched quotation marks, and arguably computer limitations killed off naïve and café too.

Don’t confuse people groaning and putting up with limitations as justifying those limitations.

link

pjmlp 145 days ago

In Portugal it always was, that is why we got to use eh for é, ah for á, he for è, c, for ç and many other tricks.

Shared by other European languages, like ou for ö in German, kalimera for καλημέρα, and so on all around the world in non-English speaking countries during the early days of computing.

link

procaryote 146 days ago

It was a mess back then though. Unicode fixed that.

link

zabzonk 146 days ago

I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.

link

procaryote 145 days ago

The options were never ASCII or unicode though. Before unicode we had ASCII + lots of different incompatible encodings that relied on metadata to be properly rendered. That's what unicode fixed

Besides I like being able to put things like →, €, ∞ or ∆ into text. With ascii a lot of things that are nowadays trivial would need markup languages

link

simonask 145 days ago

For whom? Certainly not any of the humans trying to use the computer.

link

paulddraper 145 days ago

Or rather, computers had inadequate support.

link

psidebot 146 days ago

Some examples of coded fields that may be known to be ascii: order name, department code, business title, cost center, location id, preferred language, account type…

link