I am talking about coded values, like Status = 'A', 'B' or 'C'
Taking double the space for this stuff is a waste of resources and nobody usually cares about extended characters here in English language systems at least they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches which can have unicode.
How do you store them? Also enums are not user configurable normally. It would be a good feature to have them, but they don't work well in many cases.
Typical code tables with code, description and anything else needed for that value which the user can configure in the app.
Sure you can use integers instead of codes, now all your results look like 1, 2, 3, 4 for all your coded columns when trying to debug or write ad-hoc stuff. Also ints are not variable length so your wasting space for short codes and you have to know ahead time if its only going to be 1,2,4 or 8 bytes.
For configurable values, obviously you use a table. But those should have an auto-integer primary key and if you need the description, join for it.
Ints are by far more the efficient way to store and query these values -- the length of the string is stored as an int and variable length values really complicate storage and access. If you think strings save space or time that is not right.
In the systems I work with most coded values are user configurable.
>But those should have an auto-integer primary key and if you need the description, join for it.
Not ergonomic now when querying data or debugging things like postal state are 11 instead of 'NY'
select * from addresses where state = 11, no thanks.
Your whole results set becomes a bunch of ints that can be easily transposed causing silly errors. Of course I have seen systems that use guids to avoid collision, boy is that fun, just use varchar or char if your penny pinching and ok with fixed sizes.
>the length of the string is stored as an int
No it's stored as a smallint 2 bytes. So a single character code is 3 bytes rather than a 4 byte int. 2 chars is the same as an int. They do not complicate storage access in any meaningful way.
You could use smallint or tinyint for your primary key and I could use char(2) and char(1) and get readable codes if I wanted to really save space.
> They do not complicate storage access in any meaningful way.
Sure they do, because now your row / index is variable length rather than fixed length. Way more complicated. Even 3 bytes is way more complicated to deal with than 4 bytes.
> select * from addresses where state = 11, no thanks.
I will agree that isn't fun. Is it still the trade off I do make? Absolutely. And it's not really that big of a problem; I just do a join. It also helps prevent people from using codes instead of querying the database for the correct value -- what's the point of user-configuration of someone hard-codes 'NY' in a query or in the code.
Mostly agree separate tables can have multiple attributes besides a text description and can be exposed for modification to the application easily so users or administrators can add and modify codes.
A common extra attribute for a coded value is something for deprecation / soft delete, so that it can be marked as no longer valid for future data but existing data can remain with that code, also date ranges its valid for etc, also parent child code relationships.
Enums would be a good feature but they have a much more limited use case for static values you know ahead of time that will have no other attributes and values cannot be removed even if never used or old data migrated to new values.
Common real world codes like US postal state can take advantage of there being agreed upon codes such as 'NY' and 'New York'.
While I generally would prefer lookup tables, it's much easier to sell dev teams on "it looks and acts like a string - you don't have to change anything."
Oh, it was. It was fun being unable to type a euro sign or the name Seán without it being garbled. Neither were matched quotation marks, and arguably computer limitations killed off naïve and café too.
Don’t confuse people groaning and putting up with limitations as justifying those limitations.
In Portugal it always was, that is why we got to use eh for é, ah for á, he for è, c, for ç and many other tricks.
Shared by other European languages, like ou for ö in German, kalimera for καλημέρα, and so on all around the world in non-English speaking countries during the early days of computing.
I'm not convinced that Unicode fixed anything. I was kind of hoping, way back when, that everyone would adopt ASCII, as a step to a more united world. But things seem to have got more differentiated, and made things much more difficult.
The options were never ASCII or unicode though. Before unicode we had ASCII + lots of different incompatible encodings that relied on metadata to be properly rendered. That's what unicode fixed
Besides I like being able to put things like →, €, ∞ or ∆ into text. With ascii a lot of things that are nowadays trivial would need markup languages
Taking double the space for this stuff is a waste of resources and nobody usually cares about extended characters here in English language systems at least they just want something more readable than integers when querying and debugging the data. End users will see longer descriptions joined from code tables or from app caches which can have unicode.