Hacker News new | ask | show | jobs
by simonw 2103 days ago
My least favorite part of database design is the bit where you have to pick lengths for your char columns.

Twenty years in and I'm still picking these pretty much by guessing. And when I guess wrong it causes really annoying problems further down the line.

I love how SQLite doesn't make me do this - it just has a TEXT type which is always unlimited in length.

10 comments

I suppose that in every RDBMs that makes the distinction, it's an optimization matter - VARCHAR being stored on-page, while TEXT off-page (although there can be optimizations for short TEXT values); the latter will cause an extra page seek on access.

Some database [versions] may also be unable to apply certain optimizations in certain cases, when TEXT is used (eg. temp tables on MySQL <= 5.7).

That doesn't prevent one from always using TEXT, and possibly, for most of the use cases (surely, if one uses SQLite, that's the case), the performance impact is not meaningful.

For what I understand in the docs, in postgres at least, VARCHAR and TEXT are the same thing, and CHAR actually has the performance hit.
They are implemented the same, but they aren't the same type. Notably, each parameterization of varchar (or char) is its own type. This can cause issues when trying to change the length parameter. For this reason I prefer to use TEXT with a CHECK constraint.
What's wrong with TEXT type for postgres, mysql, etc? In Postgres you don't need to declare a length for varchar either.
Yeah, in fact, Postgres encourages the use of TEXT over VARCHAR. The documentation even states:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column.

https://www.postgresql.org/docs/current/datatype-character.h...

Postgres also has a TEXT type like this.

Edit: I see mpolun left a similar comment here, but it looks like he has been mostly shadowbanned for about 8 years.

> Edit: I see mpolun left a similar comment here, but it looks like he has been mostly shadowbanned for about 8 years.

Yeah same. I don't have enough karma to vouch for their comment but I see no big reason that said comment should be dead. Their account is from 2012 and the vast majority of their few comments seem to be contributing to the conversation. Their first dead comment is also from 2012 but at a glance said comment is the only one that stands out as not contributing much to the conversation. And that's being harsh even – I've probably made less substantial comments in the past myself. Yet like 30% of their 3 pages of comments are dead. And looking at their submissions they have ever only submitted 3 stories, 2 of which appear to be from a domain that they themselves control. Hardly enough to be subjected to having so many of their comments killed I think. Though of course there might be other factors at play, but from what I see on their profile page I see nothing bad enough to warrant this.

I have 1500 karma, but I didn't see an option to vouch for it... I have seen that option on other comments, though. Maybe dang can take a look at their account.
Just for general interest, how can you see their comment or assert they have been shadowbanned?
> how can you see their comment

Go to your profile and ensure you have showdead set to yes.

> assert they have been shadowbanned

See https://news.ycombinator.com/threads?id=mpolun and check out how many of their comments are showing as dead in combination with the contents of said comments. (The step above about setting showdead to yes might be required before you follow said link in order to actually see the dead comments). Almost every single one of the dead comments is contributing to the conversation. This is indicative of a shadow ban. HN users would not be downvoting the vast majority of these comments I think.

There's always a limit. You either define and manage it yourself or it'll be done for you when some part of your system breaks. In the end if you actually need performance and reliability everything will have a bound (if not fixed) size and larger data will be processed as a stream, anyway.
Postgres tip: define columns as TEXT, but with a CHECK constraint that the length is what you expect. This avoids the problem with varchar(...) that views inherit the underlying column types (including the length specifier), and prevent you from changing the maximum length in the table unless you drop and recreate the view. (There are ways around this through system tables but they are wholly unwieldy.)
I really like that - especially since changing CHECK constraints can be done on large tables without having to rebuild the entire table.
Just use TEXT in other databases as well. It really shouldn't matter much in modern dbs
Why don’t you use varchar(max) as the range always. The varchar data type specified that the length of this attribute is variable in each record and the memory allocated depends only on the number of actual characters stored in the column.
That’s a bad idea: that pushes the burden of data validation entirely on your client or application code. Textual column lengths should be used to enforce sanity checks on data.

I’ve worked on more projects than I care for which had nvarchar(max) columns for storing the contents of a small 3-4 line HTML textarea: most users were expected to type in less than 100 words or copy-paste the output of another program.

One day, that other program had a bug that made it generate about a gigabyte of textual output. That program had a “Copy output” button so the user didn’t realise how much data they were copying. I don’t know how it didn’t timeout when it was inserted, but that user brought the system down for everyone because that gigabyte-sized text value was used in lots of places.

>>That’s a bad idea: that pushes the burden of data validation entirely on your client or application code. Textual column lengths should be used to enforce sanity checks on data.

It just doesn't work out that way in practice. For example SQLite, the most popular RDBMS of all time, pisses on strict column type and gives you value type instead with hints and storage classes. For several large systems I architected in SQL Server and Oracle, I gave developers heuristics to follow for column type selection and in some cases strictly re

I compare it to "defence-in-depth" (i.e. data validation at every level).

Ever since RDMBS moved-away from XBase-style table-files database textual/string column length limits are conceptually the same thing as having a CHECK CONSTRAINT on the length of a string (something that SQLite does support!).

e.g. https://stackoverflow.com/questions/8252875/how-to-restrict-...

I think the CHECK CONSTRAINT is explicit and intentional WRT data validation. However it does not make a different type out of the value. I don't really think there is a one-size-fits-all for this problem anyway. The real problem expressed is accepting too much data - a check constraint requires the entire value to be present before validation. So if too much data shows up and there is a weak link in the data processing system, T.U. as they say...Many libraries provide for controlling how much physical data can be sent without resorting to a large data API.

Most character values in a database are not really text entered by a person, it's typically a symbol for some kind of enumeration or key referring to external semantics, e.g. compass direction "N", "NNE", etc. Or street names or postal codes or country names and so on. You can either model this directly or if it makes no difference, accept whatever you get. Actually examining the data can be interesting; I noticed once in a database with 50 billion stored dollar amounts there were only about 50,000 unique values. This led me to assert the dollar amounts were actually symbols in the system.

Sorry, my comment was truncated in my original post and I didn't notice it in time to edit it.

Varchar(max) is fine for some cases but you can't use it for everything because there's some limitations compared to varchar(1000) because the latter is stored in the actual table structure while the former is not.
With postgress it doesn't matter. Chars are just varcars under the hood, so you should always use the latter.
Until some user inserts 100MB in a single column of a single row.
I'd rather guard against that in my input validation application logic than pick the wrong value when I'm designing my table and have to deal with fixing that later on.
Don't the other ones have TEXT, too?
Same with postgres. I never use specific lengths for text on postgres.