Hacker News new | ask | show | jobs
by mortehu 3725 days ago
Tip for SQL users:

If you give all your ID fields unique names, e.g. by calling your field "reservation_id" instead of "id", even in the reservation table, you can do stuff like:

SELECT * FROM reservation JOIN guest USING (reservation_id);

By doing "USING (reservation_id)" instead of "ON reservation.id = guest.reservation_id", the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT *" will return only one copy.

14 comments

I actually prefer just using id as the primary key, and I like the explicitness of seeing the table/alias before the column in complex queries. I don't care too much about typing it out; reservation.id isn't longer than reservation_id, and the savings of USING vs ON seem minimal. I also don't care about deduplicating that one field, as I'll likely need to consider other duplicate fields in the results, like created_at or description.

But I can see arguments in both directions. I think the big thing is consistency. If you consistently use <table name>_id as the PK and your tooling doesn't fight it, then cool.

I do feel like this benefit breaks down rather quickly though. For example, if you wanted to track who created a guest record, you probably wouldn't name the column user_id as it's ambiguous. So you might use creator_id to signify it points to the user who created the guest, not to the user invited to the reservation – thus negating your ability to join to a user table via USING.

That's not to say it's a bad tip, just because there are cases where it doesn't fit. Definitely worth knowing. Thanks for sharing.

When you have to join on composite PKs, it's easier to appreciate the parent's advice.
Some things that I'd love to see as options in rdbms:

Enforced globally unique column names - natural joins on everything.

Throw an exception if a query requires a table scan (bonus points if it printed the statement that would create the appropriate index).

Translate all update and delete statements into inserts (enforced immutable data + versioning).

For queries returning results from a majority of rows in the table, won't a sequential table scan be faster than an index scan?
For non-clustered indices, yes.
and if the order is not important. Reordering large numbers of rows can be a very expensive operation.
You can force the last one but just dropping the delete and update permissions from your role.
Yeah, I hear that. I also tend to always use surrogate keys. It can have performance implications, but that's typically not my biggest concern.

EDIT: I guess that doesn't really address your point of needing to join on a composite natural key though.

Do you know SSSKA? Society to Stop Surrogate Key Abuse: http://www.slideshare.net/PGExperts/keyvil-lightning-talk?qi... ;)
I didn't know about it, but I don't find those slides compelling, not at all. For example, their response to the "keys shouldn't change" is "what is CASCADE for then?". That seems like a naive view of data.

It ignores things like integrating with other systems. If I export my data to an OLAP database, I can't just cascade that key change through it without manual effort.

It also ignores managing historical data. Maybe I don't want that natural key update cascaded everywhere. In their example of hotel room number, if that public-facing number changed (which does happen), I wouldn't necessarily want that cascaded everywhere. For example, I might need past reservations/invoices from before the change to keep the old room number, to match all correspondence with the guest, etc. Okay, so maybe you don't automatically cascade everywhere, but then you need some way to link that old room number and the new room number in all of your reporting.

Bleh. That's a lot of headache that can be greatly minimized with a surrogate key, for very little drawback in my experience. Sure, I can imagine scenarios where the performance impact or additional storage could actually be a real negative, but for your average CRUD app, I think surrogate keys add way, way more value than cost.

Sure, there's CASCADE, but as you pointed out it has several drawbacks.

That's why ideally natural keys should be immutable. That has many advantages over surrogate keys.

There's a follow up post (posts) to this presentation: http://www.databasesoup.com/2015/03/primary-keyvil-reprised.... where many problems caused by the abuse of surrogate keys are illustrated.

But the main point is simple: surrogate keys (abuse) is an easy way to data corruption. Not in the sense of durability as in ACID, but in the sense of humans (manually or by bugs in the software) duplicating information that is not enforced to be unique thanks to using non-sense surrogate keys as the uniqueness criteria.

Not really a fan. When your table name is "reservation", you're prefixing "id" with the table name ("reservation_id"), but you don't do the same for the other columns. I've never liked the inconsistency of having "reservation_id", but other columns like "name" instead of "reservation_name". Especially on longer table names where you wind up with columns like "this_really_long_table_id".

All just to shorten join clauses with a natural join or USING? You already get shortened joins - less so, but still a gain - by using table aliases so as not to repeat the long table names in each query multiple times. IMO it also takes an extra moment to parse the ambiguous "id" columns to resolve them mentally to which id is being referred to.

Personally, for me it makes sense that the id field has a different distintive name for each table because it's a specific kind of id, i.e., customerID is not interchangeable with reservationID. For me, this is not so much about prepending the field with the table name, but about calling things by their name. "id" is just too vague, even if you use it in the context of a specific table.
I believe this is what foreign keys are for, and also, they are usually both ints/uuids/etc so generally they are interchangeable.

I fail to see how

  customer.id = product.id
is any less obvious of a bug than

  customer.customerID = product.productID
With aliases that first expression might be:

  c.id = p.id
That seems like an easy error to make...
> With aliases that first expression might be: c.id = p.id

Which is why I don't use aliases, so that future me doesn't have to wonder what the hell he wrote a few months ago.

Once an application has been alive for more than a few months, I find it often is subject to self-joins and other query approaches which pretty much beg for aliases in the name of legibility.
I once made that argument, we ended up with reservation_name, I regret I ever said anything about it.

Apparently, it made joins more clear, I seriously wonder how often people fucked that up for them to think it was a good idea to prefix every fucking column with its table name.

It's like when I see unit tests for setters and think, gee, setters seem pretty straightforward to me, how often to people fuck them up?

>> we ended up with reservation_name

That's absolutely terrifying. You've just provided my subconscious with new material with which to populate my nightmares.

How did the code using that database operate? Were you using "reservation_name", or did the application revert the naming scheme by mapping the column to "name"?

Either way, FML.

FFS, it's just reservation.reservation_name, it's not like he's parsing HTML with regular expressions or something. Like... big deal.
I'm not sure if this is humor going over my head, or seriousness smacking me in the face.
Could you elaborate on why this is so awful? (genuine question)
I can think of a few reasons:

- The field name implies that the primary key field is a string. This entails numerous issues (eg. How do you generate a new unique ID?) - It is not obvious that the field is the primary key. - A 'name' property on a reservation doesn't make sense. Is it the name of the person making the reservation? How can this be unique? etc.

My use of "name" was a poorly chosen example to tie to the concept of a reservation. I just chose it because "name" is an extremely common and unambiguous column in many tables.
Too much typing / line noise.
Yes, you always have to be really careful when positing an absurd solution to make a point; there's always the risk you'll be taken seriously. :)
I prefer longer, more descriptive table names coupled with shorter columns names. Then use aliases when writing queries.

    SELECT * FROM long_table_name l LEFT JOIN another_table_here a ON l.id = a.rel_id
Also, most tables end up being used to populate objects. It's simpler to reference an object with `Reservation.id` than `Reservation.reservation_id`

Regardless of the design choices made it's more important to be consistent in naming conventions and to document why a particular design/schema was chosen.

I tabbed away from my previous comment for a minute, which made me forget to mention the ORM scheme in applications. You're exactly right.

Nobody ever uses the class "Reservation" with a property named "reservation_id" in their code. They will name their column "reservation_id" in the database, but then work to undo that at the application level by telling the ORM to map "reservation_id" to an object property named "id". It creates the situation where developers fight against the database schema in their code, and adds an unnecessary level of complexity when comparing code to database. "Why is it 'id' in code but 'reservation_id' in the database?".

A common counterargument from pure DBAs, who are not themselves writing code that integrates with the schemas they create, is that databases should be approached as a completely standalone component that should not be designed with applications' structures in mind. That applications are free to bastardize the database schema in any way they please, so it really doesn't matter anyway. I personally prefer to see it as a reality that applications will be the primary consumer of the database resource, and deserve to have their intended integrations analyzed. Once the database is finalized, it is the application developers who will be studying the schema on a weekly basis. Anything to ease that constant inspection process, the better.

Of course, this is all personal preference. I have learned over the years to finally stop arguing so much over these little details. In real life with coworkers, anyway. It's still fun to battle for your point of view with forum strangers! :D

The only problem I have with this is that when you have a lot of joins and your query is long it gets confusing, as in what table was 'l', what table was 'a' and then you have to scroll up and look it up again, whereas in the parent comment its easy as it makes it explicitly clear and the table name does not matter when looking at joins.
It's long bothered me that SQL doesn't have a way of saying "join these two tables based on the key relationships I've already defined in my schema". (Or maybe some variants do?)

The database already knows that guest.reservation_id is a foreign key referencing reservation.id. Why should you have to repeat yourself?

I have longed for that, too, but I'm not sure I should want it.

Problem is that it becomes problematic as soon as you keep such queries around (in stored procedures, source code, or batch files), and then modify your schema by adding or removing a foreign key.

What if you have more then one foreign key referencing the same table? Or worse, what if you had one and later add another? You would get undefined behavior.

Added:

Also SQL language itself has nothing to do with keys, relationships and other details of database design. It's different level of abstraction.

Various query builders/SQL generators do what you want.

And when there are multiple foreign keys between two tables?
Genuinely curious, what's the use-case for this?
Some examples:

In project management, tasks might have an assigner_id and assignee_id

In transaction management, transactions might have a sender_id and receiver_id

In sports, matches might have a team1_id and team2_id

> In transaction management, transactions might have a sender_id and receiver_id

in normal form, shouldn't you have "transactions", "transaction_senders", and "transaction_receivers" anyway? similarly for the other cases?

i know sometimes the simplicity of a single table is a valid choice, so your point still stands, but it does feel like the examples are cases of "well, if you're were doing the right way(TM), that wouldn't happen"

shouldn't you have "transactions", "transaction_senders", and "transaction_receivers"

were doing the right way(TM), that wouldn't happen

Having a separate transaction_senders and transaction_receivers wouldn't be the right way(TM) unless it's possible to have multiple senders/receivers. If a transaction can only have 1 sender/receiver then normalization is complete when the sender/receiver data is removed from 'transactions' table and replaced with the sender_id and receiver_id columns.

When introducing transaction_senders/transaction_receivers tables without further constraints, it would be immediately possible to have multiple transaction senders and multiple transaction receivers for a single transaction, which is likely an error.

These examples depend on the use case. If you are guaranteed to only have one sender and one receiver, then having them in the same table is already normal form.

Perhaps (?) a less controversial example is a table of flights between airports. Should a single flight have departure_ids and destination_ids? (assuming we treat legs as separate flights)

Or what about package delivery? Do we need sender_ids and receiver_ids there?

created_by and updated_by both pointing at the user table?
Explicit foreign keys perform better in some joins, as the optimiser can assume that all possible values are in the source column of that foreign key.
> join these two tables based on the key relationships I've already defined in my schema

This is almost exactly what "NATURAL JOIN" does. It is like "JOIN USING" with a pre-filled column list.

However, this isn't decided based on the schema, but based on equal column names. So you need to apply the dicipline as noted by the parent comment.

See also the sibling discussion thread at: https://news.ycombinator.com/item?id=11450187

> the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT " will return only one copy.

You still have in outer joins. Also, SELECT * is a bad practice (with some exceptions) when used in code, suitable only for ad-hoc queries.

Agreed. Or at least put a table name in front of the asterisk:

  SELECT
      table1.*,
      table2.field
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id
table.* doesn't really do any better. Database evolves, table's structure changes and you get various bugs and unpredictable behavior. From elusive exceptions to memory hogs to network bottlenecks, depending on DBMS, data itself and data access framework you use.
It does better when, for example, you have a field called 'id' in both tables.

  SELECT
      *
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id
will use one of the 'id' field from table1 or table2 (depending on the join type).

  SELECT
      table1.*
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id
will use the 'id' field of table1 and does not include data from table2.

But I agree with your other arguments.

Or if you're feeling lucky:

SELECT * FROM reservation NATURAL INNER JOIN guest;

Err, be careful with natural joins: suppose both tables have a column called "name" or "created_at" -- natural join will create join conditions from those.
I never understood why the thing named "natural join" don't use relationships to determine the joining columns. Does not look natural for me.
Indeed, this is also what always annoyed me.

I suspect that this has to do with the fact that NATURAL JOIN is intended to work for sub-SELECTs and VIEWs, too. On those, it is quite hard and error-prone to determine "foreign key" equivalents.

Also note that restricting NATURAL JOINs to tables wouldn't be a good solution here, either, because that would prevent you from ever JOINing more than 2 tables: When JOINing 3 tables, you effectively do have some kind of sub SELECT after having joined the first two tables, which is then supposed to be joined with the third one.

You get the same sort of problems jessedhillon was talking about.

That is, you lose absolutely nothing - just changes its semantics.

That's why he said, "If you're feeling lucky".
Which in the world of databases is a synonym for "useless". Typical boutique SQL query strings can be anything from a few to a few dozen lines and the compiled form is typically cached by the SQL engine (if you're using parameterised variables, which you are, because not doing so is stupid and dangerous).

So why not be explicit?

I don't like using select * as it will cause the shape of the result set to change upon schema modifications. We use a DB access layer that "knows" the column indices of result sets at compile time and our zero-downtime process relies on that not changing (for the time period where old code is running against new schema).

Imagine two two-column tables, A with id and name, and B with id and value. (Doesn't matter what the columns are; this is just for concreteness.)

select * from A JOIN B using (id) will give you a three column result set, id, name, value.

Now, suppose you want to do a schema modification to add "last _login" to A. Suddenly, your query is returning a 4 column result set, id, name, last_login, value and code using ordinal positions is now reading last_login into the "value" slot.

Maybe it's a corner-case for our use case and zero-downtime release processes, but it's something to be aware of anyway.

Most drivers or interfaces to the db let you use the column's name instead of the ordinal positions, precisely because that decouples the query's result from the things you care about, might be worth checking ^^
I'm curious: what are the advantages of having your access layer rely on column indices rather than column names?
At coding-time (pre-compilation), we create classes representing each stored procedure that we've marked for code generation. That gives parameter list (inc auto-completion in the IDE), column names, types/type safety on inputs and outputs, etc, from the parameter list and result set shape and it was [slightly] easier to write the generic code that binds columns in the result set by column index, since we "knew" the column indices wouldn't change in the result set.

(We had a prior system that relied on column order and so had already banned select * and only allowed additions to the result set on the right side before we even got to the point of doing codegen or zero-downtime releases. This was 2003 that we created the codegen wrapper and 2005 or so when we switched to zero downtime releases.)

I guess "because we're lazy" is the somewhat overly glib answer. ;-)

I don't like this because it leads to the misleading assumption that it doesn't matter which value you refer (left or right), since they are equal. But once you consider collations is easy to see the fallacy:

   left.id = 'Something' COLLATE case insensitive
   right.id = 'SOMETHING' COLLATE case insensitive

   SELECT id FROM left JOIN right USING (id);
Will the value be 'Something' or 'SOMETHING'? Better avoid the surprise and be specific...
I like using id as key, if you want to "decouple" info, i just use reservation.id as reservation_id

This way, my db fetch is always on id and not a longer name ( reservation_id), which on the end of an webapp, makes me save some reading / writing :)

And if you're using a schema you did not create, don't forget to use shorthands! SELECT * FROM reservation r JOIN guest g ON (r.id = g.reservation_id), queries can get messy if you don't use them! :)
Another option:

    SELECT * FROM reservation r JOIN guest USING (r.id);
I really hope you're using SELECT * as an example and that you don't really do that.
In which dialect?
It's part of the SQL-92 standard http://flylib.com/books/en/4.343.1.38/1/
But, but, but ActiveRecord doesn't do that by default so it can't be a good idea [sarcasm]