| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mortehu 3725 days ago

Tip for SQL users:

If you give all your ID fields unique names, e.g. by calling your field "reservation_id" instead of "id", even in the reservation table, you can do stuff like:

SELECT * FROM reservation JOIN guest USING (reservation_id);

By doing "USING (reservation_id)" instead of "ON reservation.id = guest.reservation_id", the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT *" will return only one copy.

14 comments

Tenhundfeld 3725 days ago

I actually prefer just using id as the primary key, and I like the explicitness of seeing the table/alias before the column in complex queries. I don't care too much about typing it out; reservation.id isn't longer than reservation_id, and the savings of USING vs ON seem minimal. I also don't care about deduplicating that one field, as I'll likely need to consider other duplicate fields in the results, like created_at or description.

But I can see arguments in both directions. I think the big thing is consistency. If you consistently use <table name>_id as the PK and your tooling doesn't fight it, then cool.

I do feel like this benefit breaks down rather quickly though. For example, if you wanted to track who created a guest record, you probably wouldn't name the column user_id as it's ambiguous. So you might use creator_id to signify it points to the user who created the guest, not to the user invited to the reservation – thus negating your ability to join to a user table via USING.

That's not to say it's a bad tip, just because there are cases where it doesn't fit. Definitely worth knowing. Thanks for sharing.

jessedhillon 3725 days ago

When you have to join on composite PKs, it's easier to appreciate the parent's advice.

erpellan 3725 days ago

Some things that I'd love to see as options in rdbms:

Enforced globally unique column names - natural joins on everything.

Throw an exception if a query requires a table scan (bonus points if it printed the statement that would create the appropriate index).

Translate all update and delete statements into inserts (enforced immutable data + versioning).

bmdavi3 3725 days ago

For queries returning results from a majority of rows in the table, won't a sequential table scan be faster than an index scan?

jpitz 3725 days ago

For non-clustered indices, yes.

jeremysmyth 3722 days ago

and if the order is not important. Reordering large numbers of rows can be a very expensive operation.

moron4hire 3723 days ago

You can force the last one but just dropping the delete and update permissions from your role.

Tenhundfeld 3725 days ago

Yeah, I hear that. I also tend to always use surrogate keys. It can have performance implications, but that's typically not my biggest concern.

EDIT: I guess that doesn't really address your point of needing to join on a composite natural key though.

ahachete 3725 days ago

Do you know SSSKA? Society to Stop Surrogate Key Abuse: http://www.slideshare.net/PGExperts/keyvil-lightning-talk?qi... ;)

Tenhundfeld 3725 days ago

I didn't know about it, but I don't find those slides compelling, not at all. For example, their response to the "keys shouldn't change" is "what is CASCADE for then?". That seems like a naive view of data.

It ignores things like integrating with other systems. If I export my data to an OLAP database, I can't just cascade that key change through it without manual effort.

It also ignores managing historical data. Maybe I don't want that natural key update cascaded everywhere. In their example of hotel room number, if that public-facing number changed (which does happen), I wouldn't necessarily want that cascaded everywhere. For example, I might need past reservations/invoices from before the change to keep the old room number, to match all correspondence with the guest, etc. Okay, so maybe you don't automatically cascade everywhere, but then you need some way to link that old room number and the new room number in all of your reporting.

Bleh. That's a lot of headache that can be greatly minimized with a surrogate key, for very little drawback in my experience. Sure, I can imagine scenarios where the performance impact or additional storage could actually be a real negative, but for your average CRUD app, I think surrogate keys add way, way more value than cost.

ahachete 3725 days ago

Sure, there's CASCADE, but as you pointed out it has several drawbacks.

That's why ideally natural keys should be immutable. That has many advantages over surrogate keys.

There's a follow up post (posts) to this presentation: http://www.databasesoup.com/2015/03/primary-keyvil-reprised.... where many problems caused by the abuse of surrogate keys are illustrated.

But the main point is simple: surrogate keys (abuse) is an easy way to data corruption. Not in the sense of durability as in ACID, but in the sense of humans (manually or by bugs in the software) duplicating information that is not enforced to be unique thanks to using non-sense surrogate keys as the uniqueness criteria.

developer2 3725 days ago

Not really a fan. When your table name is "reservation", you're prefixing "id" with the table name ("reservation_id"), but you don't do the same for the other columns. I've never liked the inconsistency of having "reservation_id", but other columns like "name" instead of "reservation_name". Especially on longer table names where you wind up with columns like "this_really_long_table_id".

All just to shorten join clauses with a natural join or USING? You already get shortened joins - less so, but still a gain - by using table aliases so as not to repeat the long table names in each query multiple times. IMO it also takes an extra moment to parse the ambiguous "id" columns to resolve them mentally to which id is being referred to.

Mahn 3725 days ago

Personally, for me it makes sense that the id field has a different distintive name for each table because it's a specific kind of id, i.e., customerID is not interchangeable with reservationID. For me, this is not so much about prepending the field with the table name, but about calling things by their name. "id" is just too vague, even if you use it in the context of a specific table.

fleitz 3725 days ago

I believe this is what foreign keys are for, and also, they are usually both ints/uuids/etc so generally they are interchangeable.

I fail to see how

  customer.id = product.id

is any less obvious of a bug than

  customer.customerID = product.productID

jessaustin 3725 days ago

With aliases that first expression might be:

  c.id = p.id

That seems like an easy error to make...

liotier 3725 days ago

> With aliases that first expression might be: c.id = p.id

Which is why I don't use aliases, so that future me doesn't have to wonder what the hell he wrote a few months ago.

rlucas 3724 days ago

Once an application has been alive for more than a few months, I find it often is subject to self-joins and other query approaches which pretty much beg for aliases in the name of legibility.

fleitz 3725 days ago

I once made that argument, we ended up with reservation_name, I regret I ever said anything about it.

Apparently, it made joins more clear, I seriously wonder how often people fucked that up for them to think it was a good idea to prefix every fucking column with its table name.

It's like when I see unit tests for setters and think, gee, setters seem pretty straightforward to me, how often to people fuck them up?

developer2 3725 days ago

>> we ended up with reservation_name

That's absolutely terrifying. You've just provided my subconscious with new material with which to populate my nightmares.

How did the code using that database operate? Were you using "reservation_name", or did the application revert the naming scheme by mapping the column to "name"?

Either way, FML.

PeCaN 3725 days ago

FFS, it's just reservation.reservation_name, it's not like he's parsing HTML with regular expressions or something. Like... big deal.

kbenson 3725 days ago

I'm not sure if this is humor going over my head, or seriousness smacking me in the face.

davvolun 3721 days ago

http://stackoverflow.com/questions/1732348/regex-match-open-...

vmasto 3725 days ago

Could you elaborate on why this is so awful? (genuine question)

bottled_poe 3725 days ago

I can think of a few reasons:

- The field name implies that the primary key field is a string. This entails numerous issues (eg. How do you generate a new unique ID?) - It is not obvious that the field is the primary key. - A 'name' property on a reservation doesn't make sense. Is it the name of the person making the reservation? How can this be unique? etc.

developer2 3725 days ago

My use of "name" was a poorly chosen example to tie to the concept of a reservation. I just chose it because "name" is an extremely common and unambiguous column in many tables.

fleitz 3725 days ago

Too much typing / line noise.

kbenson 3725 days ago

Yes, you always have to be really careful when positing an absurd solution to make a point; there's always the risk you'll be taken seriously. :)

leepowers 3725 days ago

I prefer longer, more descriptive table names coupled with shorter columns names. Then use aliases when writing queries.

    SELECT * FROM long_table_name l LEFT JOIN another_table_here a ON l.id = a.rel_id

Also, most tables end up being used to populate objects. It's simpler to reference an object with `Reservation.id` than `Reservation.reservation_id`

Regardless of the design choices made it's more important to be consistent in naming conventions and to document why a particular design/schema was chosen.

developer2 3725 days ago

I tabbed away from my previous comment for a minute, which made me forget to mention the ORM scheme in applications. You're exactly right.

Nobody ever uses the class "Reservation" with a property named "reservation_id" in their code. They will name their column "reservation_id" in the database, but then work to undo that at the application level by telling the ORM to map "reservation_id" to an object property named "id". It creates the situation where developers fight against the database schema in their code, and adds an unnecessary level of complexity when comparing code to database. "Why is it 'id' in code but 'reservation_id' in the database?".

A common counterargument from pure DBAs, who are not themselves writing code that integrates with the schemas they create, is that databases should be approached as a completely standalone component that should not be designed with applications' structures in mind. That applications are free to bastardize the database schema in any way they please, so it really doesn't matter anyway. I personally prefer to see it as a reality that applications will be the primary consumer of the database resource, and deserve to have their intended integrations analyzed. Once the database is finalized, it is the application developers who will be studying the schema on a weekly basis. Anything to ease that constant inspection process, the better.

Of course, this is all personal preference. I have learned over the years to finally stop arguing so much over these little details. In real life with coworkers, anyway. It's still fun to battle for your point of view with forum strangers! :D

whitenoice 3725 days ago

The only problem I have with this is that when you have a lot of joins and your query is long it gets confusing, as in what table was 'l', what table was 'a' and then you have to scroll up and look it up again, whereas in the parent comment its easy as it makes it explicitly clear and the table name does not matter when looking at joins.

daxelrod 3725 days ago

It's long bothered me that SQL doesn't have a way of saying "join these two tables based on the key relationships I've already defined in my schema". (Or maybe some variants do?)

The database already knows that guest.reservation_id is a foreign key referencing reservation.id. Why should you have to repeat yourself?

Someone 3725 days ago

I have longed for that, too, but I'm not sure I should want it.

Problem is that it becomes problematic as soon as you keep such queries around (in stored procedures, source code, or batch files), and then modify your schema by adding or removing a foreign key.

ComodoHacker 3725 days ago

What if you have more then one foreign key referencing the same table? Or worse, what if you had one and later add another? You would get undefined behavior.

Added:

Also SQL language itself has nothing to do with keys, relationships and other details of database design. It's different level of abstraction.

Various query builders/SQL generators do what you want.

greggyb 3725 days ago

And when there are multiple foreign keys between two tables?

kbsletten 3725 days ago

Genuinely curious, what's the use-case for this?

level3 3725 days ago

Some examples:

In project management, tasks might have an assigner_id and assignee_id

In transaction management, transactions might have a sender_id and receiver_id

In sports, matches might have a team1_id and team2_id

jonahx 3725 days ago

> In transaction management, transactions might have a sender_id and receiver_id

in normal form, shouldn't you have "transactions", "transaction_senders", and "transaction_receivers" anyway? similarly for the other cases?

i know sometimes the simplicity of a single table is a valid choice, so your point still stands, but it does feel like the examples are cases of "well, if you're were doing the right way(TM), that wouldn't happen"

dsp1234 3724 days ago

shouldn't you have "transactions", "transaction_senders", and "transaction_receivers"

were doing the right way(TM), that wouldn't happen

Having a separate transaction_senders and transaction_receivers wouldn't be the right way(TM) unless it's possible to have multiple senders/receivers. If a transaction can only have 1 sender/receiver then normalization is complete when the sender/receiver data is removed from 'transactions' table and replaced with the sender_id and receiver_id columns.

When introducing transaction_senders/transaction_receivers tables without further constraints, it would be immediately possible to have multiple transaction senders and multiple transaction receivers for a single transaction, which is likely an error.

level3 3724 days ago

These examples depend on the use case. If you are guaranteed to only have one sender and one receiver, then having them in the same table is already normal form.

Perhaps (?) a less controversial example is a table of flights between airports. Should a single flight have departure_ids and destination_ids? (assuming we treat legs as separate flights)

Or what about package delivery? Do we need sender_ids and receiver_ids there?

cname 3725 days ago

created_by and updated_by both pointing at the user table?

slagfart 3725 days ago

Explicit foreign keys perform better in some joins, as the optimiser can assume that all possible values are in the source column of that foreign key.

vog 3725 days ago

> join these two tables based on the key relationships I've already defined in my schema

This is almost exactly what "NATURAL JOIN" does. It is like "JOIN USING" with a pre-filled column list.

However, this isn't decided based on the schema, but based on equal column names. So you need to apply the dicipline as noted by the parent comment.

See also the sibling discussion thread at: https://news.ycombinator.com/item?id=11450187

ComodoHacker 3725 days ago

> the field will be automatically deduplicated, so you don't have to qualify it elsewhere in the query, and "SELECT " will return only one copy.

You still have in outer joins. Also, SELECT * is a bad practice (with some exceptions) when used in code, suitable only for ad-hoc queries.

huuu 3725 days ago

Agreed. Or at least put a table name in front of the asterisk:

  SELECT
      table1.*,
      table2.field
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

ComodoHacker 3725 days ago

table.* doesn't really do any better. Database evolves, table's structure changes and you get various bugs and unpredictable behavior. From elusive exceptions to memory hogs to network bottlenecks, depending on DBMS, data itself and data access framework you use.

huuu 3725 days ago

It does better when, for example, you have a field called 'id' in both tables.

  SELECT
      *
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

will use one of the 'id' field from table1 or table2 (depending on the join type).

  SELECT
      table1.*
  FROM table1
  JOIN table2
      ON table2.rel_id = table1.id

will use the 'id' field of table1 and does not include data from table2.

But I agree with your other arguments.

petepete 3725 days ago

Or if you're feeling lucky:

SELECT * FROM reservation NATURAL INNER JOIN guest;

jessedhillon 3725 days ago

Err, be careful with natural joins: suppose both tables have a column called "name" or "created_at" -- natural join will create join conditions from those.

marcosdumay 3725 days ago

I never understood why the thing named "natural join" don't use relationships to determine the joining columns. Does not look natural for me.

vog 3725 days ago

Indeed, this is also what always annoyed me.

I suspect that this has to do with the fact that NATURAL JOIN is intended to work for sub-SELECTs and VIEWs, too. On those, it is quite hard and error-prone to determine "foreign key" equivalents.

Also note that restricting NATURAL JOINs to tables wouldn't be a good solution here, either, because that would prevent you from ever JOINing more than 2 tables: When JOINing 3 tables, you effectively do have some kind of sub SELECT after having joined the first two tables, which is then supposed to be joined with the third one.

marcosdumay 3725 days ago

You get the same sort of problems jessedhillon was talking about.

That is, you lose absolutely nothing - just changes its semantics.

MichaelBurge 3725 days ago

That's why he said, "If you're feeling lucky".

sjwright 3725 days ago

Which in the world of databases is a synonym for "useless". Typical boutique SQL query strings can be anything from a few to a few dozen lines and the compiled form is typically cached by the SQL engine (if you're using parameterised variables, which you are, because not doing so is stupid and dangerous).

So why not be explicit?

sokoloff 3725 days ago

I don't like using select * as it will cause the shape of the result set to change upon schema modifications. We use a DB access layer that "knows" the column indices of result sets at compile time and our zero-downtime process relies on that not changing (for the time period where old code is running against new schema).

Imagine two two-column tables, A with id and name, and B with id and value. (Doesn't matter what the columns are; this is just for concreteness.)

select * from A JOIN B using (id) will give you a three column result set, id, name, value.

Now, suppose you want to do a schema modification to add "last _login" to A. Suddenly, your query is returning a 4 column result set, id, name, last_login, value and code using ordinal positions is now reading last_login into the "value" slot.

Maybe it's a corner-case for our use case and zero-downtime release processes, but it's something to be aware of anyway.

lccarrasco 3725 days ago

Most drivers or interfaces to the db let you use the column's name instead of the ordinal positions, precisely because that decouples the query's result from the things you care about, might be worth checking ^^

level3 3725 days ago

I'm curious: what are the advantages of having your access layer rely on column indices rather than column names?

sokoloff 3725 days ago

At coding-time (pre-compilation), we create classes representing each stored procedure that we've marked for code generation. That gives parameter list (inc auto-completion in the IDE), column names, types/type safety on inputs and outputs, etc, from the parameter list and result set shape and it was [slightly] easier to write the generic code that binds columns in the result set by column index, since we "knew" the column indices wouldn't change in the result set.

(We had a prior system that relied on column order and so had already banned select * and only allowed additions to the result set on the right side before we even got to the point of doing codegen or zero-downtime releases. This was 2003 that we created the codegen wrapper and 2005 or so when we switched to zero downtime releases.)

I guess "because we're lazy" is the somewhat overly glib answer. ;-)

rusanu 3724 days ago

I don't like this because it leads to the misleading assumption that it doesn't matter which value you refer (left or right), since they are equal. But once you consider collations is easy to see the fallacy:

   left.id = 'Something' COLLATE case insensitive
   right.id = 'SOMETHING' COLLATE case insensitive

   SELECT id FROM left JOIN right USING (id);

Will the value be 'Something' or 'SOMETHING'? Better avoid the surprise and be specific...

NicoJuicy 3725 days ago

I like using id as key, if you want to "decouple" info, i just use reservation.id as reservation_id

This way, my db fetch is always on id and not a longer name ( reservation_id), which on the end of an webapp, makes me save some reading / writing :)

ciroduran 3725 days ago

And if you're using a schema you did not create, don't forget to use shorthands! SELECT * FROM reservation r JOIN guest g ON (r.id = g.reservation_id), queries can get messy if you don't use them! :)

swah 3725 days ago

Another option:

    SELECT * FROM reservation r JOIN guest USING (r.id);

piers 3725 days ago

I really hope you're using SELECT * as an example and that you don't really do that.

voltagex_ 3725 days ago

In which dialect?

ciroduran 3725 days ago

It's part of the SQL-92 standard http://flylib.com/books/en/4.343.1.38/1/

rlonstein 3725 days ago

But, but, but ActiveRecord doesn't do that by default so it can't be a good idea [sarcasm]