Hacker News new | ask | show | jobs
by meritt 2979 days ago
I personally prefer person_id to be the primary key name (instead of id) in both the person table and any table which has it as a foreign key. One reason is for join syntax:

    select * from person join team_member using (person_id)
The other reason is person_id now unambiguously refers to the same field regardless if we're looking at the PK or a FK. It's always person_id.
2 comments

I'm with you but in my experience it's a lost fight. Most of the projects I come across these days follow the id convention.

The advantages of using person_id are even more obvious in multiple joins, such as a star schema where you can using(person_id) all the things, reducing both the typing and the cognitive load.

I suspect that if this convention was more pervasive, programmers would be a little bit less afraid of diving into SQL.

Well, at least we're settling with some convention, so it's not all bad.

EDIT: typo

I use T-SQL, and this is actually why I always join with the table name:

SELECT * FROM Person JOIN TeamMember on PersonId = Person.Id

I'd bring it a little further and would write:

SELECT * FROM Person as P INNER JOIN TeamMember as TM on TM.PersonId = P.Id

I have:

- Aliased each table and prefixed every field names with their table alias in my join conditions.

- Explicited the JOIN type.

The above:

- Reduces mistakes due to ambiguities that tend to generate unwanted duplicates rows in SQL.

- Increases the likelihood of getting an error at parse time, instead of run-time or analysis-time, thanks to added scoping.

- Works in any schema, no matter what naming conventions are followed.

- Keeps working as the query becomes more complex with multiples table aliases or self-joins, and similar field names appearing in the set.

- Better expresses intent. Sure JOIN defaults to INNER JOIN, but writing "INNER JOIN" shows that you genuinely expect any row not matching your condition to be removed from the result set.

I'm not a fan of short aliases. They obscure what you are attempting to do.

Personally I write very little SQL anymore (the ORM does that for me unless I need performance), but the only time I use aliasing is when the same table is joined multiple times.

Like:

SELECT Mother.Name, Father.Name, Child.* FROM Person Child JOIN Person Mother ON Mother.Id = Child.MotherId JOIN Person Father ON Father.Id = Child.FatherId

That said, this naming convention also obscures what Person.FatherId points to without looking at the FKeys. So Take that with a grain of salt.

As a newbie, this makes more sense to me:

  SELECT * FROM person, team WHERE person.team_id=team.id
(no unnecessary aliasing, no weird JOIN phrasing)
I can understand not using aliasing in a simple query, but I'd recommend against using these non-ANSI joins (deprecated syntax). What you're writing can be interpreted as either:

SELECT * from person CROSS JOIN Team WHERE person.team_id=team.id

SELECT * from person INNER JOIN Team ON person.team_id=team.id

That they happen produce the same result in a query is practically just luck. Changing the join from INNER to LEFT OUTER is also much easier than managing (+)'s in the WHERE clause, once you're used to it.

This is good advice, IMO. You will have to use left/right joins at some point, and having the join type at the join location is useful, but having the join conditions at the join location is immensely useful.

It's harder to see in your example, likely because you tried to keep it similar to the example presented, but once there's more than a couple joins, having the join conditions close to the join is essential for keeping track of what's going on. e.g.

  SELECT * FROM person, team, role, group, person AS lead
  WHERE person.team_id=team.id
    AND person.role_id=role.id
    AND person.group_id=group.id
    AND group.lead_id=lead.id
Compared to:

  SELECT * from person
    INNER JOIN team ON person.team.id=team.id
    INNER JOIN role ON person.role_id=role.id
    INNER JOIN group ON person.group_id=group.id
    INNER JOIN person AS lead ON group.lead_id=lead.id
And let's be clear, we know these should be left joins, because the chance some person doesn't have a team, role, group, or a group ends up without a lead is high when sampled over time. And simulating left joins with the non-ANSI joins quickly gets unwieldy.
Aliasing a query that short just obfuscates intention. It might be necessary for some larger queries, or joining the same table multiple times, but there is a readability cost you are paying for it.

For example:

JOIN team_member ON team_member.person_id = person.id

It's completely self evident what is being joined without the need to trace back to the table aliases.

Pretty much all of your advice is premature optimization in my eyes. You can/should do those things when they are needed, but there is no reason to automatically write every single SQL query that explicitly.

I prefer to keep ID column names descriptive even if it does lead to repetition like Person.PersonID. That way columns that identify a person always carry the same name and you are never left guessing what a more anonymous "ID" refers to or fall into one of a couple of traps where the parser disambiguate one in a way you were not expecting (though this is also caught by consistently using two+ part names when referring to columns, which I also prefer to do). It is particularly useful if the same entity is joined into a query multiple times with different aliases.

There are cons, of course. This is a matter that divides people and when working with other people's projects you have to ignore your own preference and follow the "local" convention.

Although the alternative allows for a bit of metadata to be encoded in the name. For example, if "id" is always the local item descriptor, you can almost always assume anything ending in "ID" or "_id" is a foreign key, and either the table name or relation is encoded in the prior part of the name.

It's a small thing. You'll likely know enough about the tables to be able to know this information anyways. Then again, knowing which "id" field you want is pretty obvious too. In the end, there are pros and cons to both, and it's mostly preference.

That said, once it leaves the database, I much prefer my records have short id fields, which likely influences by schema design to some degree.

Agreed. For me, "id" means the PK (or at least just some surrogate key), and "*_id" is an FK.
If I'm going to use in-name property marking[1] then I'll use PK for the PK, so instead of Person.idPerson that would be person.pkPerson (and perhaps fkPerson instead of idPerson in child tables).

I used to just use ID as the name for surrogate primary keys, but find being more explicit to be helpful for clarity. It is one of a number of habits I used to have in the name of being concise that I now prefer not to do these days in the name of being descriptive.

[1] which I do for keys themselves and other table supporting objects[2], the PK for a table called Thing is explicitly named pkThing rather than letting SQL Server pick a name [2] for instance an index on Thing covering col1 and col2 is ix_Thing_col1_col2 (unless the index exists for a fairly specific reason that is somehow unclear from what it covers/includes in which case that reason is in the index's name)

An interesting convention. I'm personally of the opinion that since I'm explicitly defining foreign key constraints and relationships in the database anyway, and that's easily viewed by querying the table definition, putting much more in the name itself is both redundant and causes confusion (or at least extra work and annoyance) when used outside the database.

I don't name the primary key field id because I'm denoting a type, I call it id because it holds the identifier for the current record. I don't call the foreign relation identifier person_id because it holds a "person" and is an id, I call it person_id because it holds the identifier for a person record, and that's a valid description for what the field holds. The constraint tells how the relation is defined.

> the PK for a table called Thing is explicitly named pkThing rather than letting SQL Server pick a name [2] for instance an index on Thing covering col1 and col2 is ix_Thing_col1_col2

So, do you call those fields that index works on col1_ix1 and cal2_ix1 or something? If not, why do you denote the primary key with PK, and not indexed columns? Both have explicit definitions in the schema that specify exactly how they are defined, one a primary key, the other an index over multiple columns.

To me this sounds a bit like how Hungarian Notation[1] had it's original purpose lost as it shifted over time to encompass a larger, less well defined set of behavior which didn't always deliver an actual benefits.[2] To me, if the system is enforcing the values, and you can introspect the system, that's good enough.

That said, I did spend a while reading an IDEF1X explanation and HOWTO last night, and there are some interesting points and arguments there, so I'm not entirely set in stone with these views.

1: https://en.wikipedia.org/wiki/Hungarian_notation#Systems_vs....

2: TL;DR "Apps Hungarian" prefixed variable names with info about what they were supposed to contain at a high level, such as rwFoo containing the row of Foo. Systems Hungarian came along later and they used the prefix to encode the type, such that iFoo might mean an integer Foo. The benefit of this in a language such as C or C++ that requires explicitly typing the variable anyways is debatable.

3: http://www.softwaregems.com.au/Documents/Documentary%20Examp...