Hacker News new | ask | show | jobs
by xi 5110 days ago
People conflate "relational" with "SQL", because of the historical accident that SQL is the most popular way to query relational data. Then when SQL isn't a good fit for their problem, they think relational is not a good fit for their problem, which is almost certainly not true.

For most practical purposes, SQL is the only way to query relational data. In the absence of alternatives, it's natural to conflate the notions of SQL databases and relational databases. I agree that SQL is a mess, but I don't think an approach based on pure relational primitives would make it better; in fact, I think SQL is a mess specifically because of lack of expressiveness in the pure relational primitives. NULL, ORDER BY, LIMIT/OFFSET, opaque keys, windowing functions, transitive closures, etc fit poorly into the relational model.

The original motivation for relational databases is to have path-independent access to data. This is a really powerful idea.

I agree with both assertions, but in many applications, path-based access makes the total majority of queries, and SQL or relational model provides little means to distinguish them from other equi-joins or arbitrary join conditions. In my opinion, it would be better to start with a navigational path-based database model, and extend it to allow constructing new paths dynamically.

2 comments

I'm not sure what you mean by fit poorly in the relational model.

ORDER BY, LIMIT/OFFSET have to do with presentation of data. So, although it's highly desired that a language based on the relational model supports them, they have nothing to do with the model per se.

As for opaque keys, I thought the old debate about surrogate/artificial and natural keys was over years ago. The relational model has nothing to say about it, that would be like trying to make a model understand if facts are true in the real world or not. A key is a key.

Moreover, last time I've checked transitive closure operators were defined for the relation model (people should not stop ad Codd papers, Date and Darwen wrote a lot of books, e.g. The Third Manifesto, extending on the original ideas).

NULL is a completely different beast and this is the only real thing one can consider problematic.

In some cases NULL just means the predicate for the relation is different because values for an attribute don't apply, so this is not really a problem, in some other cases we simply don't know the value when we are collecting information, and this is indeed a problem.

Date follows Wittgestein that said we should remain silent about things we can't speak about i.e. we shouldn't collect incomplete information, Codd came up with I-marks, A-marks and n-valued logic, SQL collapsed everything into NULLs and 3-valued logic.

ORDER BY, LIMIT/OFFSET have to do with presentation of data. So, although it's highly desired that a language based on the relational model supports them, they have nothing to do with the model per se.

By fit poorly I mean that you cannot express most real-world business inquiries using pure relational primitives without ORDER BY or LIMIT/OFFSET and that's why I think relational algebra is not usable per se. SQL fixed this problem by adding many non-relational constructs, but but without any sense of consistency or direction.

I also strongly disagree that ORDER BY and LIMIT/OFFSET are presentational operations since I often use them not only for wrapping the outer SELECT, but also within correlated subqueries.

To show some proof, here are a few queries which are hard or impossible to express in relational algebra:

1. Show the blog post with the largest number of comments [^].

2. Show the tags associated with the blog post with the largest number of comments.

3. For each blog category, show the 3 top blog posts by the number of comments.

[^] If more than one exist, pick the latest.

NULL is a completely different beast and this is the only real thing one can consider problematic.

I think NULL is only hard because relational model is a wrong way to look at the data. If you see an entity attribute not as a column of a tuple, but as a function from an entity set to some value domain, the fact that the attribute is nullable just means that the function is not total. There is a well developed mathematical apparatus for partial functions, in which NULL becomes a bottom value injected to the value domain, and tri-valued logic is simply a monotonic extension of regular Boolean operators.

As long as you can represent your query with predicate logic and not violate set theory (or other relational tenants), it's perfectly "relational".

It's important to note that folks have figured out how to extend Codd's original relational algebra with stuff like aggregation. As the OP mentioned, "Relational" doesn't mean "What Codd wrote in a single paper back in 1969". It has continued to evolve, both with Codd's direct involvement and from successors like Date, Darwin, and Pascal. Codd wasn't an all-seeing, all-knowing data-management demi-god - but his general theory of relational database management and the core tenants are still super awesome. Extensions to it, as long as they don't violate the RM, are just as valid as Codd's original work.

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either). Indeed, relational algebra supports aggregations (http://en.wikipedia.org/wiki/Relational_algebra#Aggregation).

It's important to separate the query language from core RDBMS theory, as the two are orthogonal. Codd suggested Relational Algebra as a reference language but never intended for it to be the only way to communicate with a RDBMS.

See CJ Date's excellent discussion on ORDER(BY): http://books.google.com/books?id=WuZGD5tBfMwC&lpg=PA163&...

EDIT: It even seems folks have figured out how to make "LIMIT" relational operator: http://stackoverflow.com/questions/10229535/relational-algeb...

I have not read the paper, so I cannot discuss the validity of the approach.

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either).

I would love to see it. Yes, you can do it in SQL, but I'd say it's not easy at all without ORDER BY and LIMIT or windowing functions and I don't know if you can do it in Tutorial D. For the reference, #1 is:

Show the blog post with the largest number of comments. If more than one exist, pick the latest.

The schema is:

    post(id integer, created timestamp)
    comment(id integer, post_id integer)
See CJ Date's excellent discussion on ORDER(BY)

I read it and the book as well, but I wouldn't call it excellent. What I read there is a reluctant admission of failure to incorporate an important operation to his query model. I see no attempt to analyze why it doesn't work or adapt the query model to make ORDER a regular operation.

It's been a loooong time since I wrote any longhand Relational Algebra, so I'll cheat and use SQL. All of this can be done pretty easily with relational algebra primitives.

And we can make up any operator we want as long as it uses a primitive, so (to save typing I created a view, but you could copy-pasta). I added a "Title" to post because otherwise you could skip it entirely and just use the comment table twice, but where's the fun in that?

<pre> CREATE VIEW counts AS ( select count() AS comment_count, post_id from comment group by post_id )

SELECT id, title, comment_count FROM post p INNER JOIN counts AS c1 ON p.id = c1.post_id WHERE NOT EXISTS( SELECT FROM counts AS c2 WHERE c2.post_id != p.id AND c2.comment_count > c1.comment_count ) </pre>

Thank you. I accept your answer with the note that you ignored my request to return only the latest post when there are more then one posts with the same number of comments, but it's not hard to adapt you query to satisfy this requirement.

However you can't do the same trick if I ask you to return the top 3 posts with the largest number of comments; or, to make the query more realistic, ask you to return the percentage of comments generated by the top 10% popular (by the number of comments) posts. Which is my point: pure relational algebra as advocated by Date et al in Tutorial D is less expressive than SQL, which probably explains the cold reception it got from the industry.

Edit: now that I think about it, you could do it without ORDER BY/LIMIT, but still it's harder than necessary.

I disagree with you on NULL. The relational model is a very good way of looking at data in many contexts.

The problem actually is that NULL has several distinct meanings and there is often no real way to differentiate between them other than to disallow all meanings but one, and that is often difficult.

For example you talk about nullable columns, and this is one aspect of NULLs. NULLs may mean missing data. They also are often used to mean the data doesn't apply. This already runs you into ambiguity problems because you can't type in a query that easily distinguishes whether the attribute doesn't apply or is merely unknown. Note that Oracle treats NULL strings as equivalent to empty strings, while PostgreSQL tries to differentiate strings by allowing empty strings which are distinct thus allowing a not-applicable value for character string fields.

Additionally, you would expect the || operator to handle unknown data differently than it does data which doesn't apply. string || not_applicable should equal string. string || unknown should equal unknown.

Now that's only the beginning of the problem. There is a third use of NULLs too, namely as a placeholder for missing rows in outer joins......

If we had three different NULL values some of this problem would be more manageable, but the problem is that as soon as you allow nulls in columns, you can't always easily tell from a query on a well-normalized database what that NULL means without a lot of additional introspection of the representative of the entity set and then you are basically guessing.

As soon as you have arbitrary one to many relationships you have relational data so there are a lot of add hock systems out there. As to larger scale ones I often use linq to do in memory query's of objects with relationships.

More limited examples often fit the hierarchical database model such as graphics API's that have can handle object -> Vertex relationship, and or recursive relationships between objects. But, games frequently need 1:M or M:M relationships between things like factions, so there are plenty of addhock API's out there that fit the relational data model.

People mean different things when they say relational model, so to clarify, by relational model I mean a model in which data is represented as sets of N-tuples of fixed structure, and queries are constructed using set-based operations such as filtering and Cartesian product.

Also, when I say path-based access, I mean access that follows predefined links between entities (in SQL, provided by FOREIGN KEY constraints). Those are well supported by object model and ORMs, as opposed to arbitrary joins, which aren't.

You don't need a relational model to represent one-to-many relations, in fact, an object model such as provided by many ORMs could represent them perfectly. In your first example, `figure.vertices` could be a list of vertices associated with a figure, and `vertex.figure` is the figure which owns the vertex. Similarly mutual object or list references could represent other singular or plural relationships. Though I agree it requires referential loops and cannot be expressed well in a pure hierarchical model such as in many novel no-sql databases.