Hacker News new | ask | show | jobs
by xi 5100 days ago
ORDER BY, LIMIT/OFFSET have to do with presentation of data. So, although it's highly desired that a language based on the relational model supports them, they have nothing to do with the model per se.

By fit poorly I mean that you cannot express most real-world business inquiries using pure relational primitives without ORDER BY or LIMIT/OFFSET and that's why I think relational algebra is not usable per se. SQL fixed this problem by adding many non-relational constructs, but but without any sense of consistency or direction.

I also strongly disagree that ORDER BY and LIMIT/OFFSET are presentational operations since I often use them not only for wrapping the outer SELECT, but also within correlated subqueries.

To show some proof, here are a few queries which are hard or impossible to express in relational algebra:

1. Show the blog post with the largest number of comments [^].

2. Show the tags associated with the blog post with the largest number of comments.

3. For each blog category, show the 3 top blog posts by the number of comments.

[^] If more than one exist, pick the latest.

NULL is a completely different beast and this is the only real thing one can consider problematic.

I think NULL is only hard because relational model is a wrong way to look at the data. If you see an entity attribute not as a column of a tuple, but as a function from an entity set to some value domain, the fact that the attribute is nullable just means that the function is not total. There is a well developed mathematical apparatus for partial functions, in which NULL becomes a bottom value injected to the value domain, and tri-valued logic is simply a monotonic extension of regular Boolean operators.

2 comments

As long as you can represent your query with predicate logic and not violate set theory (or other relational tenants), it's perfectly "relational".

It's important to note that folks have figured out how to extend Codd's original relational algebra with stuff like aggregation. As the OP mentioned, "Relational" doesn't mean "What Codd wrote in a single paper back in 1969". It has continued to evolve, both with Codd's direct involvement and from successors like Date, Darwin, and Pascal. Codd wasn't an all-seeing, all-knowing data-management demi-god - but his general theory of relational database management and the core tenants are still super awesome. Extensions to it, as long as they don't violate the RM, are just as valid as Codd's original work.

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either). Indeed, relational algebra supports aggregations (http://en.wikipedia.org/wiki/Relational_algebra#Aggregation).

It's important to separate the query language from core RDBMS theory, as the two are orthogonal. Codd suggested Relational Algebra as a reference language but never intended for it to be the only way to communicate with a RDBMS.

See CJ Date's excellent discussion on ORDER(BY): http://books.google.com/books?id=WuZGD5tBfMwC&lpg=PA163&...

EDIT: It even seems folks have figured out how to make "LIMIT" relational operator: http://stackoverflow.com/questions/10229535/relational-algeb...

I have not read the paper, so I cannot discuss the validity of the approach.

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either).

I would love to see it. Yes, you can do it in SQL, but I'd say it's not easy at all without ORDER BY and LIMIT or windowing functions and I don't know if you can do it in Tutorial D. For the reference, #1 is:

Show the blog post with the largest number of comments. If more than one exist, pick the latest.

The schema is:

    post(id integer, created timestamp)
    comment(id integer, post_id integer)
See CJ Date's excellent discussion on ORDER(BY)

I read it and the book as well, but I wouldn't call it excellent. What I read there is a reluctant admission of failure to incorporate an important operation to his query model. I see no attempt to analyze why it doesn't work or adapt the query model to make ORDER a regular operation.

It's been a loooong time since I wrote any longhand Relational Algebra, so I'll cheat and use SQL. All of this can be done pretty easily with relational algebra primitives.

And we can make up any operator we want as long as it uses a primitive, so (to save typing I created a view, but you could copy-pasta). I added a "Title" to post because otherwise you could skip it entirely and just use the comment table twice, but where's the fun in that?

<pre> CREATE VIEW counts AS ( select count() AS comment_count, post_id from comment group by post_id )

SELECT id, title, comment_count FROM post p INNER JOIN counts AS c1 ON p.id = c1.post_id WHERE NOT EXISTS( SELECT FROM counts AS c2 WHERE c2.post_id != p.id AND c2.comment_count > c1.comment_count ) </pre>

Thank you. I accept your answer with the note that you ignored my request to return only the latest post when there are more then one posts with the same number of comments, but it's not hard to adapt you query to satisfy this requirement.

However you can't do the same trick if I ask you to return the top 3 posts with the largest number of comments; or, to make the query more realistic, ask you to return the percentage of comments generated by the top 10% popular (by the number of comments) posts. Which is my point: pure relational algebra as advocated by Date et al in Tutorial D is less expressive than SQL, which probably explains the cold reception it got from the industry.

Edit: now that I think about it, you could do it without ORDER BY/LIMIT, but still it's harder than necessary.

I was about to prepare the query without ties and yes it is a lot harder than necessary and, more important IMO, much less readable/intuitive for people than have to maintain the code and that is exactly why different SQL dialects introduced ORDER BY/LIMIT/OFFSET/TOP/RANK etc.

But let me be a little bit picky about this and Date's view on ORDER BY.

The relational model deals just with the algebra/calculus without getting into the details of a language based on the model.

In the book pointed out by Matt, Date explicitly states he's not saying ORDER BY is not useful, just that it doesn't return a relation and thus it's not included in the algebra.

However a language based on the relational model, like Tutorial D, can include such an operator. To be double sure, I checked on The Third Manifesto V2 and a LOAD operator with an ORDER specification is defined in the context of the "special cased" support for arrays.

You can see a couple of paragraphs if you go here http://books.google.it/books?id=X85QAAAAMAAJ&dq=editions... and search for ORDER and LOAD (page 118).

TBH I'm not even sure it's a good idea to introduce arrays for ordering, but, anyway, back to the quota queries: we agree that regular aggregation operators are enough although the query becomes very complex.

In the same book referenced by Matt there's an exercise (7.14) showing how to do a quota query and you can see that even in Tutorial D it's complex.

However, in the solution, Date & Darwen also propose something else: to add a more specific RANK operator which is really just syntactic sugar to simplify this kind of queries. With the important difference, compared to ORDER, that it still returns a relation and not an ordered sequence of tuples.

Unfortunately the whole solution to the excercise is not available through Google Books preview, and the operator is formally defined elsewhere, but you can see how such RANK operator would work here http://books.google.it/books?id=WuZGD5tBfMwC&lpg=PA163&#...

Just a final comment about Date being reluctant to analyze the matter, unfortunately his work is disseminated in a lot of books (and he changed his position on quite several matters throughout the years).

I love the "SQL and Relational Theory" one but, having red all of his books, I would be hesitant to suggest it unless one already knows Date. I think the latest edition (8th) of "An Introduction to Database Systems" is still the best book to start with.

I was about to prepare the query without ties and yes it is a lot harder than necessary and, more important IMO, much less readable/intuitive for people than have to maintain the code and that is exactly why different SQL dialects introduced ORDER BY/LIMIT/OFFSET/TOP/RANK etc.

Very good point.

In the book pointed out by Matt, Date explicitly states he's not saying ORDER BY is not useful, just that it doesn't return a relation and thus it's not included in the algebra.

My biggest gripe about ORDER BY, LIMIT and relational model is the fact that while Date and others made some attempts to express these operations in terms of relational algebra, they never (AFAIK) tried to do the opposite: alter the relational query model to naturally support them. It's not hard: just replace sets with sequences or arrays. It will gives you natural ORDER and SLICE operators as well as new aggregates FIRST, LAST, NTH. It solves duplicates without having to introduce bags, gives windowing functions for free and probably better represents how modern RDBMS interpret a query. Another hint why sequences may work better than sets is the fact that regular set operations such as INTERSECT and UNION (as opposed to UNION ALL, which becomes concatenation) are so rarely used in real-world queries.

I'm not even arguing that this is a good approach, but I think it deserves some discussion and it appears they never even thought of a possibility of changing the model treating it not as an instrument, but as a sacred scripture.

(Someone figured out LIMIT for relational algebra, and ORDER can be coerced to relational if you define a bunch of edge cases. I don't think it's fair to suggest that a RDBMS can't/shouldn't/wouldn't do LIMIT or ORDER).

I agree, relational algebra sucks to code in. That wasn't the point of it, of course. Codd's goal was to prove as long as your language is reducible to relational algebra, you're relational. And with that, you get all the side benefits.

I'm not so sure Date et al are advocating coding purely in RA. At least, I've never heard them say that. Their "Tutorial D", as the name implies, is for educational and specification purposes more than a useable implementation. Again, if you can make your awesome language reducible to "Tutorial D", you get all the nice benefits of the True RDBMS, which is pretty awesome.

I disagree with you on NULL. The relational model is a very good way of looking at data in many contexts.

The problem actually is that NULL has several distinct meanings and there is often no real way to differentiate between them other than to disallow all meanings but one, and that is often difficult.

For example you talk about nullable columns, and this is one aspect of NULLs. NULLs may mean missing data. They also are often used to mean the data doesn't apply. This already runs you into ambiguity problems because you can't type in a query that easily distinguishes whether the attribute doesn't apply or is merely unknown. Note that Oracle treats NULL strings as equivalent to empty strings, while PostgreSQL tries to differentiate strings by allowing empty strings which are distinct thus allowing a not-applicable value for character string fields.

Additionally, you would expect the || operator to handle unknown data differently than it does data which doesn't apply. string || not_applicable should equal string. string || unknown should equal unknown.

Now that's only the beginning of the problem. There is a third use of NULLs too, namely as a placeholder for missing rows in outer joins......

If we had three different NULL values some of this problem would be more manageable, but the problem is that as soon as you allow nulls in columns, you can't always easily tell from a query on a well-normalized database what that NULL means without a lot of additional introspection of the representative of the entity set and then you are basically guessing.