Hacker News new | ask | show | jobs
by MattRogish 5100 days ago
As long as you can represent your query with predicate logic and not violate set theory (or other relational tenants), it's perfectly "relational".

It's important to note that folks have figured out how to extend Codd's original relational algebra with stuff like aggregation. As the OP mentioned, "Relational" doesn't mean "What Codd wrote in a single paper back in 1969". It has continued to evolve, both with Codd's direct involvement and from successors like Date, Darwin, and Pascal. Codd wasn't an all-seeing, all-knowing data-management demi-god - but his general theory of relational database management and the core tenants are still super awesome. Extensions to it, as long as they don't violate the RM, are just as valid as Codd's original work.

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either). Indeed, relational algebra supports aggregations (http://en.wikipedia.org/wiki/Relational_algebra#Aggregation).

It's important to separate the query language from core RDBMS theory, as the two are orthogonal. Codd suggested Relational Algebra as a reference language but never intended for it to be the only way to communicate with a RDBMS.

See CJ Date's excellent discussion on ORDER(BY): http://books.google.com/books?id=WuZGD5tBfMwC&lpg=PA163&...

EDIT: It even seems folks have figured out how to make "LIMIT" relational operator: http://stackoverflow.com/questions/10229535/relational-algeb...

I have not read the paper, so I cannot discuss the validity of the approach.

1 comments

That means #1 is totally relational (as an aside, you don't need ORDER BY or LIMIT for it either).

I would love to see it. Yes, you can do it in SQL, but I'd say it's not easy at all without ORDER BY and LIMIT or windowing functions and I don't know if you can do it in Tutorial D. For the reference, #1 is:

Show the blog post with the largest number of comments. If more than one exist, pick the latest.

The schema is:

    post(id integer, created timestamp)
    comment(id integer, post_id integer)
See CJ Date's excellent discussion on ORDER(BY)

I read it and the book as well, but I wouldn't call it excellent. What I read there is a reluctant admission of failure to incorporate an important operation to his query model. I see no attempt to analyze why it doesn't work or adapt the query model to make ORDER a regular operation.

It's been a loooong time since I wrote any longhand Relational Algebra, so I'll cheat and use SQL. All of this can be done pretty easily with relational algebra primitives.

And we can make up any operator we want as long as it uses a primitive, so (to save typing I created a view, but you could copy-pasta). I added a "Title" to post because otherwise you could skip it entirely and just use the comment table twice, but where's the fun in that?

<pre> CREATE VIEW counts AS ( select count() AS comment_count, post_id from comment group by post_id )

SELECT id, title, comment_count FROM post p INNER JOIN counts AS c1 ON p.id = c1.post_id WHERE NOT EXISTS( SELECT FROM counts AS c2 WHERE c2.post_id != p.id AND c2.comment_count > c1.comment_count ) </pre>

Thank you. I accept your answer with the note that you ignored my request to return only the latest post when there are more then one posts with the same number of comments, but it's not hard to adapt you query to satisfy this requirement.

However you can't do the same trick if I ask you to return the top 3 posts with the largest number of comments; or, to make the query more realistic, ask you to return the percentage of comments generated by the top 10% popular (by the number of comments) posts. Which is my point: pure relational algebra as advocated by Date et al in Tutorial D is less expressive than SQL, which probably explains the cold reception it got from the industry.

Edit: now that I think about it, you could do it without ORDER BY/LIMIT, but still it's harder than necessary.

I was about to prepare the query without ties and yes it is a lot harder than necessary and, more important IMO, much less readable/intuitive for people than have to maintain the code and that is exactly why different SQL dialects introduced ORDER BY/LIMIT/OFFSET/TOP/RANK etc.

But let me be a little bit picky about this and Date's view on ORDER BY.

The relational model deals just with the algebra/calculus without getting into the details of a language based on the model.

In the book pointed out by Matt, Date explicitly states he's not saying ORDER BY is not useful, just that it doesn't return a relation and thus it's not included in the algebra.

However a language based on the relational model, like Tutorial D, can include such an operator. To be double sure, I checked on The Third Manifesto V2 and a LOAD operator with an ORDER specification is defined in the context of the "special cased" support for arrays.

You can see a couple of paragraphs if you go here http://books.google.it/books?id=X85QAAAAMAAJ&dq=editions... and search for ORDER and LOAD (page 118).

TBH I'm not even sure it's a good idea to introduce arrays for ordering, but, anyway, back to the quota queries: we agree that regular aggregation operators are enough although the query becomes very complex.

In the same book referenced by Matt there's an exercise (7.14) showing how to do a quota query and you can see that even in Tutorial D it's complex.

However, in the solution, Date & Darwen also propose something else: to add a more specific RANK operator which is really just syntactic sugar to simplify this kind of queries. With the important difference, compared to ORDER, that it still returns a relation and not an ordered sequence of tuples.

Unfortunately the whole solution to the excercise is not available through Google Books preview, and the operator is formally defined elsewhere, but you can see how such RANK operator would work here http://books.google.it/books?id=WuZGD5tBfMwC&lpg=PA163&#...

Just a final comment about Date being reluctant to analyze the matter, unfortunately his work is disseminated in a lot of books (and he changed his position on quite several matters throughout the years).

I love the "SQL and Relational Theory" one but, having red all of his books, I would be hesitant to suggest it unless one already knows Date. I think the latest edition (8th) of "An Introduction to Database Systems" is still the best book to start with.

I was about to prepare the query without ties and yes it is a lot harder than necessary and, more important IMO, much less readable/intuitive for people than have to maintain the code and that is exactly why different SQL dialects introduced ORDER BY/LIMIT/OFFSET/TOP/RANK etc.

Very good point.

In the book pointed out by Matt, Date explicitly states he's not saying ORDER BY is not useful, just that it doesn't return a relation and thus it's not included in the algebra.

My biggest gripe about ORDER BY, LIMIT and relational model is the fact that while Date and others made some attempts to express these operations in terms of relational algebra, they never (AFAIK) tried to do the opposite: alter the relational query model to naturally support them. It's not hard: just replace sets with sequences or arrays. It will gives you natural ORDER and SLICE operators as well as new aggregates FIRST, LAST, NTH. It solves duplicates without having to introduce bags, gives windowing functions for free and probably better represents how modern RDBMS interpret a query. Another hint why sequences may work better than sets is the fact that regular set operations such as INTERSECT and UNION (as opposed to UNION ALL, which becomes concatenation) are so rarely used in real-world queries.

I'm not even arguing that this is a good approach, but I think it deserves some discussion and it appears they never even thought of a possibility of changing the model treating it not as an instrument, but as a sacred scripture.

First I don't know why order by, limit, offset, or windowing functions, can't be said to return a relation if we define relations in a way which is sufficiently useful to include these operations. In other words, they are used in ways which returns sets of tuples (or sets of entities if you want to see it that way), based on specific selection criteria.

I would thus agree that to the extent that these are not part of the relational model this says more about the incompleteness of that model than it does about the operations themselves.

(Someone figured out LIMIT for relational algebra, and ORDER can be coerced to relational if you define a bunch of edge cases. I don't think it's fair to suggest that a RDBMS can't/shouldn't/wouldn't do LIMIT or ORDER).

I agree, relational algebra sucks to code in. That wasn't the point of it, of course. Codd's goal was to prove as long as your language is reducible to relational algebra, you're relational. And with that, you get all the side benefits.

I'm not so sure Date et al are advocating coding purely in RA. At least, I've never heard them say that. Their "Tutorial D", as the name implies, is for educational and specification purposes more than a useable implementation. Again, if you can make your awesome language reducible to "Tutorial D", you get all the nice benefits of the True RDBMS, which is pretty awesome.