Hacker News new | ask | show | jobs
by MattRogish 4820 days ago
I really like SQL. Sure, the language has warts but the ability to concisely represent WHAT you want, not HOW you want it, makes it very readable once you understand the simple constructs and how to properly design tables and indexes (not very hard).

For example, consider the problem of finding the second largest value in a set.

In SQL, I'd do something like:

  SELECT MAX( col )
    FROM table
   WHERE col < ( SELECT MAX( col )
                   FROM table )
  
It's pretty readable, and can almost be read in plain english: "Get the next biggest value from the table where it's smaller than the biggest value."

How might you do this in Java? http://stackoverflow.com/questions/2615712/finding-the-secon...

But look at all the other ways you can do it in that thread. None of them are very readable. And, they can hide subtle bugs that you won't find just by reviewing the code.

Ruby has a pretty concise example if you happen to know the trick, and that the sort performance isn't miserable (kind of a gotcha question): http://stackoverflow.com/questions/8544429/find-second-large...

This is a very simple example, but as you scale up to more complex problems I almost always find SQL is fewer lines of code, more readable, and far less buggy (SQL tends to either work or not work - I find much more subtle bugs in a 30 line Java algo than a 10 line SQL command).

4 comments

FWIW here's a very similar Python version:

    max(col for col in table if col < max(table))
a fun one is this:

    first, second = max(permutations(table), 2)
although its semantics are slightly different (if the maximum of the table is duplicated, it'll be returned for both slots)

or using heapq which notaddicted mentioned.

I think the similitude is deceptive. Since declarative programming is side-effect free, the runtime can perform the operation in the most efficient way it knows. Python isn't, so it has to ensure some execution order that breaks possible optimizations.

In your example, if you replace max() with a function that prints something, you'll see it's executed for each value in 'table', which is extremely inefficient. This happens because Python can't guarantee that max() will return the same each time.

Similarly, while in a side-effect free context the runtime could, for example, slice 'table', perform the work using multiple concurrent threads and then join the result, Python has to guarantee that the execution is done sequentially, since a change in order could affect the result.

Unlike in SQL, in Python you're always telling it HOW you want it done.

> I think the similitude is deceptive.

It's not.

> Since declarative programming is side-effect free, the runtime can perform the operation in the most efficient way it knows.

The point of declarative programming is to declare what you want done. The runtime may turn it into greatness or crap, that's not really relevant.

> In your example, if you replace max() with a function that prints something, you'll see it's executed for each value in 'table', which is extremely inefficient.

It's also not relevant and an implementation detail, with a known max() on a known type the implementation would be free to lift the computation out since this expression doesn't have side-effects. Just as a crappy SQL DB would be free to run the subquery once for each result.

> This happens because Python can't guarantee that max() will return the same each time.

The runtime can know that all types and functions involved are their native side-effect-less versions and is free to optimize them if it wishes, that it does not is irrelevant.

> Unlike in SQL, in Python you're always telling it HOW you want it done.

Nope, sorry, you're wrong.

> The point of declarative programming is to declare what you want done. The runtime may turn it into greatness or crap, that's not really relevant.

I didn't say the quality of the runtime was relevant. The point is that the runtime is free to achieve the goal you give it in any way it feels fit. In Python, the runtime isn't; it needs to ensure specific runtime guarantees.

>> In your example, if you replace max() with a function that prints something, you'll see it's executed for each value in 'table', which is extremely inefficient.

> It's also not relevant and an implementation detail, with a known max() on a known type the implementation would be free to lift the computation out since this expression doesn't have side-effects.

No, it can't. The Python reference specifies the semantics of generator expressions, and it says "Only the outermost for-expression is evaluated immediately, the other expressions are deferred until the generator is run"[1].

A implementation that lifted the computation would not be implementing Python, but a derivative language.

Furthermore, your example didn't specify neither the type of neither 'table' nor 'max', so I still think it's deceptive.

> The runtime can know that all types and functions involved are their native side-effect-less versions and is free to optimize them if it wishes, that it does not is irrelevant.

See above. According to the language spec, it can't.

[1]: http://www.python.org/dev/peps/pep-0289/#the-details

Aside: Python is similar to Ruby, albeit using the sorted() function rather than the sort() method.

    sorted(vals)[-2]
Also available in python:

  import heapq
  n = 2 
  heapq.nlargest(n, vals)[n-1]
Edit: and out of morbid curiosity, here is the same implemented by a scan. It could be arranged as a single statement.

  swapIfGt = lambda xs, q: xs if xs[0] >= q else sorted([q]+xs[1:])
  n = 2
  reduce(swapIfGt, vals, [None]*n)[0]
Edit2: rewrote swapIfGt
The pedant in me squirms at seeing a quadratic-time solution to something so linear. Use a heap or a scan, sir!
Oh, now I have to yell at myself. No doubt Python's sorted is some n*logn quicksort, and not actually in quadratic time. Crow for all!
It's an adaptive mergesort-based sort: http://en.wikipedia.org/wiki/Timsort
Still, I'd rather have a loop and be linear than using a sort and being linearithmic.

    maximum $ filter (< maximum xs) xs
or like the Ruby one, but with better error semantics

    (`atMay` 1) . reverse . sort
since we don't know that there exists such a column.
The important part is that abstracting the implementation from the declaration, the DBMS is free to compute this using whatever index, sorts and memory allocation is has too. So I think the way you do this in other languages doesn't even compare, because it's not really the same thing.

The effect is you end up with a declaration that is highly intelligible, exactly because you don't have to write the implementation.

> The important part is that abstracting the implementation from the declaration, the DBMS is free to compute this using whatever index, sorts and memory allocation is has too. So I think the way you do this in other languages doesn't even compare, because it's not really the same thing.

The ruby method-based form really is the same thing, because the semantics of teh result are constant across different enumerable objects (conventionally, of course, nothing about the language enforces this), but different enumerables may well implement the behavior differently under the hood -- including using internal indexes -- and may, in fact, apply the same type of adaptive techniques used by an SQL-based RDBMS (or, in extreme cases, actually defer all the work to an SQL-based RDBMS where the data presented with an enumerable interface in Ruby actually lives.)

This is less true, perhaps, of some of the other examples in that the implementation will be the same for all objects in the same system (but you still get some of the conceptual clarity advantages of declarative programming, even if the its not leveraged as effectively behind the scenes by the runtime adapting the actual methods used to the data it is working on.)