Hacker News new | ask | show | jobs
by kiyoto 4259 days ago
For several reasons, some more legitimate than others.

1. kdb+ was (and maybe is) a good solution to the problem that we had: doing complex data manipulation/simple statistical calculations against billions of rows of time series data. Hadoop is the term du jour for data processing, but truth of the matter is that finance doesn't have really huge data. At best, it's a couple of terabytes, and most of the time, you are working with a small subset of it. Running KDB+ on a beefy server or two would usually do the job (rather well).

2. Maybe because I studied math, but I find k/q's vectorial/functional sematics appealing. I think the syntax is horrible, but the semantics is very neat.

3. Finally, because it helped me keep my job. It was rather amazing to me that all these Ph.D. statisticians that I worked with couldn't bring themselves to learn kdb+ effectively. Apparently this stuff can be very hard for even the smartest people (or maybe they thought it was such a niche skill with a low ROI).

1 comments

It almost sounds like kdb needs an alternative syntax that is more human readable.
It has one: q. But once you get over the syntax, you realize that you also need to grok different semantics that you are used to.

Some q is readable english - e.g., an expression like

   sum price where size>3
is (to the uninitiated) more readable than the equivalent k

   +/price@&size>3
but that only works for simple stuff. The (idiomatic!) computation of maximum-subarray-sum[0]

   |/0(0|+)\
becomes

   max over 0 (0 max +) scan
which is not more readable. And you can drive the point ad absurdum by making it even more verbose:

   max over zero (zero max plus) scan
The syntax seems like it is what stops you from understanding it because it is the first thing you meet. But it's the semantics that you need to grok, and the syntax just matches them.

[0] http://en.wikipedia.org/wiki/Maximum_subarray_problem

I would find something like

over(max, scan(0, max(+, 0))

slightly more readable, even if that still gives me a higher-order-headache.

Because, y'know, it's confusing to have symbol salad with a weird mixture of infix and postfix operators, even more so if you have type raising (or the moral equivalent thereof) thrown in for good measure.

The q in your latter example is the same complexity in my book as a dense Python list comprehension. Could be worse.
The equivalent python is:

    mss = lambda x: max(scan(lambda a,b: max(0,a+b),0,x))
assuming a definition "scan" (which is like "reduce", except it gives you all intermediate values), an example of which is:

    def scan(f,x0,x):
      r = [x0]
      for x1 in x:
        x0 = f(x0, x1)
        r.append(x0)
      return r
Note that the K is idiomatic whereas the python is (arguably) not. Of course, it could be worse; the advantage is that, much like math, the 9-char K version is pattern-matched by your eyes once you are familiar with it, whereas no other version presented here (or in almost any other language) can utilize that feature of your brain.
Oh goodness... don't do scan like that. Way easier (and more efficient) to use a generator:

    def scan(f, iterator, initial=0):
        yield initial
        yield from scan(f, iterator, f(initial, next(iterator)))
Even with python2, you could make a non-recursive version that'd be still shorter than your scan and faster. Alternatively, you could pair a coroutine with that reduce function....

But always be suspect of your code if you are iterating and appending to a list. Likely there is a much better way.

First, it is not equivalent - next() cannot apply to range() output, for example - you will need to do some iter() games and watch out for iteration order side effects if your values are iterators vs. lists.

Second, it is ~10% faster, but that speed difference disappears completely if you eliminate the namespace lookup (that is, add e.g. "o = r.append" before the loop, and call o() instead of r.append() inside the loop). It potentially uses less memory - but not the way you did it (unless Python 3 gained TCO when I wasn't looking. Did it?) - your formulation does not load the call stack, but it does create len(iterator) generators that - until the innermost StopIteration - all need to live somewhere on the heap. recursive solutions without TCO are rarely good enough to replace iteration.

Even if you did it right, it's more efficient, but not significantly so timewise, and slightly easier to use iterators in general, yes. It is mostly space-efficient in general.

I think it is more idiomatic, though - and also Python2 compatible - to just replace references to 'r' with yield in my code, than using the recursive definition you gave above - which is more idiomatic in functional languages, but less in Python (and harder to debug in any language than the iterative version)