by steveBK123 739 days ago
Longtime KDB user here. I think you may have some misunderstanding personally, and some poor engineering at your firm around the tech/data. Timeseries data, particularly market data, is exactly the use case the product excels at.

The wire format is compressed.

KDB scales horizontally (even their competitors' comparison pages state this - https://www.influxdata.com/comparison/kdb-vs-tsdb/)

A few things to consider that might help - you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis. KDB will not excel for this, nor will anything else. KDB shines when you learn to move your code to the data rather than your data to the code.

What does "move the code to the data" mean in practice?

You can do things like use PyKX, which allows you to run your Python & kdb code together, directly on top of the data in the same process.

You should do as much of the filter/aggregation/joins/etc over on the KDB side before pulling the results back. You should also define, generate and use pre-aggregated data where it makes sense for your use case (second / minute / day bars).
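To make the pre-aggregation point concrete, here is a minimal sketch in plain Python rather than q. The tick layout and the `minute_bars` function are hypothetical stand-ins for whatever runs next to the data; the only point is that the aggregate is far smaller than the raw ticks that produced it.

```python
from collections import OrderedDict

def minute_bars(ticks):
    """Aggregate (epoch_seconds, price, size) ticks into 1-minute OHLCV bars.

    Assumes ticks are sorted by time. Returns a list of
    (minute_start, open, high, low, close, volume) tuples.
    """
    bars = OrderedDict()
    for ts, price, size in ticks:
        minute = ts - ts % 60  # floor the timestamp to the start of its minute
        if minute not in bars:
            # first tick of the minute sets open, high, low and close
            bars[minute] = [price, price, price, price, size]
        else:
            bar = bars[minute]
            bar[1] = max(bar[1], price)  # high
            bar[2] = min(bar[2], price)  # low
            bar[3] = price               # close is the latest price seen
            bar[4] += size               # cumulative volume
    return [(m, *b) for m, b in bars.items()]

# Synthetic ticks: one per second for an hour (made-up data for illustration)
ticks = [(t, 100.0 + (t % 7) * 0.01, 10) for t in range(3600)]
bars = minute_bars(ticks)
print(len(ticks), "ticks ->", len(bars), "bars")  # 3600 ticks -> 60 bars
```

Shipping the 60 bars instead of the 3600 ticks is the whole idea; on real market data the ratio is usually much larger.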

Backtesting in KDB is relatively trivial as you have historical data organized by day and symbol. Any half decent KDB dev should be able to cook one up of increasing complexity for you.

Nick Psaris has a couple books that cover more advanced topics that may be of use.

3 comments

> you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis.

Honest question - why? An entire day of market data for busy option series will be in low hundreds of gigabytes with proper wire format, maybe with some compression it'd be tens of gigabytes. Even with 10 Gbit/s networking (which is kinda slow - I believe you can get at least 40 Gbit/s for Amazon EC2<->EBS) the whole day of data will be transferred in a few minutes, which means your bottleneck will be compute, not IO/network. And compute can be parallelized pretty easily.

because 10 gigabit per second networking is roughly 205 times slower than the hbm2 memory-to-cpu interface, which is 2048 gigabits per second. that means that some computations over the whole dataset will be roughly 205 times faster, running in a few hundred milliseconds instead of a few minutes. your question implies that no such computations exist, or at least could be of interest, but that's self-evidently false

that's assuming the data is in ram, but even a single nvme flash drive can reach 60 gigabits per second

(disclaimer, i've never used kdb, just numpy, pandas, glsl, etc.)
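The bandwidth figures quoted in this exchange can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 200 GB dataset size is an assumption standing in for "low hundreds of gigabytes", and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_gigabytes, link_gigabits_per_s):
    """Time to move size_gigabytes over a link, ignoring latency and overhead."""
    return size_gigabytes * 8 / link_gigabits_per_s

DATASET_GB = 200  # assumed size of one day of data

# Link speeds as quoted in the thread, in gigabits per second
links = {
    "10 Gbit/s network": 10,
    "60 Gbit/s NVMe": 60,
    "2048 Gbit/s HBM2": 2048,
}

for name, gbps in links.items():
    print(f"{name}: {transfer_seconds(DATASET_GB, gbps):.2f} s")
```

Under these assumptions the network transfer takes 160 seconds and a pass at HBM2 bandwidth takes under a second, a ratio of 2048 / 10 = 204.8.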

> your question implies that no such computations exist, or at least could be of interest, but that's self-evidently false

My question assumes the specific use case being discussed here. Backtesting is mostly about doing a lot of computations over the same data with different parameters, so you can prefetch data once and then iterate over it multiple times - the network penalty is paid only once.
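A rough sketch of that "fetch once, sweep many times" pattern, in plain Python. Everything here is hypothetical: `load_prices` stands in for the one-time network fetch and generates a synthetic random walk, and `backtest` is a toy strategy chosen only to have a parameter to sweep.

```python
import random

def load_prices(n=1000, seed=42):
    """Stand-in for the one-time fetch over the network (synthetic random walk)."""
    rng = random.Random(seed)
    price, prices = 100.0, []
    for _ in range(n):
        price += rng.gauss(0, 0.5)
        prices.append(price)
    return prices

def backtest(prices, window):
    """Toy strategy: hold one unit for the next step whenever the current price
    is above the mean of the previous `window` prices. Returns total P&L."""
    pnl = 0.0
    running = sum(prices[:window])  # sum of the `window` prices before index i
    for i in range(window, len(prices) - 1):
        if prices[i] > running / window:
            pnl += prices[i + 1] - prices[i]
        running += prices[i] - prices[i - window]  # slide the window forward
    return pnl

prices = load_prices()  # network cost is paid once here
# Many passes over the same local data, one per parameter value
results = {w: backtest(prices, w) for w in (5, 20, 50, 100)}
for w, pnl in results.items():
    print(w, round(pnl, 2))
```

The more parameter combinations in the sweep, the smaller the one-time transfer is relative to total runtime.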

my experience is that you can often compute a conservative approximation to the signals you're looking for that's valid over a range of parameters, vastly decreasing the data you have to ship across the wire

If it’s partitioned, this should be even faster.

> you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis.

Personally I do this and just throw time / compute at the problem - mostly because I don't want to pay for KDB in $ or learning curve.

If, however, one does it that way then the actual db is largely irrelevant - if the shop uses KDB, write a query once to pull symbol & timeframe and process locally.

> What does "move the code to the data" mean in practice?

This gets at why I’m finding KDB so hard to use. I’ve written a pricing and risk library in Rust. Historical data really needs to be pulled out and processed in Rust rather than in KDB.