by reisse 743 days ago
> you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis.

Honest question - why? An entire day of market data for busy option series will be in low hundreds of gigabytes with proper wire format, maybe with some compression it'd be tens of gigabytes. Even with 10 Gbit/s networking (which is kinda slow - I believe you can get at least 40 Gbit/s for Amazon EC2<->EBS) the whole day of data will be transferred in a few minutes, which means your bottleneck will be compute, not IO/network. And compute can be parallelized pretty easily.
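
As a rough check on that estimate, here is a minimal sketch of the transfer-time arithmetic. The 30 GB and 300 GB sizes are illustrative stand-ins for "tens" and "low hundreds" of gigabytes, not measured figures, and the calculation assumes an ideal link with no protocol overhead or disk bottleneck.

```python
def transfer_seconds(size_gigabytes: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_gigabytes over a link (8 bits per byte, no overhead)."""
    return size_gigabytes * 8 / link_gbit_per_s


# Assumed dataset sizes (GB) and link speeds (Gbit/s); both are illustrative.
for size_gb in (30, 300):
    for link in (10, 40):
        secs = transfer_seconds(size_gb, link)
        print(f"{size_gb} GB over {link} Gbit/s: {secs:.0f} s")
```

Under these assumptions, 300 GB over a 10 Gbit/s link takes 240 seconds, which matches the "few minutes" figure.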

2 comments

because 10 gigabit per second networking is about 205 times slower than the hbm2 memory-to-cpu interface, which is 2048 gigabits per second. that means that some computations over the whole dataset will be about 205 times faster, running in around a second instead of a few minutes. your question implies that no such computations exist, or at least that none could be of interest, but that's self-evidently false

that's assuming the data is in ram, but even a single nvme flash drive can reach 60 gigabits per second
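
The bandwidth comparison above can be sketched as follows. The three bandwidth figures are the ones quoted in this comment; the 30 GB dataset size is an assumption for illustration only, and the scan times are ideal lower bounds that ignore compute cost.

```python
# Bandwidths in Gbit/s, as quoted in the comment above.
BANDWIDTH_GBIT = {"network": 10, "nvme": 60, "hbm2": 2048}

# How many times faster each path is than the 10 Gbit/s network.
speedup = {name: bw / BANDWIDTH_GBIT["network"] for name, bw in BANDWIDTH_GBIT.items()}

DATASET_GB = 30  # assumed size, not a measured figure
# Ideal time to stream the whole dataset once over each path.
scan_seconds = {name: DATASET_GB * 8 / bw for name, bw in BANDWIDTH_GBIT.items()}

for name in BANDWIDTH_GBIT:
    print(f"{name}: {speedup[name]:.1f}x network, {scan_seconds[name]:.3f} s per pass")
```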

(disclaimer, i've never used kdb, just numpy, pandas, glsl, etc.)

> your question implies that no such computations exist, or at least could be of interest, but that's self-evidently false

My question assumes the specific use case being discussed here. Backtesting is mostly about doing a lot of computations over the same data with different parameters, so you can prefetch data once and then iterate over it multiple times - the network penalty is paid only once.
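
A minimal sketch of that fetch-once, sweep-many pattern follows. Everything in it is hypothetical: `make_fetcher` stands in for the expensive network transfer, and `backtest` is a toy trailing-mean strategy chosen only to give the parameter sweep something to compute.

```python
def make_fetcher(prices):
    """Return a fetch function plus a counter recording how often it was called."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1  # each call represents one full transfer over the network
        return list(prices)

    return fetch, calls


def backtest(prices, window):
    """Toy strategy: hold one unit for the next tick when price is above its trailing mean."""
    pnl = 0.0
    for i in range(window, len(prices) - 1):
        trailing_mean = sum(prices[i - window:i]) / window
        if prices[i] > trailing_mean:
            pnl += prices[i + 1] - prices[i]
    return pnl


fetch, calls = make_fetcher([100.0, 101.0, 102.0, 101.5, 103.0, 102.0, 104.0])
day = fetch()  # the network penalty is paid here, once
results = {window: backtest(day, window) for window in (2, 3, 4)}  # many passes, no refetch
print(calls["n"], results)
```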

my experience is that you can often compute a conservative approximation to the signals you're looking for that's valid over a range of parameters, vastly decreasing the data you have to ship across the wire
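
One way to picture the conservative approximation described above, as a sketch under assumptions: suppose the signal is "absolute price move exceeds a threshold" and the thresholds of interest are known in advance. Filtering at the smallest threshold next to the data yields a superset of every exact result, so only that subset has to cross the wire. The signal definition, the function names, and the data are all hypothetical.

```python
def exact_signal(moves, threshold):
    """Exact filter for one parameter value: moves larger than the threshold."""
    return [m for m in moves if abs(m) > threshold]


def conservative_prefilter(moves, min_threshold):
    """Superset of exact_signal(moves, t) for every t >= min_threshold."""
    return [m for m in moves if abs(m) > min_threshold]


moves = [0.1, -0.7, 0.3, 1.2, -0.05, 0.55, -2.0, 0.2]  # made-up tick moves
thresholds = (0.5, 1.0, 1.5)  # assumed parameter range

# Runs next to the data; only `shipped` crosses the wire.
shipped = conservative_prefilter(moves, min(thresholds))

# Runs locally; each exact filter gives the same answer as on the full data.
for t in thresholds:
    assert exact_signal(shipped, t) == exact_signal(moves, t)

print(f"shipped {len(shipped)} of {len(moves)} rows")
```
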
If it’s partitioned, this should be even faster.