| Hmmm. I used to be part of a team that handled market data at crazy rates and we took exactly the opposite approach to these guys. When I see: "You Can Lose a Few Datapoints Here and There" I see that these guys are barking up the wrong tree. 1. We used a single thread per network card. (Yes, we architected clusters/failovers, etc... but not once was it required because of data rates) 2. The server could handle a fully saturated Gbit network at <50% CPU (per core) 3. Data was NEVER thrown away (but we had allowances in our API to let the client reading the data drop updates and get sub-second aggregates instead -- e.g. OHLC or summation) 4. Data was stored in basically flat file systems. 5. Our calculation engine was run 'downstream' toward the client ends, or on the client end, away from data collection. If needed (i.e. the calcs were expensive to run), these could feed back into the server for long-term storage. This was mid 2000. I'm sure this is not rocket science for modern-day time-series guys. |
Hardware capture almost never drops packets, and it timestamps them with GPS-synced clocks.
You can then take those capture files and manipulate them however you want into normalized market data.
Market data has the notable feature of being segmented by trading day, so the combination of symbol-venue-date is an appropriately small unit of data to run aggregations of any kind over or to distribute over a cluster.
So for market data at least, there's not much to "rolling your own" time series DB in Python or what-have-you.
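To make the symbol-venue-date idea concrete, here is a rough sketch of partitioning normalized ticks and running an OHLC aggregation per partition. Everything in it is an assumption for illustration: the `Tick` fields, the `partition` and `ohlc` helpers, and especially the use of the UTC calendar date as the "trading day" (real venues define sessions differently).

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import NamedTuple


class Tick(NamedTuple):
    """Hypothetical normalized trade record; field names are illustrative."""
    ts: datetime
    symbol: str
    venue: str
    price: float
    size: int


def partition_key(tick):
    # Assumption: the UTC calendar date stands in for the trading day.
    return (tick.symbol, tick.venue, tick.ts.date())


def partition(ticks):
    """Group ticks into independent symbol-venue-date buckets."""
    parts = defaultdict(list)
    for tick in ticks:
        parts[partition_key(tick)].append(tick)
    return parts


def ohlc(ticks):
    """Open/high/low/close plus volume for one non-empty partition."""
    ordered = sorted(ticks, key=lambda t: t.ts)
    prices = [t.price for t in ordered]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
        "volume": sum(t.size for t in ordered),
    }
```

Each bucket is small and independent of the others, so `ohlc` (or any other aggregation) could be mapped over the buckets one at a time or farmed out to workers.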
Processing that firehose in real time for trading is a different matter, though, and how you build that depends heavily on your latency requirements.