Hacker News
by tuukkah 2778 days ago
TLDR: "Why Not to Build a Time-Series Database? Because we're building one and you should pay us."

> Hopefully our story will make you think twice before trying to build your own TSDB in house using open-source solutions, or if you’re really crazy, building a TSDB from scratch. Building and maintaining a TSDB is a full time job, and we have dedicated expert engineers who are constantly improving and maintaining our TSDB, and no doubt will iterate the architecture again over time as we hit an even higher magnitude of scale down the line.

> Given our experience in this complex space, I would sincerely recommend you don’t try and do this at home, and if you have the money you should definitely outsource this to the experts who do this as a full time job, whether it’s Outlyer or another managed TSDB solution out there. As so many things turn out in computing, it’s harder than it looks!

5 comments

Hmmm. I used to be part of a team that handled market data at crazy rates and we took exactly the opposite approach to these guys.

When I see:

"You Can Lose a Few Datapoints Here and There"

I see that these guys are barking up the wrong tree.

1. We used a single thread per network card. (Yes, we architected clusters/failovers, etc... but not once was it required because of data rates)

2. The server could handle a fully saturated Gbit network at <50% CPU (per core)

3. Data was NEVER thrown away (but we had allowances in our API to let the client reading the data drop updates and get sub-second aggregates instead -- e.g. OHLC or summation)

4. Data was stored in basically flat file systems.

5. Our calculation engine was run 'downstream' toward the client ends, or on the client end, away from data collection. If needed (ie. the calcs were expensive to run), these could feed back into the server for long term storage.
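
To illustrate point 3, here is a minimal sketch of how a reading client might collapse raw ticks into sub-second OHLC bars instead of consuming every update. The function name `ohlc_bars`, the 0.5 second bucket width, and the `(timestamp, price)` tuple layout are assumptions made for the example, not details of the system described above.

```python
def ohlc_bars(ticks, bucket_seconds=0.5):
    """Collapse (timestamp, price) ticks into per-bucket OHLC bars.

    Assumes ticks arrive sorted by timestamp. The raw ticks are left
    untouched; only this reader's view is aggregated.
    """
    bars = {}
    for ts, price in ticks:
        bucket = int(ts // bucket_seconds)  # index of the time bucket
        bar = bars.get(bucket)
        if bar is None:
            # First tick in the bucket sets all four fields.
            bars[bucket] = {"open": price, "high": price,
                            "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far
    return bars


# Hypothetical ticks: (seconds since start, price)
ticks = [(0.10, 100.0), (0.20, 101.5), (0.45, 99.5),
         (0.60, 100.5), (0.90, 102.0)]
bars = ohlc_bars(ticks)
for bucket, bar in bars.items():
    print(bucket, bar)
```

The server still stores every tick; a slow client just asks for the coarser view.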

This was mid 2000. I'm sure this is not rocket science for modern day timeseries guys.

Yeah, it's still pretty much the same, just at 10 or 40 Gbit now.

Hardware capture almost never drops packets, and it timestamps with GPS sync.

You can then take those capture files and manipulate them however you want into normalized market data.

Market data has the notable feature of being segmented by trading day, so the combination of symbol-venue-date is an appropriately small unit of data to run aggregations of any kind over or to distribute over a cluster.
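
A rough sketch of that partitioning idea in Python: records are grouped by a (symbol, venue, date) key, and each group can then be aggregated on its own or handed to a separate worker. The record layout, the symbol and venue names, and the use of the UTC calendar date as the "trading day" are all assumptions for illustration; a real venue defines its own session boundaries.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(symbol, venue, ts):
    """Build the (symbol, venue, date) key for one record.

    Assumption: the trading day is approximated by the UTC date.
    """
    day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
    return (symbol, venue, day)


def partition(records):
    """Group (symbol, venue, unix_ts, price) records by partition key."""
    parts = defaultdict(list)
    for symbol, venue, ts, price in records:
        parts[partition_key(symbol, venue, ts)].append((ts, price))
    return parts


# Hypothetical records: (symbol, venue, unix timestamp, price)
records = [
    ("ABC", "X1", 1_700_000_000, 10.0),
    ("ABC", "X1", 1_700_000_060, 10.5),
    ("ABC", "X2", 1_700_000_030, 10.2),
    ("ABC", "X1", 1_700_090_000, 11.0),  # 25 hours later, next UTC day
]
parts = partition(records)

# Each partition is small enough to aggregate independently.
daily_high = {key: max(p for _, p in rows) for key, rows in parts.items()}
```

Because no aggregation crosses a partition boundary, the keys can be spread over files or machines without coordination.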

So for market data at least, there's not much to "rolling your own" time series DB in Python or what-have-you.

Processing that firehose in real time for trading is a different matter though, and how you build that depends heavily on your latency requirements.

Right. For those interested, OpenHFT has created a really nice set of open source solutions to do this.

https://github.com/OpenHFT/Chronicle-Queue#design

Do you know any article or book outlining the architecture of a full HFT system, i.e. from market data consumption to pricing to trading? Thanks in advance!

That's how I read it too. To people who haven't worked with metrics at scale, though, there is some good information and it's worth reading.

It blows my mind that businesses are willing to outsource metrics. When I worked at Amazon it was trivial to estimate the next quarter's results from the app metrics. Naturally this meant we were subject to trading restrictions.

If a monitoring company ever starts applying Google/Facebook style ethics with regards to exploiting the data their customers give them, they will be in an incredibly powerful position.

> It blows my mind that businesses are willing to outsource metrics.

It makes sense at various scales compared to hiring, training, maintaining infrastructure, handling incidents, etc. related to your own metrics solution.

When I'm interviewing a database expert, the one that says:

> "You Can Lose a Few Datapoints Here and There"

is not the one I'm going with...

There are some properties of the data that can be exploited to add weaker consistency guarantees. This leads to some desirable design trade-offs in terms of simplicity and performance optimisation. While this could result in data loss, it may be permissible given that queries can span large time ranges where one or two missing datapoints do not carry the same weight as a financial miscalculation, or loss of life. The same could be said with multiplayer games played over mobile devices, with intermittent connectivity issues. In this domain, the player's moves are fast forwarded once connectivity is restored, as this provides no observable difference to other players. My point is that it's very dependent on the use case, and does not apply across the board.

There's nothing wrong with a special-purpose tool for building approximate graphs, but calling it a "time-series database" or even quoting "inserts-per-second" is intellectually dishonest.

Many SSDs only write in 4 KB blocks, so writing a single uncompressed 64-bit datapoint to disk would not only be slow, it would also cause write amplification and wear out the disk sooner. The solution that many TSDBs, including Prometheus and Influx, use involves in-memory batching backed by a WAL file. If the in-memory batch or the WAL is lost, you lose data as well.
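
For anyone who hasn't seen the pattern, here is a toy sketch of in-memory batching with a write-ahead log. It is not how Prometheus or Influx actually implement it; the class name `BatchingWriter`, the file names, the 16-byte record layout, and the batch size are all made up for the example.

```python
import os
import struct
import tempfile

# One record: int64 timestamp + float64 value, little-endian (16 bytes).
RECORD = struct.Struct("<qd")


class BatchingWriter:
    """Toy writer: append to a WAL first, flush to the data file in batches."""

    def __init__(self, directory, batch_size=256):
        self.wal_path = os.path.join(directory, "wal.log")
        self.data_path = os.path.join(directory, "data.bin")
        self.batch_size = batch_size
        self.batch = []
        self.wal = open(self.wal_path, "ab")

    def append(self, ts, value):
        # Durably log the point before acknowledging it.
        self.wal.write(RECORD.pack(ts, value))
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self.batch.append((ts, value))
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.batch:
            return
        # One large sequential write instead of many tiny ones.
        with open(self.data_path, "ab") as f:
            f.write(b"".join(RECORD.pack(t, v) for t, v in self.batch))
            f.flush()
            os.fsync(f.fileno())
        self.batch.clear()
        self.wal.truncate(0)  # logged points are now in the data file


def recover(wal_path):
    """Read back points that were logged but not yet flushed."""
    with open(wal_path, "rb") as f:
        raw = f.read()
    usable = len(raw) - len(raw) % RECORD.size  # ignore a torn final record
    return [RECORD.unpack_from(raw, off)
            for off in range(0, usable, RECORD.size)]


with tempfile.TemporaryDirectory() as d:
    w = BatchingWriter(d, batch_size=3)
    for i in range(4):
        w.append(i, i * 1.5)
    pending = recover(w.wal_path)            # what a restart would replay
    flushed_bytes = os.path.getsize(w.data_path)
    w.wal.close()
```

Calling `fsync` on every append is what makes this toy version slow. Relaxing it, for example by syncing the WAL only every so often, is exactly where the "lose a few datapoints" window comes from.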
You shouldn't be the one hiring if you can't talk about different scenarios.

"Don't do it because it's hard! And you should listen to us because we have the meta-knowledge and experience (now) to know everything there is to know about this topic, plus the bravery to admit in public that we are only human and we make mistakes."

Isn't this the mantra of all of these types of articles?

OK, yes, it usually makes sense. Especially in the case where you are like these guys and experienced enough in some relevant area to know just how difficult it can be. These are perfectly good reasons from technical, business, and project planning perspectives.

Isn't there a TLDR where they mentioned when it does make sense to build your own TSDB? Presumably in some case where you have a team of serious, high-grade experts who know exactly what they are doing; have requirements that cannot be met by any of the other offerings out there; and where the whole thing has been specced out and deemed reasonable?

The premise itself is also quite funny, imo. I think very few people on this planet would think "I need a database so I build one myself". They might do this with the application layer, but most people consider databases black boxes that they interact with through SQL, or maybe not even that. Maybe they just use an abstraction framework in their favorite languages that lets them write objects which have .load() and .save() methods that generate SQL by themselves.
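
As a small illustration of that last kind of abstraction, here is a hand-rolled sketch using Python's built-in `sqlite3` module. The `Metric` class, its table, and its column names are hypothetical; real frameworks generate the SQL from the class definition rather than having it written out like this.

```python
import sqlite3


class Metric:
    """Hypothetical object whose save()/load() hide the SQL from callers."""

    def __init__(self, name, value, id=None):
        self.id = id
        self.name = name
        self.value = value

    def save(self, conn):
        if self.id is None:
            # New object: INSERT and remember the generated primary key.
            cur = conn.execute(
                "INSERT INTO metric (name, value) VALUES (?, ?)",
                (self.name, self.value),
            )
            self.id = cur.lastrowid
        else:
            # Existing object: UPDATE the row in place.
            conn.execute(
                "UPDATE metric SET name = ?, value = ? WHERE id = ?",
                (self.name, self.value, self.id),
            )
        conn.commit()

    @classmethod
    def load(cls, conn, id):
        row = conn.execute(
            "SELECT id, name, value FROM metric WHERE id = ?", (id,)
        ).fetchone()
        return cls(row[1], row[2], id=row[0]) if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metric (id INTEGER PRIMARY KEY, name TEXT, value REAL)"
)
m = Metric("cpu", 0.5)
m.save(conn)
loaded = Metric.load(conn, m.id)
```

The caller only ever sees `save()` and `load()`, which is the black-box view of the database described above.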