Hacker News
by cdoxsey 2777 days ago
Granted, it's an exceptional case, but the query loads we saw at Datadog were poorly served by off-the-shelf solutions.

Maybe things are different now, but I doubt it. You can spend a fortune to get good performance, you can deal with slow performance, or you can invest a lot of engineering effort and get both. But there's no ready-to-use solution that will magically replace an entire engineering team at real scale.

But almost no one ever hits that scale, so maybe it's better to adopt this line as a rule of thumb anyway...


I think ClickHouse would do well, but I've seen other metrics/observability vendors (like Honeycomb) also build their own systems given the scale and cost factors.

Isn't Datadog on AWS? If you have very specific needs and can build a vertical infrastructure stack then it makes perfect sense to build your own.

Yeah, AWS, though mostly just EC2.

I think the challenge is that there are multiple competing needs that are in tension with each other. Data isn't uniform: there's a large write load that's almost never queried, and recent data is accessed far more often than older data. Flexible tagging means a single (org, metric) query can produce potentially millions of points (imagine disk usage across every disk on every node), but indexing tags can be very costly, and it's difficult to predict what someone is going to want to query.
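To put rough numbers on the fan-out problem, here is a back-of-the-envelope sketch. The fleet size, disks per host, sampling interval and query window are made-up assumptions, not Datadog figures; the point is only that every tag combination (here host × disk) is its own series, so the number of points a single (org, metric) query touches multiplies quickly.

```python
# Hypothetical fleet: every number here is an illustrative assumption.
hosts = 1_000
disks_per_host = 8
interval_s = 10          # one sample every 10 seconds
window_s = 60 * 60       # a one-hour query window

# Each (host, disk) tag combination is a separate series.
series = hosts * disks_per_host
points_per_series = window_s // interval_s
total_points = series * points_per_series

print(f"{series} series x {points_per_series} points = {total_points:,} points")
```

Stretching the window to a day or the fleet to 10,000 hosts scales the result linearly, which is why a seemingly simple "disk usage for this org" graph can be expensive to serve.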

I agree that hyper-focus on those needs can distort the picture though. You don't actually have to solve them most of the time, and a relatively poorly optimized solution goes a lot further than people realize. Simply adding caching, for example, solves almost all these issues.
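A minimal sketch of what "simply adding caching" might look like, assuming a read-through cache keyed by (org, metric, start, end) in front of a slow backend. The `QueryCache` class, the `slow_query` signature, the metric name and the `settle_s` idea (treating windows that ended a few minutes ago as immutable and caching them indefinitely, while windows touching recent data get a short TTL) are all hypothetical illustrations, not how Datadog or any particular product does it.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical backend signature: (org, metric, start, end) -> points.
QueryFn = Callable[[str, str, int, int], List[float]]
Key = Tuple[str, str, int, int]


class QueryCache:
    """Read-through cache in front of a slow time-series query function.

    Assumption: windows that ended more than `settle_s` seconds ago no longer
    receive writes, so they are cached with no expiry; windows that touch
    recent data are cached for only `ttl_s` seconds.
    """

    def __init__(self, query_fn: QueryFn, ttl_s: float = 30.0, settle_s: float = 300.0):
        self._query_fn = query_fn
        self._ttl_s = ttl_s
        self._settle_s = settle_s
        # key -> (expiry timestamp, or None for "never expires", cached result)
        self._entries: Dict[Key, Tuple[Optional[float], List[float]]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, org: str, metric: str, start: int, end: int,
            now: Optional[float] = None) -> List[float]:
        now = time.time() if now is None else now
        key = (org, metric, start, end)
        entry = self._entries.get(key)
        if entry is not None and (entry[0] is None or now < entry[0]):
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = self._query_fn(org, metric, start, end)
        settled = end <= now - self._settle_s
        expiry = None if settled else now + self._ttl_s
        self._entries[key] = (expiry, result)
        return result


# Usage with a fake backend that records how often it is actually called.
calls: List[Key] = []

def slow_query(org: str, metric: str, start: int, end: int) -> List[float]:
    calls.append((org, metric, start, end))
    return [float(start), float(end)]

cache = QueryCache(slow_query, ttl_s=30, settle_s=300)
now = 10_000.0
cache.get("acme", "disk.used", 0, 3_600, now=now)            # miss: settled window
cache.get("acme", "disk.used", 0, 3_600, now=now + 1e6)      # hit: never expires
cache.get("acme", "disk.used", 6_400, 9_990, now=now)        # miss: recent window
cache.get("acme", "disk.used", 6_400, 9_990, now=now + 10)   # hit: within TTL
cache.get("acme", "disk.used", 6_400, 9_990, now=now + 60)   # miss: TTL expired
```

This leans on the access pattern described above: repeated dashboard queries over settled history cost nothing after the first call, while recent windows stay fresh at the price of re-querying every `ttl_s` seconds. The sketch never evicts entries; a real cache would need a size bound (for example LRU) and would probably align query windows so that overlapping requests can share results.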

Anyway I mostly agreed with your opening comment.