Hacker News
Husky, Datadog's Third-Generation Event Store (datadoghq.com)
191 points by louis-paul 1497 days ago
7 comments

Lovely read. Condensing some, there are three node types in the system: writers, compactors, and readers.

> Writers read from Kafka, (briefly) buffer events in memory, upload events to blob storage in our custom file format, and then commit the presence of these new files to our metadata store.... Compactors scan the metadata store for small files generated by the Writers and previous compactions, and compact them into larger files.... The Reader (leaf) nodes run queries over individual files in blob storage and return partial aggregates, which are re-aggregated by the distributed query engine.
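
The "partial aggregates, which are re-aggregated" part can be sketched roughly as below. This is only an illustration under assumptions, not Husky's code: each "file" is a plain list of (service, latency) events, and `leaf_partial`, `merge_partials`, and `averages` are hypothetical names.

```python
from collections import defaultdict

def leaf_partial(events):
    """Leaf/reader step: scan one file, return {service: (count, total_latency)}."""
    partial = defaultdict(lambda: (0, 0.0))
    for service, latency in events:
        count, total = partial[service]
        partial[service] = (count + 1, total + latency)
    return dict(partial)

def merge_partials(partials):
    """Query-engine step: combine per-file partials into one result."""
    merged = defaultdict(lambda: (0, 0.0))
    for partial in partials:
        for service, (count, total) in partial.items():
            c, t = merged[service]
            merged[service] = (c + count, t + total)
    return dict(merged)

def averages(merged):
    """Finalize only after merging, so the mean is weighted correctly."""
    return {service: total / count for service, (count, total) in merged.items()}

# Two made-up files, standing in for objects in blob storage.
files = [
    [("api", 10.0), ("api", 30.0), ("db", 5.0)],
    [("api", 20.0), ("db", 15.0)],
]
merged = merge_partials(leaf_partial(f) for f in files)
result = averages(merged)
```

The leaves return (count, sum) rather than a finished average, because averaging per-file averages would weight small files the same as large ones.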

And then the metadata supporting the system:

> Husky's metadata store has multiple responsibilities, but its most important one is to serve as the strongly consistent source of truth for the set of files currently visible to each customer. We’ll delve into the details of our metadata store more in future blog posts, but it is a thin abstraction around FoundationDB, which we selected because it was one of the few open source OLTP database systems that met our requirements

There are some nice scalability/isolation benefits in all this. Having reader nodes read from network storage has created a lot of flexibility and the ability to shift work around on demand.

Keeping all the metadata in FoundationDB is exciting, and sounds like a great use case for its safe transactional updates!
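
Why transactional updates matter here can be shown with a toy model. The sketch below is an in-memory stand-in guarded by a lock, not FoundationDB and not Datadog's schema; `MetadataStore` and its methods are hypothetical names. The point is that a compaction swaps its input files for its output file in one atomic step, so a reader never sees both (double counting) or neither (missing data).

```python
import threading

class MetadataStore:
    """In-memory stand-in for a transactional metadata store (assumption, not FoundationDB)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._visible = {}  # customer -> set of visible file names

    def commit_new_file(self, customer, name):
        """Writer path: make a freshly uploaded file visible."""
        with self._lock:
            self._visible.setdefault(customer, set()).add(name)

    def commit_compaction(self, customer, inputs, output):
        """Compactor path: atomically replace `inputs` with `output`.

        Returns False (and changes nothing) if any input is no longer
        visible, e.g. because another compaction already consumed it.
        """
        with self._lock:
            files = self._visible.setdefault(customer, set())
            if not set(inputs) <= files:
                return False
            files.difference_update(inputs)
            files.add(output)
            return True

    def visible_files(self, customer):
        """Reader path: consistent snapshot of the visible file set."""
        with self._lock:
            return set(self._visible.get(customer, set()))

store = MetadataStore()
store.commit_new_file("acme", "small-1")
store.commit_new_file("acme", "small-2")
first = store.commit_compaction("acme", ["small-1", "small-2"], "big-1")
second = store.commit_compaction("acme", ["small-1"], "big-2")  # stale input
```

In a real deployment the lock would be replaced by the database's transactions, so the same check-then-swap works across many compactor processes.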

Also, using external compactors gives another independent scaling dimension. Nice.
It's remarkable how the data pipelines at almost all companies converge on the same architecture:

* You have services emit data into streams.

* You dump the streams into your storage at high frequency so you can have near-real-time results; this process creates many small files.

* Because small files are inefficient, you have compactors that run over the small files and merge them into bigger files, and/or delete records that are obsolete.

* You run a query engine that reads over the small files and large files to get the final result.

* To speed up steps 2, 3, and 4, you store the metadata of the files in memory or in a database.
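
A toy version of steps 2 to 4, to show how the pieces fit. Everything here is an assumption for illustration: records are (key, version, value) tuples, the catalog list plays the role of both the metadata store and blob storage, and `SMALL_FILE_ROWS` is a made-up threshold.

```python
SMALL_FILE_ROWS = 4  # hypothetical cutoff for what counts as a "small" file

def newest_per_key(rows):
    """Keep the highest-version record for each key; older ones are obsolete."""
    best = {}
    for key, version, value in rows:
        if key not in best or version > best[key][0]:
            best[key] = (version, value)
    return best

def flush(catalog, name, batch):
    """Step 2: dump a short buffer as a new small file."""
    catalog.append({"name": name, "rows": list(batch)})

def compact(catalog, name):
    """Step 3: merge small files into one, dropping obsolete records."""
    small = [f for f in catalog if len(f["rows"]) < SMALL_FILE_ROWS]
    if len(small) < 2:
        return catalog  # nothing worth merging
    small_names = {f["name"] for f in small}
    rows = [r for f in small for r in f["rows"]]
    merged_rows = [(k, ver, val) for k, (ver, val) in sorted(newest_per_key(rows).items())]
    kept = [f for f in catalog if f["name"] not in small_names]
    return kept + [{"name": name, "rows": merged_rows}]

def query(catalog):
    """Step 4: read every visible file, small or large, and resolve versions."""
    rows = [r for f in catalog for r in f["rows"]]
    return {k: val for k, (ver, val) in newest_per_key(rows).items()}

catalog = []
flush(catalog, "small-1", [("a", 1, "x"), ("b", 1, "y")])
flush(catalog, "small-2", [("a", 2, "z")])
before = query(catalog)
catalog = compact(catalog, "big-1")
after = query(catalog)
```

The query result is the same before and after compaction; only the number of files the engine has to open changes.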

Nice article indeed, we ended up implementing the exact same architecture at Quickwit for... log search! :)

https://twitter.com/fulmicoton/status/1526776987553263616 https://github.com/quickwit-oss/quickwit

This is a great read, thanks for sharing the architecture. I am glad to see the increase in adoption of FoundationDB. It is a great piece of technology, which is also why we are using it as a core component for Tigris https://docs.tigrisdata.com/overview/key-concepts
Has Datadog come up with a new generation of sales approaches? I (and many others, according to the discussion when the topic comes up) have had bad experiences.
Had bad experiences as well.

- Pushy
- Trying to sell you stuff even if you explicitly mention multiple times that you're only interested in one specific service
- Don't tailor the sales process at all to your needs

It's a shame, the product is kind of nice. But this is 100% off-putting.

same here

One mistake in my logs, and my account owed more than US$10k, until a manager contacted me a month later. It appears to be a method to force a "sales" call.

A simple daily indicator of how much you owe would solve this kind of problem. (Google/Reddit shows that this kind of problem has happened all the time over the last 2 years.)

Well in their last earnings call they did boast incredible sales efficiency.
Can you be more specific? We’re evaluating them now and it’s been fine.
We have instances that spin up and down quickly. AWS bills by the second; Datadog billed at that time (unsure if it's changed) by the minute. This mismatch led to huge bills, such that monitoring was more expensive than the resource being monitored. It's probably fair to respond to that with RTFM. However, par for the course in the industry seems to be to adjust the bill when a mistake was made in good faith. Their response was to give us a small adjustment in exchange for signing up for additional services. More important than what happened is how it felt. It felt sleazy, and didn't jibe with the way the company was presented in the community.
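
To show how a granularity mismatch like this can blow up, here is a back-of-the-envelope sketch. It assumes each instance's lifetime is rounded up to the next billing unit, which is a guess at the model and not a statement of either vendor's actual billing rules; the workload numbers are made up.

```python
import math

def billed_seconds(lifetimes, granularity):
    """Round each instance lifetime up to a multiple of `granularity` seconds."""
    return sum(math.ceil(t / granularity) * granularity for t in lifetimes)

# Made-up workload: 1,000 instances that each live for 5 seconds.
lifetimes = [5] * 1000

per_second = billed_seconds(lifetimes, 1)    # 5 s billed per instance
per_minute = billed_seconds(lifetimes, 60)   # 60 s billed per instance
ratio = per_minute / per_second
```

Under these assumptions the per-minute meter bills 12 times as much usage as the per-second one for the same workload, and the gap grows as instances get shorter-lived.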

As for the tech, it seemed like a quality product.

Not confined to Datadog, but I always feel like I’m doing something wrong if my o11y costs are more than my infra costs.
And that results in compromises such as heavy sampling, ignoring a facet of data altogether because "we won't need it if something goes wrong...", and/or using some kind of log stream processor to divert large amounts of data to S3 instead of allowing it to be queried whenever you want.
Nice read, but I was hoping they’d say that it led to a big improvement in their log searching syntax/UI. It seems impossible to just full-text search for a string and find log lines that have a value containing that text. Drilling down through the “details” pane and clicking filter/match/exclude works well, but general searching is too confusing for me to figure out, if it even works at all.
One could argue that all they did was move most of the complicated logic into the blob store. Not that it's a bad thing.