Hacker News
by lacksconfidence 3432 days ago
> I wonder what they're using to retrieve that data for analysis later on. I do something very similar to this, but having to sift through millions of messages for a given time period, to find a subset of said messages is kinda annoying.

It looks like Hive or Spark, depending on the use case. The data is also loaded into Druid for looking at aggregate statistics, as opposed to pulling full data about individual messages.

> It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-Over-Kafka out of the box, on the caveat that every single time it reads a message off kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.

They are using Camus; much of the post is dedicated to it. It looks like they are also running Avro over Kafka+Camus for some application logging, but at a lower volume (~10k messages/sec peak).

1 comment

iirc, Confluent has its own version of Kafka/Camus that uses a schema registry, where the first few bytes of each Kafka message identify the schema.
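
For anyone curious what that framing looks like, here is a rough sketch, not Confluent's or Wikimedia's actual code. It assumes the Confluent wire format (one magic byte of 0, then a 4-byte big-endian schema ID, then the Avro payload). The names `split_message` and `CachingSchemaLookup`, and the `fetch` callable standing in for a registry request, are made up for illustration. Caching by schema ID is one way to avoid hitting the registry on every message.

```python
import struct
from typing import Callable, Dict, Tuple

MAGIC_BYTE = 0  # first byte of a Confluent-framed message


def split_message(raw: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_payload) from a framed Kafka message value."""
    if len(raw) < 5:
        raise ValueError("message too short for the 5-byte header")
    # ">bI" = big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, raw[5:]


class CachingSchemaLookup:
    """Hypothetical helper: calls the registry once per schema ID, then reuses it."""

    def __init__(self, fetch: Callable[[int], str]):
        self._fetch = fetch  # stand-in for a real schema registry request
        self._cache: Dict[int, str] = {}
        self.fetch_count = 0

    def get(self, schema_id: int) -> str:
        if schema_id not in self._cache:
            self._cache[schema_id] = self._fetch(schema_id)
            self.fetch_count += 1
        return self._cache[schema_id]


if __name__ == "__main__":
    raw = struct.pack(">bI", MAGIC_BYTE, 42) + b"avro-bytes"
    lookup = CachingSchemaLookup(lambda sid: f"schema-{sid}")
    for _ in range(10_000):
        schema_id, payload = split_message(raw)
        lookup.get(schema_id)
    print(lookup.fetch_count)  # registry lookups actually made
```

With a cache like this, the number of registry lookups tracks the number of distinct schemas rather than the number of messages.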

The Wikimedia article sounds like they're just using regular Camus and their own interpreter. That would perform a bit better :) Still wonder why they didn't just write a Spark job to do the same thing.