
by WWLink 3480 days ago

No. I think the ideal use case here is that you send JSON over Kafka and store the data in Avro files. The Avro files have the schema at the start.

I wonder what they're using to retrieve that data for analysis later on. I do something very similar to this, but having to sift through millions of messages for a given time period, to find a subset of said messages is kinda annoying.

It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-Over-Kafka out of the box, with the caveat that every single time it reads a message off Kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.


lacksconfidence 3480 days ago

> I wonder what they're using to retrieve that data for analysis later on. I do something very similar to this, but having to sift through millions of messages for a given time period, to find a subset of said messages is kinda annoying.

It looks like Hive or Spark, depending on the use case. The data is also loaded into Druid for looking at statistics, as opposed to getting full data about individual messages.

> It's a good thing they didn't use Confluent Camus. -shudder- It supports Avro-Over-Kafka out of the box, with the caveat that every single time it reads a message off Kafka, it pings the schema registry to get the schema for it. That's great and all, until you've got thousands of messages per second.

They are using Camus; much of the post is dedicated to it. It looks like they are also running Avro over Kafka + Camus for some application logging, but at a lower volume (~10k messages/sec peak).


WWLink 3479 days ago

iirc, Confluent has their own version of Kafka/Camus that uses a schema registry where the first few bytes of the Kafka messages identify the schema.

The Wikimedia article sounds like they're just using regular Camus and their own interpreter. That would perform a bit better :) Still wonder why they didn't just write a Spark job to do the same thing.


koolba 3480 days ago

Why would they need to hit the registry for every message? Wouldn't the schemas be immutable and thus able to be (at least temporarily) cached? They might have millions of messages but it's doubtful they have millions of message schemas.
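A minimal sketch of the caching idea, assuming a given schema ID always maps to the same schema. `fetch_schema_from_registry` is a hypothetical stand-in for the network call, not a real client API; the `CALLS` counter only exists to show how often the "registry" is actually hit.

```python
from functools import lru_cache

# Counts simulated registry round trips (illustration only).
CALLS = {"n": 0}


def fetch_schema_from_registry(schema_id: int) -> str:
    """Hypothetical stand-in for an HTTP request to a schema registry."""
    CALLS["n"] += 1
    return f'{{"type": "record", "name": "Event{schema_id}", "fields": []}}'


@lru_cache(maxsize=1024)
def get_schema(schema_id: int) -> str:
    """Return the schema for schema_id, hitting the registry only on a miss."""
    return fetch_schema_from_registry(schema_id)


def decode_stream(schema_ids):
    """Look up the schema for each message; repeated IDs come from the cache."""
    return [get_schema(schema_id) for schema_id in schema_ids]
```

With only a handful of distinct schemas in flight, the number of registry calls tracks the number of schemas, not the number of messages. If schemas can change under the same ID, the cache would need an expiry or an invalidation hook.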


vlahmot 3480 days ago

The schemas are not immutable. You also don't hit the schema registry for every message; in fact, you can skip the registry altogether and provide the schema manually if you would like.


lacksconfidence 3480 days ago

You could provide them manually, but then any schema upgrade becomes a big pain. Wikimedia, as one example, uses versioned schemas. As such, each version is immutable and can be pulled from the cache. Each Kafka message is prefixed with a null byte followed by a long version number that indicates how it should be decoded.

https://github.com/wikimedia/analytics-refinery-source/blob/...
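Roughly what that framing could look like, as a sketch rather than the code behind the link above. It assumes the "long" is a Java-style 8-byte big-endian signed integer; the names `frame` and `unframe` are made up, and the payload is treated as opaque bytes instead of being Avro-decoded.

```python
import struct

# Assumed layout: 1 null "magic" byte + 8-byte big-endian signed version.
HEADER = struct.Struct(">Bq")


def frame(schema_version: int, payload: bytes) -> bytes:
    """Prefix an encoded payload with the null byte and the schema version."""
    return HEADER.pack(0, schema_version) + payload


def unframe(message: bytes):
    """Split a framed message into (schema_version, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message is shorter than the 9-byte header")
    magic, schema_version = HEADER.unpack_from(message)
    if magic != 0:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_version, message[HEADER.size:]
```

A consumer would call `unframe`, use the returned version as the cache key for the schema lookup, and only then decode the payload.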
