| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bb01100100 3420 days ago

I'm a bit late to the party on this post, but a lot has changed since 2015, too:

- Kafka writes its topic offsets into, er, Kafka, rather than Zookeeper; which means ZK load is much lighter. I'd be interested in knowing more about why people find ZK heavy weight - I've found it to be light, unobtrusive and very robust. I also use it for managing state (distributed locks) as brokers boot up and obtain a consistent Broker ID from a defined list of available 'slots' (e.g. if broker 2 goes down, I want the replacement node to also boot up and be assigned broker ID 2; if two nodes go down, I want those free slots to be allocated to the next two nodes that boot up).

- Kafka v0.10 released the Streams API, which has many similarities to Samza, but removes the requirement to run another cluster for your streaming applications (no YARN requirement - phew! Kafka will load balance your consumer-group streaming-app; kill/restart/boot as you see fit). Consume data as streams and build virtual tables (persisted to disk via RocksDB for get/put operations, written into Kafka for availability/persistence); join stream->stream, stream->table, filter, map, aggregate via time windows, write your own processor/transform classes.. it's very flexible without always needing to dive into the lower level Processor API.

- Kafka v0.10.[01] also included an Interactive Queries API - what you get when you realise that your stream apps have created intermediate aggregates/tables of information that would be so handy to query (by other apps, or internally, from across the cluster). I haven't used this API yet, but anticipate doing so in 2017.

Samza to me always felt like it was in tech preview - not comprehensively documented and the only real source of knowledge was reading the source and asking questions in the mailing lists (of which there were excellent and high quality answers).

What other architectures are people using for streaming data pipelines/apps (things that do more than count clicks, i.e. transforms, enrichment, etc) - i.e. lowish latency (event to output in less than, say, five seconds) where relational databases aren't a natural solution?

I'm leveraging Kafka at my day job as a distributed data backbone (company-wide focus - think application logs, app usage logs, usage-derived billing, subscriptions, master data, etc), which when coupled with Streams (new breed of application at this company) gives us our first steps towards event-driven processing and data delivery as a stream of new information, rather than periodic batches.

This provides us with an alternative to the API+database model we've used for more than a decade and which works very well, but there are times when polling isn't the answer. Decoupling producers of data from consumers has subtle but substantial benefits (data liberation, for one, removing/reducing point-to-point integrations for another).