
by vlahmot 3081 days ago

I don't have any specific experience with emails, but any time I need to move a lot of data around I go with Apache Kafka and Apache Flume.

Write all of your emails into a Kafka topic from your webapp. Read from the topic to do processing. Use Flume to sync results back to your webapp DB.
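As a rough sketch of the "write emails into a topic" step: the snippet below assumes the kafka-python client, a broker at `localhost:9092`, and a topic named `emails`. None of these come from the setup described here; they are placeholders for illustration. The helper names `serialize_email` and `publish_emails` are likewise made up.

```python
import json


def serialize_email(email: dict) -> bytes:
    """Encode one email record as UTF-8 JSON bytes for the Kafka message value."""
    return json.dumps(email, sort_keys=True).encode("utf-8")


def publish_emails(emails, topic="emails", servers="localhost:9092"):
    """Send each email dict to the topic.

    Needs the kafka-python package and a reachable broker; the topic name
    and broker address are assumptions, not values from the original post.
    """
    # Imported here so the module still loads where kafka-python is absent.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=serialize_email,
    )
    for email in emails:
        producer.send(topic, value=email)
    producer.flush()  # block until buffered messages have been sent
    producer.close()
```

The webapp would call something like `publish_emails([{"to": "a@example.com", "subject": "hi"}])`; everything downstream only ever reads from the topic.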

1) For this I would probably use something like Chef/Ansible but I don't know the first thing about configuring email servers. You could have something that wakes up, reads the latest config off a topic, and then applies that config via a config management tool.

2) You can throw Apache Spark onto the Kafka stream to calculate these aggregations.

3) Flume can read the emails and then save them back to wherever you need (this is typically S3/Postgres for me). Flume can scale out over the Kafka topic naturally using the same consumer group ID.
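To illustrate why a shared consumer group ID gives you scale-out: Kafka hands each partition of a topic to exactly one consumer in a group, so adding agents with the same group ID splits the partitions between them. The toy model below uses a plain round-robin split; Kafka's real assignors differ in detail, and the function and agent names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group (simplified model)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One agent reads all six partitions; three agents read two each.
print(assign_partitions(range(6), ["flume-1"]))
print(assign_partitions(range(6), ["flume-1", "flume-2", "flume-3"]))
```

One consequence worth keeping in mind: agents beyond the number of partitions sit idle, so the partition count caps how far this scales.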

I like this approach because you can start cheaply and scale it easily: if you don't have the ops resources to run Kafka, start with Kinesis streams instead, and run Spark in standalone mode until you need a cluster.

With Spark you can do your statistics there (streaming over a time window or in batch) and then sink them to your stats DB.
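For a feel of what "streaming over a time window" computes, here is a plain-Python sketch of a tumbling-window count. It is not Spark code; in Spark the same idea would be a grouped aggregation over a time window. The event shape (`(epoch_seconds, key)` pairs) and the 60-second window are assumptions chosen for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key); events are (epoch_seconds, key) pairs."""
    counts = Counter()
    for timestamp, key in events:
        # Align each timestamp down to the start of its fixed-size window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "sent"), (59, "sent"), (60, "sent"), (61, "bounced")]
print(tumbling_window_counts(events))
```

Each resulting `(window_start, key)` row is what you would then write to the stats DB.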

With the Flume/Kafka combo you can treat Kafka as the "channel", and you get some nice transaction functionality out of Flume that makes handling failures a breeze.
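The failure handling boils down to: take a batch from the channel, try to write it to the sink, commit only on success, and roll back otherwise so the batch is retried. The toy below models just that idea in memory. `ToyChannel` and `drain_once` are made-up names, and this is not Flume's actual API or its Kafka channel implementation.

```python
class ToyChannel:
    """In-memory stand-in for a channel with take/commit/rollback semantics."""

    def __init__(self, events):
        self._queue = list(events)
        self._in_flight = []

    def take(self, count):
        """Hand out up to `count` events and remember them until commit/rollback."""
        batch, self._queue = self._queue[:count], self._queue[count:]
        self._in_flight = batch
        return batch

    def commit(self):
        """The sink accepted the batch, so forget it."""
        self._in_flight = []

    def rollback(self):
        """The sink failed, so put the batch back at the front of the queue."""
        self._queue = self._in_flight + self._queue
        self._in_flight = []

    def pending(self):
        return list(self._queue)


def drain_once(channel, sink, batch_size=2):
    """Take one batch, write it to the sink, commit on success, roll back on failure."""
    batch = channel.take(batch_size)
    try:
        sink(batch)
    except Exception:
        channel.rollback()
        return False
    channel.commit()
    return True
```

With this shape a sink outage loses nothing: the batch goes back on the queue and the next `drain_once` call picks it up again, which means the sink should tolerate seeing the same event twice.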

It does take some tooling/monitoring to run confidently, and the whole Apache "big data" ecosystem is daunting at first, but it's well worth it in my opinion.

1 comment

herbst 3081 days ago

This sounds super interesting. I've never worked with any of these, but this sounds more or less like what I need. Kudos.
