by paladin314159 3954 days ago
Author of the post here. Happy to talk about how we've designed/built our architecture at Amplitude!
3 comments

How did you guys split the databases per customer? Is it all one big stream of data for you or does it get split at a pretty early level? Is data of multiple customers in every database or do you maintain a cluster per customer?
Most of our databases are multi-tenant, so a single cluster will handle all customer data. The exception is Redshift, which has a separate cluster for each customer since we allow them to have direct access to it (https://amplitude.com/blog/2015/06/05/optimizing-redshift-pe...).
Could you store the sets in PostgreSQL arrays (to remove row overhead; the maximum field size is 1GB) and build some efficient union and intersect functions so you wouldn't have to unnest?
We tried a variety of PostgreSQL-based approaches, including this one. Unfortunately, inserting into a set stored as an array requires an O(n) membership check on every insertion, which means a set of n elements takes O(n^2) time to construct. That is very inefficient.
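To make the cost concrete, here is a minimal sketch in plain Python (not PostgreSQL, and not Amplitude's code) that mimics array-backed set insertion and counts the comparisons made by the membership checks. The function names are hypothetical and only for illustration.

```python
def build_set_with_array(items):
    """Mimic array-backed set insertion: scan the array before every append."""
    arr = []
    comparisons = 0
    for x in items:
        found = False
        for y in arr:  # O(n) membership check
            comparisons += 1
            if y == x:
                found = True
                break
        if not found:
            arr.append(x)
    return arr, comparisons


def build_set_with_hash(items):
    """Hash-based set for contrast: each insertion is O(1) on average."""
    seen = set()
    for x in items:
        seen.add(x)
    return seen


n = 1000
arr, comparisons = build_set_with_array(range(n))
# For n distinct items the scans cost 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 comparisons.
print(len(arr), comparisons)
print(len(build_set_with_hash(range(n))))
```

With distinct inputs, the comparison count grows quadratically with n, while the hash-based version does a constant amount of work per element on average.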
Did you use Camus for ETL, and if so, did you have to modify it to work with S3?
We don't use Camus; IIRC, it didn't exist at the time that we built most of the infrastructure. We just read data directly out of Kafka using client libraries.
Apologies for the delayed response; I'm guessing you won't see it, but without Camus in place, did you do anything to ensure exactly-once semantics when moving the data to the batch layer?

For the real-time layer, I don't see it as mission critical for most data sets to be 100% correct, but for the ETL part of the process, the guarantees provided by Camus (ensured, I believe, by the OutputCommitters part of MapReduce) are invaluable.
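For readers unfamiliar with the commit idea being asked about, here is a rough sketch of one common way to get effectively-once output in a batch load: write each batch to a temporary file, then publish it with an atomic rename under a name derived from its offset range, so a retried batch is a no-op. This is not Camus's code, not Hadoop's OutputCommitter API, and not what Amplitude does; `commit_batch` and the file naming are hypothetical. It also assumes a local filesystem where rename is atomic, which object stores such as S3 do not offer natively.

```python
import os
import tempfile


def commit_batch(out_dir, start_offset, end_offset, records):
    """Write records for [start_offset, end_offset) and publish them atomically.

    Returns True if this call committed the batch, False if it was already there.
    """
    name = f"batch-{start_offset:012d}-{end_offset:012d}.txt"
    final_path = os.path.join(out_dir, name)
    if os.path.exists(final_path):
        return False  # a previous attempt already committed this offset range

    # Write to a temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)  # atomic publish step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True


with tempfile.TemporaryDirectory() as out_dir:
    print(commit_batch(out_dir, 0, 3, ["a", "b", "c"]))  # first attempt commits
    print(commit_batch(out_dir, 0, 3, ["a", "b", "c"]))  # retry is skipped
    print(os.listdir(out_dir))
```

Because readers only ever see fully written files, and the file name encodes the offset range, a consumer that crashes and replays the same range does not produce duplicates in the output directory.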