

	by paladin314159 3954 days ago
	Author of the post here. Happy to talk about how we've designed/built our architecture at Amplitude!

3 comments

tinco 3954 days ago

How did you guys split the databases per customer? Is it all one big stream of data for you or does it get split at a pretty early level? Is data of multiple customers in every database or do you maintain a cluster per customer?


paladin314159 3954 days ago

Most of our databases are multi-tenant, so a single cluster will handle all customer data. The exception is Redshift, which has a separate cluster for each customer since we allow them to have direct access to it (https://amplitude.com/blog/2015/06/05/optimizing-redshift-pe...).


ddorian43 3954 days ago

You could store the sets in PostgreSQL arrays (to remove row overhead; the maximum field size is 1 GB) and build some efficient union and intersect functions so you wouldn't have to unnest?


paladin314159 3954 days ago

We tried a variety of PostgreSQL-based approaches, including this one. Unfortunately, inserting into a set stored as an array requires an O(n) membership check, which means a set of n elements takes O(n^2) time to construct, which is very inefficient.
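
To make the cost concrete, here is a minimal sketch in plain Python (not PostgreSQL, and not our actual code) of an array-backed set next to a hash-backed one. The function names are made up for illustration; the point is only that a linear membership scan on every insert adds up to roughly n^2/2 comparisons.

```python
def build_array_set(values):
    """Array-style set: every insert first scans the existing elements."""
    items = []
    comparisons = 0
    for v in values:
        found = False
        for existing in items:  # O(n) membership check
            comparisons += 1
            if existing == v:
                found = True
                break
        if not found:
            items.append(v)
    return items, comparisons


def build_hash_set(values):
    """Hash-based set: each insert is O(1) on average."""
    seen = set()
    for v in values:
        seen.add(v)
    return seen


# Inserting 1000 distinct values costs 0 + 1 + ... + 999 comparisons.
items, comparisons = build_array_set(range(1000))
print(comparisons)  # 499500
```

Both functions produce the same set of elements; only the amount of work differs, and the gap grows quadratically with the size of the set.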


optimusclimb 3954 days ago

Did you use Camus for ETL, and if so, did you have to modify it to work with S3?


paladin314159 3954 days ago

We don't use Camus; IIRC, it didn't exist at the time that we built most of the infrastructure. We just read data directly out of Kafka using client libraries.


optimusclimb 3952 days ago

Apologies for the delayed response. I'm guessing you won't see it, but without Camus in place, did you do anything to ensure exactly-once semantics when moving the data to the batch layer?

For the real-time layer, I don't see it as mission critical for most data sets to be 100% correct, but for the ETL part of the process, the guarantees provided by Camus (ensured by the OutputCommitters part of MR, I believe) are invaluable.
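
To illustrate the kind of guarantee I mean, here is a rough sketch of one common pattern, assuming a local directory stands in for the batch store and that offsets are simple integers. This is not how Camus or Amplitude actually do it, and `commit_batch` and `next_offset` are hypothetical names. The idea is to name each output file after the offset range it covers and publish it with an atomic rename, so a replayed batch is a no-op and the published files themselves record where to resume.

```python
import json
import os
import tempfile


def next_offset(out_dir, partition):
    """Recover the resume point from published file names (0 if none)."""
    prefix = f"p{partition}-"
    ends = [
        int(name[: -len(".jsonl")].split("-")[2])
        for name in os.listdir(out_dir)
        if name.startswith(prefix) and name.endswith(".jsonl")
    ]
    return max(ends, default=0)


def commit_batch(out_dir, partition, start_offset, records):
    """Publish one batch idempotently and return the next offset to read."""
    end_offset = start_offset + len(records)
    name = f"p{partition}-{start_offset:012d}-{end_offset:012d}.jsonl"
    final_path = os.path.join(out_dir, name)
    if os.path.exists(final_path):
        return end_offset  # replayed batch: already published, do nothing
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
    return end_offset


with tempfile.TemporaryDirectory() as out_dir:
    batch = [{"event": "a"}, {"event": "b"}, {"event": "c"}]
    commit_batch(out_dir, 0, 0, batch)
    commit_batch(out_dir, 0, 0, batch)  # simulated retry after a crash
    published = sorted(os.listdir(out_dir))
    resume_at = next_offset(out_dir, 0)
```

A crash before the rename leaves only a `.tmp` file, which readers ignore; a crash after it leaves a complete file, and the retry is skipped. The sketch assumes a retried batch starts at the same offset and contains the same records, which holds if the consumer always resumes from `next_offset`.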
