by schmatz 3844 days ago
Sure! We batch together data to load into warehouses by time and a few other properties. Usually, to figure out what objects are in S3, you have to issue an S3 list objects command. That operation tends to be relatively slow, especially if there are many objects.

Instead, when we put a new object, we update a table in Aurora which tracks all of the relevant objects. That way, we can query information like "what objects were uploaded in a certain time range" very quickly.
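
To make the idea concrete, here is a minimal sketch of such an index. The table name `s3_objects`, its columns, and the helper functions are all hypothetical, and `sqlite3` stands in for Aurora only so the example runs on its own; the real schema isn't described here.

```python
import sqlite3


def create_index(conn):
    # Hypothetical schema: one row per S3 object, keyed by bucket and key.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS s3_objects (
            bucket      TEXT NOT NULL,
            key         TEXT NOT NULL,
            uploaded_at TEXT NOT NULL,  -- ISO 8601 UTC timestamp
            PRIMARY KEY (bucket, key)
        )
        """
    )
    # Secondary index on the timestamp so range queries avoid a full scan.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_uploaded_at ON s3_objects (uploaded_at)"
    )


def record_put(conn, bucket, key, uploaded_at):
    # Called alongside the S3 put; replaces any existing row for the same object.
    conn.execute(
        "INSERT OR REPLACE INTO s3_objects (bucket, key, uploaded_at) "
        "VALUES (?, ?, ?)",
        (bucket, key, uploaded_at),
    )


def objects_between(conn, start, end):
    # Half-open range [start, end). ISO 8601 strings in one fixed format
    # sort chronologically, so plain string comparison works.
    rows = conn.execute(
        "SELECT bucket, key FROM s3_objects "
        "WHERE uploaded_at >= ? AND uploaded_at < ? "
        "ORDER BY uploaded_at",
        (start, end),
    )
    return rows.fetchall()
```

The point of the sketch is that "what was uploaded in this window" becomes one indexed range query instead of a paginated listing of the bucket.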


How do you ensure that Aurora and S3 stay in sync?
We have a worker consuming S3 events and updating the index. On a related note, we have experimented with Lambda to do something similar; AWS has done a fantastic job integrating their products :)
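
A rough sketch of what the event-consuming side could look like, assuming the standard S3 event notification JSON layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`, `eventTime`, `eventName`). The function name and the choice of `eventTime` as the indexed timestamp are assumptions for illustration, not a description of the actual worker.

```python
from urllib.parse import unquote_plus


def index_rows_from_event(event):
    """Turn an S3 event notification into (bucket, key, event_time) rows."""
    rows = []
    for record in event.get("Records", []):
        # Only object creations add to the index in this sketch;
        # other event types (e.g. removals) are skipped.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        rows.append((s3["bucket"]["name"], key, record["eventTime"]))
    return rows
```

Each returned row would then be written to the index table by whatever database client the worker uses.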
Cool, that part makes sense, but how are you ensuring they stay in sync?

That is, what happens if a db write fails? How are you handling concurrent updates? Is there a reconcile process that runs periodically?

(those details would be an awesome engineering blog post :) )

Edited to add:

Any reason that AWS Lambda didn't work out? Was it due to the public endpoint requirements?

Double Edited to add:

I totally geek out on composing AWS Services, and this is fascinating to me.

This is a good idea for a blog post! The way we have set up the system ensures that we requeue upon failure and that concurrent updates are not an issue.
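
The details aren't spelled out here, but one common way to get both properties is at-least-once delivery combined with idempotent writes. The sketch below is an assumption along those lines, not the actual setup: `drain`, `make_upsert`, and `max_attempts` are hypothetical names, a `deque` stands in for the real queue, and a dict stands in for the index table.

```python
from collections import deque


def make_upsert(index):
    """Return an idempotent update: applying a message twice changes nothing."""
    def upsert(message):
        # Keyed by (bucket, key), so a retry or duplicate overwrites the same row.
        index[(message["bucket"], message["key"])] = message["uploaded_at"]
    return upsert


def drain(queue, apply_update, max_attempts=5):
    """Process (message, attempts) pairs, requeueing any message that fails."""
    dead = []
    while queue:
        message, attempts = queue.popleft()
        try:
            apply_update(message)
        except Exception:
            if attempts + 1 < max_attempts:
                # Put the message back so a later pass can retry it.
                queue.append((message, attempts + 1))
            else:
                # Give up after max_attempts and set it aside for inspection.
                dead.append(message)
    return dead
```

Because the write is keyed on the object itself, two workers handling the same message, or one worker handling it twice after a requeue, leave the index in the same state.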

We are eagerly waiting for Lambda VPC support!

I'm happy you're interested in this sort of stuff; shoot me an email, let's chat :) michael@segment.com