

by dialtone 3680 days ago

We have an internal framework that compiles a state machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same goes for the others, such as the Go one. So before optimizing with a lower-level language or a special DSL, I would check whether Python or one of the other supported languages already satisfies your requirements.

Continuous data is handled by sharding TrailDBs across some fields in our log lines, so that all the related events for a cookie on a given day land in the same shard. The shard mapping is the same every day, so we can just download the same shard IDs from S3 and process the files sequentially using our DSL. With a bit of code you can fully automate the whole process of downloading from S3 and processing; this is in fact what we do with our data pipeline[0][1].
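
To make the sharding idea concrete, here is a rough Python sketch of a stable cookie-to-shard mapping. It is not our actual code: the shard count, the `cookie` field name, the hash choice, and the S3 key layout are all assumptions made up for illustration, and it does not touch the TrailDB API at all.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # hypothetical shard count, not taken from the real pipeline


def shard_for(cookie_id, num_shards=NUM_SHARDS):
    """Map a cookie to a shard id that is identical on every day and every machine."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not give a stable mapping.
    digest = hashlib.md5(cookie_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(events, num_shards=NUM_SHARDS):
    """Group one day's events by shard, so each cookie's events stay together."""
    shards = defaultdict(list)
    for event in events:
        # "cookie" is an assumed field name for the sharding key.
        shards[shard_for(event["cookie"], num_shards)].append(event)
    return shards


def shard_key(day, shard_id):
    """Hypothetical S3 key layout: one file per (day, shard id)."""
    return f"{day}/shard-{shard_id:04d}.tdb"
```

Because the mapping depends only on the cookie, a worker that owns shard 3 can fetch `shard_key(day, 3)` for each day in turn and see every event for its cookies, in order, without talking to any other worker.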

[0]: http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-d...

[1]: http://tech.adroll.com/blog/data/2015/10/15/luigi.html