by jflatow 3675 days ago
I'm so glad AdRoll finally open-sourced this. TrailDB is by far the easiest way to process trillions of events on a single machine. Can't wait for them to start publishing their internal ecosystem around it too.

Thanks! I hope we can open-source the rest of the stack now that the core libraries are out.

Btw, I am happy to answer any questions about the project here.

It looks like TrailDB solves a particular problem (which is actually quite common, btw) the right way. However, the language barrier makes it hard to use as a DB. I believe the Python binding is nice for a development environment, but in production we would probably want a lower-level language for performance reasons, because we'll be dealing with billions of events on a single machine. Do you plan to implement a query language, such as SQL, that compiles to native code?

How do you handle continuous data with TrailDB? It seems that you store raw events in Kinesis, buffer them, and periodically write TrailDB files to S3. When you want to process events for a specific user, those events might be spread across arbitrary TrailDB files, so the timeline would be mixed up. Do you merge TrailDB files on a single instance when processing the data, or do you have a sharding mechanism?

We have an internal framework that compiles a state machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same can be said for the others, such as the Go one. So before optimizing with a lower-level language or a special DSL, I would first check whether Python or one of the other supported languages really fails to satisfy your requirements.
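
To give a feel for what a state machine query over a trail looks like, here is a rough plain-Python sketch. It is not the internal DSL mentioned above (the thread does not show it) and it does not use the TrailDB bindings. The event shape, the event names `impression` and `click`, the function `clicked_after_impression`, and the one-hour window are all assumptions made up for illustration.

```python
from typing import Iterable, Tuple

# Assumed event shape: (unix timestamp, event type). Not a TrailDB type.
Event = Tuple[int, str]


def clicked_after_impression(trail: Iterable[Event], window: int = 3600) -> bool:
    """Return True if a click follows an impression within `window` seconds.

    `trail` must already be ordered by timestamp, as the events of a
    single user would be when read back in sequence.
    """
    state = "start"
    seen_at = 0
    for timestamp, event_type in trail:
        if event_type == "impression":
            # Remember the most recent impression.
            state, seen_at = "impression_seen", timestamp
        elif event_type == "click" and state == "impression_seen":
            if timestamp - seen_at <= window:
                return True
            # Click came too late; wait for the next impression.
            state = "start"
    return False
```

Because the function only needs one time-ordered pass per user, the same logic could be ported to any of the supported languages if plain Python turns out to be too slow for a given workload.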

Continuous data is handled by sharding TrailDBs on some fields in our log lines. This way, all the related events for a cookie on a given day belong in the same shard. The shard mapping is the same each day, so we can just download the same shard IDs from S3 and process the files sequentially using our DSL. With a bit of code you can fully automate this process of downloading from S3 and processing, which is in fact what we do with our data pipeline[0][1].

[0]: http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-d...
[1]: http://tech.adroll.com/blog/data/2015/10/15/luigi.html
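
The sharding idea described above can be sketched in plain Python. This is only a minimal illustration, assuming the shard is chosen by hashing the cookie ID. The shard count, the key layout returned by `shard_key`, and the function names are hypothetical and do not come from the thread or from the TrailDB API.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # assumed value; the thread does not state a shard count


def shard_for(cookie_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a cookie ID to a shard using a hash that is stable across runs.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep the mapping identical every day.
    """
    digest = hashlib.sha256(cookie_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_key(day: str, shard_id: int) -> str:
    """Hypothetical object-store key for one day's file of a given shard."""
    return f"traildb/{day}/shard-{shard_id:04d}.tdb"


def partition(events):
    """Group (cookie_id, timestamp, payload) tuples by shard.

    Within each shard the rows are sorted by cookie, then by timestamp.
    """
    shards = defaultdict(list)
    for cookie_id, timestamp, payload in events:
        shards[shard_for(cookie_id)].append((cookie_id, timestamp, payload))
    for rows in shards.values():
        rows.sort(key=lambda row: (row[0], row[1]))
    return dict(shards)
```

Since `shard_for` depends only on the cookie ID, a consumer that wants the history of one cookie over several days only has to fetch the same shard ID for each day and read the files in date order.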

It sounds like TrailDB works particularly well with storing data in S3, so what do you use to buffer incoming data before it is written to TrailDB? Kinesis?

Our infrastructure is in AWS, so S3 was a natural choice for us. You can easily use any other object store or filesystem with TrailDB.

We use Kinesis amongst other things to stream raw data to S3.