It looks like TrailDB solves a particular problem (which is actually quite common, btw) the right way. However, the language barrier makes it hard to use as a DB. I believe the Python binding is nice for a development environment, but in production we would probably want a low-level language for performance reasons, because we'll be dealing with billions of events on a single machine. Do you plan to implement a query language, such as SQL, that compiles to native code?
How do you handle continuous data with TrailDB? It seems that you store raw events in Kinesis, buffer them, and periodically write TrailDB files to S3. When you want to process events for a specific user, the events could normally be in any TrailDB file, so the timeline would be mixed up. Do you merge the TrailDB files into a single instance when processing the data, or do you have a sharding mechanism?
We have an internal framework that compiles a state machine DSL into lower-level code to execute queries on TrailDBs. However, the Python binding is fairly thin, and the same can be said for the others, like the golang one. So before optimizing with a lower-level language or a special DSL, I would first check whether Python or one of the other supported languages really fails to meet your requirements.
Continuous data is handled by sharding TrailDBs on some fields in our log lines, so that all the related events for a cookie on a given day belong to the same shard. The shard mapping is the same every day, so we can just download the same shard ids from S3 and process the files sequentially using our DSL. With a bit of code you can make this whole process of downloading from S3 and processing completely automated; this is in fact what we do with our data pipeline[0][1].
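To make the sharding idea concrete, here is a minimal sketch in plain Python. It is not the actual pipeline described above: the shard count, the `cookie` and `time` field names, the hash choice, and the S3 key layout are all assumptions made up for illustration. The only point is that a stable hash of the cookie gives the same shard id every day, so one user's events always land in the same file.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # hypothetical shard count, fixed so the mapping never changes


def shard_for(cookie, num_shards=NUM_SHARDS):
    """Map a cookie to a shard id with a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to get the same answer on every run.
    """
    digest = hashlib.md5(cookie.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(events, num_shards=NUM_SHARDS):
    """Group events by shard and sort each shard's events by time."""
    shards = defaultdict(list)
    for event in events:
        shards[shard_for(event["cookie"], num_shards)].append(event)
    for bucket in shards.values():
        bucket.sort(key=lambda e: e["time"])
    return shards


def shard_key(day, shard_id):
    """Build a hypothetical S3 object key for one day's shard."""
    return f"{day}/shard-{shard_id:04d}.tdb"
```

With a layout like this, processing one user over a date range means computing `shard_for(cookie)` once and fetching that same shard id for each day, instead of scanning every file.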
It sounds like TrailDB works particularly well with storing data in S3, so what do you use to buffer incoming data before it is written to TrailDB? Kinesis?