Hacker News new | ask | show | jobs
by alakin 2799 days ago
Is most of the intermediate processing done in memory, or is it limited by hd write speed?
1 comments

I'm Derek, one of the co-founders--excellent question!

The former. PipelineDB performs aggregations in memory on microbatches of events, and only merges the aggregate output of each microbatch with what's on disk. This is really the core idea behind why PipelineDB is so performant for continuous time-series aggregation. Microbatch size is configurable: http://docs.pipelinedb.com/conf.html.

Can you say a bit more about "performant" or point me to some information? I haven't found any yet. I'm processing millions of protobufs per second and would love to get away from batch jobs to do some incredibly basic counting -- this seems like a fit conceptually...If its a fit, any recommendations on the best way to get those protobufs off a kafka stream and into pipelinedb would be great, too!
Performance depends heavily on the complexity of your continuous queries, which is why we don't really publish benchmarks. PipelineDB is different from more traditional systems in that not all writes are all created equal, given that continuous queries are applied to them as they're received. This makes generic benchmarking less useful, so we always encourage users to roughly benchmark their workloads to really understand performance.

That being said, millions of events per second should absolutely be doable, especially if your continuous queries are relatively straightforward as you've suggested. If the output of your continuous queries fits in memory, then it's extremely likely you'd be able to achieve the throughput you need relatively easily.

Many of our users use our Kafka connector [0] to consume messages into PipelineDB, although given that you're using protobufs I'm guessing your messages require a bit more processing/unpacking to get them into a format that can be written to PipelineDB (basically something you can INSERT or COPY into a stream). In that case what most users do is write a consumer that simply transforms messages into INSERT or COPY statements. These writes can be parallelized heavily and are primarily limited by CPU capacity.

Please feel free to reach out to me (I'm Derek) if you'd like to discuss your workload and use case further, or set up a proof-of-concept--we're always happy to help!

[0] https://github.com/pipelinedb/pipeline_kafka

That's awesome! If you don't mind - one more q.. I see that stream-stream joins are not yet supported (http://docs.pipelinedb.com/joins.html#stream-stream-joins). Can you comment on when you think this feature cold land or is it still a ways off?
Sure! So stream-stream JOINs actually haven't been requested by users as much as you'd think. Users have generally been able to get what they need by using topologies of transforms [0], output streams, and stream-table JOINs. Continuous queries can be chained together into arbitrary DAGs of computation, which turns out to be a very powerful concept when mapping out a path from raw input events to the desired output for your use case.

The primary issue in implementing stream-stream JOINs is that we'd essentially need to preemptively store every single raw event that could be matched on at some point in the future. Conceptually this is straightforward, but on a technical level we just haven't seen the demand to optimize for it.

That being said, you could just use a regular table as one of the "streams" you wanted to JOIN on and then use an stream-table JOIN. As long as the table side of the JOIN is indexed on the JOIN condition, an STJ would probably be performant enough for a lot of use cases. With PostgreSQL's increasingly excellent partitioning support this is becoming especially practical.

I also suspect that this is an area where integration with TimescaleDB could be really interesting!

[0] http://docs.pipelinedb.com/continuous-transforms.html

Just out of curiosity, do you have a specific use case that necessitates stream-stream JOINs, or were you just exploring the docs and wondering about this?
My use case is pretty much parallel time series alignment with several layers of aggregation. I guess I perceive stream-stream joins as an easy way for me to wrap my head around how to structure my compute graph, but it seems doable with the method mentioned by @grammr. I'd hope for an interface roughly like "CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val, fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key = fast_str.key)". I do realize there are some tough design decisions for a system like this, but I'd also like to drop my wacky zmq infrastructure ;)