Hacker News new | ask | show | jobs
by philsnow 1264 days ago
There are conduit actions (like dedup, sort) that require all the previous action's output to be ready and available before they can start producing any output. All the discussion of GC-friendliness of caches makes me think that there aren't any conduits that have more data than can fit in memory on a single machine.

If you had conduits that were too big to fit in memory at once, would you (channable) stream them to local disk (either explicitly or just using virtual memory)? Or would you be able to distribute work between multiple machines with a cluster-aware Conduit type?

Your scheduler could split the job up into multiple machines and run the same Conduit pipeline on all the machines, and only the conduit steps that need to communicate with each other would do so.

Separate question, do you have any actions that produce more output than their input? I could imagine some customers might find it useful to generate the cartesian product of two inputs, or the power set of one input.

3 comments

> All the discussion of GC-friendliness of caches makes me think that there aren't any conduits that have more data than can fit in memory on a single machine.

Yep! That's currently still the case. We do have some ideas to put caches on disk using mmapped files, so that you have fast access when it all fits in memory but the OS can also drop them when it wants to.

For the moment we just use instances with 128GB memory, and those can still fit the datasets of the biggest customers that we have right now. Datasets go up to 10s of GBs, so it's not really 'big data' that needs to be distributed across machines. Due to the regular occurrence of aggregations (sort/group/window/deduplicate) that require at least _some_ synchronization, we only have small sections that can be completely parallelized. It's already a challenge to use all the cores on a single machine in an efficient manner, and I don't think we'll achieve much by using multiple machines for a single job. We've discussed a lot of this in an earlier blog post here: https://www.channable.com/tech/why-we-decided-to-go-for-the-...

> Separate question, do you have any actions that produce more output than their input?

Yep, we don't have cartesian products but we do have a 'split' action. Typical usage might be that a customer has a "sizes" field with values like "M,L,XL" and that they would split on that field so that a single item becomes three items for the separate sizes. The increase in the number of items is usually limited, and the increase in memory usage is even smaller because at most points during the data processing we only store the changed fields, and refer to the original item for the rest. In these cases multiple items will point to the same original item.

> makes me think that there aren't any conduits that have more data than can fit in memory on a single machine

Just because you can dedup a conduit doesn't mean you have to. We use conduits for streaming gigabytes-large files while staying within megabytes of memory use – ensuring that is one of the main selling points of libraries like conduit: https://github.com/snoyberg/conduit#readme

found https://utdemir.com/posts/ann-distributed-dataset.html , which makes me think that kind of distributed approach could be workable for channable (the default existing backend spins up lambda workers and stores intermediate results in S3), but you'd need to write a different backend that farms out work to whatever worker machines you have available
The distributed backend part could be done via a lower-level set of distributed primitives from https://hackage.haskell.org/package/distributed-process