|
|
|
|
|
by yoricksijsling
1262 days ago
|
|
> All the discussion of GC-friendliness of caches makes me think that there aren't any conduits that have more data than can fit in memory on a single machine. Yep! That's currently still the case. We do have some ideas to put caches on disk using mmapped files, so that you have fast access when it all fits in memory but the OS can also drop them when it wants to. For the moment we just use instances with 128GB memory, and those can still fit the datasets of the biggest customers that we have right now. Datasets go up to 10s of GBs, so it's not really 'big data' that needs to be distributed across machines. Due to the regular occurrence of aggregations (sort/group/window/deduplicate) that require at least _some_ synchronization, we only have small sections that can be completely parallelized. It's already a challenge to use all the cores on a single machine in an efficient manner, and I don't think we'll achieve much by using multiple machines for a single job. We've discussed a lot of this in an earlier blog post here: https://www.channable.com/tech/why-we-decided-to-go-for-the-... > Separate question, do you have any actions that produce more output than their input? Yep, we don't have cartesian products but we do have a 'split' action. Typical usage might be that a customer has a "sizes" field with values like "M,L,XL" and that they would split on that field so that a single item becomes three items for the separate sizes.
The increase in the number of items is usually limited, and the increase in memory usage is even smaller because at most points during the data processing we only store the changed fields, and refer to the original item for the rest. In these cases multiple items will point to the same original item. |
|