| Having a possibility to update (query) output with new input data rather than process the whole input again even if the changes are very small is indeed a very useful feature. Assume that you have one huge input table and you computed the result consisting of a few rows. Now you add 1 record to the input. A traditional data processing system will again process all the input records while the differential system will update the existing output result. There are the following difficulties in implementing such systems: o (Small) changes in input have to be incrementally propagated to the output as updates rather than new results. This changes the paradigm of data processing because now any new operator has to be "update-aware" o Only simple operators can be easily implemented as "update-aware". For more complex operators like aggregation or rolling aggregations, it is frequently not clear how it can be done conceptually (efficiently) o Differential updates have to be propagated through a graph of operations (topology) which makes the task more difficult. o Currently popular data processing approaches (SQL or map-reduce) were not designed for such a scenario so some adaptation might be needed Another system where such an approach was implemented, called incremental evaluation, is Lambdo: https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last! Yet, this Python library relies on a different novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations: set operations and functions operations, as opposed to traditional approaches based on only set operations. A new implementation is here: https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce. Yet, currently incremental evaluation is implemented only for simple operations (calculated columns). |