Hacker News
by potatote 4131 days ago
Can someone explain why one can't simply average the individual averages, as the author wrote below?

"No, we can't run averages on worker nodes, and then average those out. We need to have each worker node compute their sum(order_value) and count(order_value), and then sum(sum()) / sum(count()) on the coordinator node."

Thank you.

4 comments

It's not just averages, it's division in general.

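Division is not commutative, as the article says. More precisely, a ratio of sums is not the same as the average of per-node ratios unless every node holds the same number of records. A simple example referring to the article's diagram of boxes: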

orders_2013 has sum(price) = 10, with 3 records

orders_2014 has sum(price) = 11, with 5 records

orders_2015 has sum(price) = 31, with 7 records

Average on each node, and average them:

( (10/3)+(11/5)+(31/7) ) / 3 = 3.32063492063

Sum the price individually on each node, take the counts on each node, sum them on the master node, and divide on the master node:

(10+11+31)/(3+5+7) = (10+11+31)/15 = 3.46666666667

Hence, dividing on each node and then averaging the results is not the same as dividing once across all orders. (Replace my use of "division" with "average" and it's the same concept.)
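For anyone who wants to check the arithmetic above, here is a minimal Python sketch. It assumes each node reports only a (sum, count) pair; the dictionary and function names are made up for illustration and are not taken from the article.

```python
from fractions import Fraction

# Hypothetical per-node partial aggregates: (sum(price), count(price)),
# using the numbers from the example above.
partials = {
    "orders_2013": (10, 3),
    "orders_2014": (11, 5),
    "orders_2015": (31, 7),
}

def average_of_averages(parts):
    """Average on each node, then average those averages (the wrong way)."""
    avgs = [Fraction(s, c) for s, c in parts.values()]
    return sum(avgs) / len(avgs)

def combined_average(parts):
    """Sum the sums and the counts, then divide once on the coordinator."""
    total = sum(s for s, _ in parts.values())
    count = sum(c for _, c in parts.values())
    return Fraction(total, count)

naive = average_of_averages(partials)
correct = combined_average(partials)

print(float(naive))    # roughly 3.3206
print(float(correct))  # roughly 3.4667
```

Exact fractions are used only so the comparison is not blurred by floating-point rounding; plain floats would show the same gap.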

Set 1 (5,4,3) = 4 average

Set 2 (5,7) = 6 average

Average of average (4,6) = 5

Average of Set 1 + Set 2 (5,4,3,5,7) = 4.8
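A small Python sketch of the same two sets, under the assumption that each set's size is known. It also shows that an average of averages does come out right once each average is weighted by its set's count, which is just sum(sum()) / sum(count()) in disguise. The variable names are illustrative only.

```python
set1 = [5, 4, 3]
set2 = [5, 7]
groups = [set1, set2]

def mean(xs):
    """Plain arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)

# Unweighted: treats both sets as equally important regardless of size.
unweighted = mean([mean(g) for g in groups])

# Weighted by set size: each average counts once per record it represents.
weighted = sum(mean(g) * len(g) for g in groups) / sum(len(g) for g in groups)

# Ground truth: average over all records pooled together.
overall = mean(set1 + set2)

print(unweighted, weighted, overall)
```

The unweighted result only matches the overall average when all sets have the same number of records.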

Because (2+3+4)/3 != ((2+3)/2 + 4/1)/2