Hacker News new | ask | show | jobs
by hknmtt 893 days ago
actually i think you can also just average each chunk and then add it to existing data. like read N rows(say all have one location to keep it simple), average the data from the chunk, update/save min and max, move on to next chunk, do the same but now update the average by adding to existing/previously computed average and divide by two. the result will be the same - disk IO will be the most limiting aspect. this "challenge" is not really a challenge. there is nothing complicated about it. it just seems "cool" when you say "process 1 billion rows the fastest you can".
1 comments

Wouldn't this end up reducing the weight of earlier records by repeatedly dividing them into smaller chunks?

I.e. avg of {22.5, 23, 24} = 23.17... But:

1. 22.5

2. (22.5 + 23)/2 = 22.75

3. (22.75 + 24)/2 = 23.375

say you load 1 million records and you average them at 5.1. you then load another million and average them at 4.5. so you 5.1+4.5=9.6/2=4.8. rinse and repeat. as long as you keep the amount of records processed per each run about the same, your numbers will not be skewed. only the last chunk will most likely be smaller and it will introduce small rounding error, like if it has only 10k records instead of 1M. but still it is the simplest solution with good enough outcome.
essentially that is how integrals are calculated in mathematics if i remember correctly. you take a curve and divide it into columns, the thinner the column the smaller the deviation(because the curve has round edges so your bar will have inherent error) and you simply calculate each column and then total it and you get the body/volume of the function. same principle like radians with circle. you are merely splitting the work into smaller pieces that you can process.
you have to weight the previous results appropriately then it works