by hknmtt 897 days ago

first search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-a...

so you just walk the file: read a chunk, update the averages, and move on. resource usage should be close to nothing and speed should be limited only by your disk IO.
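A minimal sketch of that kind of running update, assuming the input is simply a stream of numbers. The name `running_stats` is hypothetical and the code is not taken from the linked article; it only illustrates keeping a count, mean, min, and max in constant memory.

```python
import math

def running_stats(values):
    """Single pass over an iterable, keeping only count, mean, min, and max."""
    count = 0
    mean = 0.0
    lo = math.inf
    hi = -math.inf
    for x in values:
        count += 1
        # Incremental mean: shift the old mean toward x by 1/count of the gap.
        mean += (x - mean) / count
        lo = min(lo, x)
        hi = max(hi, x)
    return count, mean, lo, hi

# Illustrative input only; a real run would stream values parsed from the file.
count, mean, lo, hi = running_stats([22.5, 23.0, 24.0])
print(count, mean, lo, hi)
```

Because the divisor is the running count rather than a fixed 2, every value ends up with the same weight.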

3 comments

winrid 897 days ago

Yes, just read a file in chunks and spread the math across cores. How many ways could you possibly implement that?? :)
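One possible shape for the chunk-and-merge idea, assuming lines of the form `station;temperature` (the format, the station names, and the function names `summarize` and `merge` are assumptions for illustration). Each `summarize` call touches only its own chunk, so the calls could be handed to separate worker processes; here they run sequentially to keep the sketch short.

```python
def summarize(lines):
    """Partial aggregate for one chunk: station -> [min, max, sum, count]."""
    stats = {}
    for line in lines:
        station, _, raw = line.partition(";")
        value = float(raw)
        entry = stats.get(station)
        if entry is None:
            stats[station] = [value, value, value, 1]
        else:
            entry[0] = min(entry[0], value)
            entry[1] = max(entry[1], value)
            entry[2] += value
            entry[3] += 1
    return stats

def merge(partials):
    """Combine per-chunk aggregates by adding sums and counts."""
    total = {}
    for part in partials:
        for station, (lo, hi, s, n) in part.items():
            entry = total.get(station)
            if entry is None:
                total[station] = [lo, hi, s, n]
            else:
                entry[0] = min(entry[0], lo)
                entry[1] = max(entry[1], hi)
                entry[2] += s
                entry[3] += n
    return total

# Hypothetical chunks of unequal size.
chunks = [
    ["Oslo;1.0", "Cairo;30.0"],
    ["Oslo;3.0"],
]
merged = merge(summarize(chunk) for chunk in chunks)
means = {station: s / n for station, (lo, hi, s, n) in merged.items()}
print(means)
```

Carrying the sum and count per station, rather than a per-chunk average, means the merge step needs no special handling for chunks of different sizes.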

cm2187 897 days ago

Custom number parsing, minimising the number of memory allocations so as not to be punished by the garbage collector. All sorts of micro-optimisations that make those solutions a terrible way to showcase a language (i.e. you can write much clearer and more concise code, but it will obviously be slower).

winrid 896 days ago

I agree that the simplest solution in each language is the best way to compare. However, this problem seems less about showing off Java and more about challenging folks.

cm2187 897 days ago

Looking at some solutions, they seem to include their own double parsing implementation. I built a home-made serializer for CSV files using the default .NET parsing functions, and I find that parsing numbers and dates is by far the slowest part of the process on large files.

hknmtt 897 days ago

actually i think you can also just average each chunk and then fold it into the existing data. read N rows (say they all have one location, to keep it simple), average the chunk, update/save min and max, and move on to the next chunk. do the same there, but now update the average by adding the chunk average to the previously computed average and dividing by two. the result will be the same, and disk IO will be the most limiting aspect. this "challenge" is not really a challenge; there is nothing complicated about it. it just seems "cool" when you say "process 1 billion rows the fastest you can".

sudodudeo 897 days ago

Wouldn't this end up reducing the weight of earlier records by repeatedly dividing them into smaller chunks?

I.e. avg of {22.5, 23, 24} = 23.17... But:

1. 22.5

2. (22.5 + 23)/2 = 22.75

3. (22.75 + 24)/2 = 23.375
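The arithmetic above can be reproduced with a few lines, using the same three values. The variable names are illustrative only.

```python
values = [22.5, 23.0, 24.0]

# Reference: the plain arithmetic mean of all values.
true_mean = sum(values) / len(values)

# Repeated "add and divide by two": each new value gets weight 1/2,
# so older values are halved again on every step (here 1/4, 1/4, 1/2).
halved = values[0]
for x in values[1:]:
    halved = (halved + x) / 2

print(true_mean, halved)
```

The two results differ because the last value carries half of the total weight in the second calculation, instead of one third.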

hknmtt 897 days ago

say you load 1 million records and they average 5.1. you then load another million and they average 4.5. so you get (5.1 + 4.5) / 2 = 4.8. rinse and repeat. as long as you keep the number of records processed per run about the same, your numbers will not be skewed. only the last chunk will most likely be smaller, and it will introduce a small rounding error, e.g. if it has only 10k records instead of 1M. but it is still the simplest solution with a good enough outcome.

hknmtt 896 days ago

essentially that is how integrals are calculated in mathematics, if i remember correctly. you take a curve and divide it into columns; the thinner the column, the smaller the deviation (the curve has round edges, so each bar has an inherent error). you simply calculate each column, total them, and you get the area under the function. same principle as radians with a circle. you are merely splitting the work into smaller pieces that you can process.

jdthedisciple 897 days ago

you have to weight the previous results appropriately then it works
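A short sketch of what that weighting could look like: each average is multiplied by the number of records behind it before combining. The function name `merge_means` is hypothetical, and the chunk sizes reuse the 1M and 10k figures mentioned earlier in the thread.

```python
def merge_means(mean_a, count_a, mean_b, count_b):
    """Combine two averages, weighting each by its record count."""
    total = count_a + count_b
    merged = (mean_a * count_a + mean_b * count_b) / total
    return merged, total

# A full chunk of 1M records followed by a short final chunk of 10k.
weighted, n = merge_means(5.1, 1_000_000, 4.5, 10_000)
unweighted = (5.1 + 4.5) / 2
print(weighted, unweighted)

# Folding values in one at a time recovers the plain mean of {22.5, 23, 24}.
m, c = 22.5, 1
for x in (23.0, 24.0):
    m, c = merge_means(m, c, x, 1)
print(m, c)
```

With the counts carried along, the short final chunk barely moves the result, whereas the unweighted version lets 10k records pull the average as hard as 1M.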
