by jimmyed
894 days ago
I think the optimal strategy would be to use the "reduce" step from MapReduce. Have threads that each read a portion of the file and add the data to a "list", one per unique name. The threads can then "process" these lists. I don't think we need to sort; that would be too expensive, and a single linear pass should be enough. I can't see how SIMD would help, since we want max/min, which mandate a linear pass anyway.

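A rough sketch of that idea, not a tuned implementation. It assumes each line looks like `name;value` (the format is my assumption, it isn't stated above), and the helper names `aggregate_chunk`, `merge`, and `aggregate` are made up for illustration. Instead of keeping a full list per name, each worker keeps a running min/max/sum/count, which gives the same result with less memory. In CPython the GIL limits how much threads help with CPU-bound work like this, so treat it as showing the structure only.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(lines):
    """Map step: one linear pass over a chunk.

    Keeps [min, max, sum, count] per name. Assumes "name;value" lines.
    """
    stats = {}
    for line in lines:
        name, _, value = line.partition(";")
        v = float(value)
        s = stats.get(name)
        if s is None:
            stats[name] = [v, v, v, 1]
        else:
            if v < s[0]:
                s[0] = v
            if v > s[1]:
                s[1] = v
            s[2] += v
            s[3] += 1
    return stats


def merge(partials):
    """Reduce step: combine the per-chunk dictionaries, no sorting needed."""
    merged = {}
    for stats in partials:
        for name, (lo, hi, total, count) in stats.items():
            m = merged.get(name)
            if m is None:
                merged[name] = [lo, hi, total, count]
            else:
                m[0] = min(m[0], lo)
                m[1] = max(m[1], hi)
                m[2] += total
                m[3] += count
    return merged


def aggregate(lines, workers=4):
    """Split the input into one chunk per worker, aggregate, then merge."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(aggregate_chunk, chunks))
    return merge(partials)
```

Calling `aggregate(lines)` on a list of such lines returns a dict mapping each name to `[min, max, sum, count]`; the mean is `sum / count`. For a real file you would split on byte offsets aligned to line boundaries rather than loading every line into memory first.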
It would have been more interesting with something like a median or k-th percentile, or some other aggregation that isn't as easy to compute.