| HN Mirror



	by jimmyed 894 days ago
	I think the optimal strategy would be to use the "reduce" step in MapReduce. Have threads that each read a portion of the file and append data to a "list", one for each unique name. Then this set of threads can "process" these lists. I don't think we need to sort, that'd be too expensive; a single linear pass would be good. I can't see how we can use SIMD, since we want max/min, which mandate a linear pass anyway.
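A rough sketch of that map/reduce shape, under some assumptions: input lines look like `name;value` (the format isn't stated in the comment), and the helper names `map_chunk`, `reduce_partials`, and `aggregate` are made up for illustration. One swap from the description above: rather than growing a list per name, each worker keeps a running (min, max, sum, count), which is what a linear pass over the list would compute anyway.

```python
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: aggregate one chunk into {name: [min, max, sum, count]}."""
    stats = {}
    for line in lines:
        name, raw = line.split(";")  # assumed "name;value" format
        value = float(raw)
        s = stats.get(name)
        if s is None:
            stats[name] = [value, value, value, 1]
        else:
            s[0] = min(s[0], value)
            s[1] = max(s[1], value)
            s[2] += value
            s[3] += 1
    return stats


def reduce_partials(partials):
    """Reduce step: merge per-chunk results in one linear pass, no sorting."""
    merged = {}
    for part in partials:
        for name, (lo, hi, total, count) in part.items():
            m = merged.get(name)
            if m is None:
                merged[name] = [lo, hi, total, count]
            else:
                m[0] = min(m[0], lo)
                m[1] = max(m[1], hi)
                m[2] += total
                m[3] += count
    # The mean falls out of the merged sum and count.
    return {n: (lo, hi, total / count)
            for n, (lo, hi, total, count) in merged.items()}


def aggregate(lines, workers=4):
    """Split the input into roughly equal chunks, map in threads, then reduce."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_partials(pool.map(map_chunk, chunks))
```

In CPython the GIL means threads won't actually speed up this CPU-bound loop, so treat it as an illustration of the structure rather than of the performance.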

2 comments

Agreed, the aggregations chosen here are embarrassingly parallel; you just keep the count alongside the sum so that partial means can be combined.

Would have been more interesting with something like a median/k-th percentile, or some other aggregation that isn't as easy to merge.
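To make the contrast concrete, here is a tiny example with made-up numbers (the two chunks are purely illustrative): a mean can be rebuilt from per-chunk sums and counts, but combining per-chunk medians does not give the median of the whole data set.

```python
from statistics import mean, median

# Two hypothetical chunks of the same data set.
chunk_a = [1, 2, 9]
chunk_b = [3, 4, 5, 6, 7]

# Mean: per-chunk (sum, count) pairs are enough to merge exactly.
merged_mean = (sum(chunk_a) + sum(chunk_b)) / (len(chunk_a) + len(chunk_b))

# Median: merging the per-chunk medians gives a different answer
# than taking the median over all the values.
median_of_medians = median([median(chunk_a), median(chunk_b)])  # 3.5
true_median = median(chunk_a + chunk_b)                         # 4.5

print(merged_mean == mean(chunk_a + chunk_b))  # True
print(median_of_medians, true_median)          # 3.5 4.5
```

An exact median needs either all the values for a name in one place or some selection scheme across chunks, which is why it would make for a harder challenge.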

Not sure if this is what you meant, but there are SIMD min/max instructions.