by saltcured
237 days ago
Back in the day, optimizing this would have been about parallel IO and some map-reduce processing: data sharded across a bunch of nodes, each effectively doing `sort | uniq -c`, followed by a merge of those sorted counts. And then there would be countless arguments about whether the time it takes to stage the data into the cluster has to count as part of the task completion benchmark...

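For anyone who hasn't seen that pattern, here is a rough single-process sketch of it in Python. The names `count_shard` and `merge_counts` and the toy `shards` data are made up for illustration; a real cluster would run the per-shard step on separate nodes and stream the results, which this does not attempt.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def count_shard(lines):
    """Per-node step, roughly `sort | uniq -c`: sorted (key, count) pairs."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted(lines))]

def merge_counts(shard_counts):
    """Merge step: k-way merge of the sorted per-shard counts, summing equal keys."""
    merged = heapq.merge(*shard_counts, key=itemgetter(0))
    return [(key, sum(n for _, n in group))
            for key, group in groupby(merged, key=itemgetter(0))]

# Hypothetical input: three shards, as if split across three nodes.
shards = [["b", "a", "b"], ["a", "c"], ["b", "c", "c"]]
totals = merge_counts(count_shard(s) for s in shards)
```

Because every shard emits its counts already sorted by key, the merge only ever has to look at the head of each stream, which is what made this cheap to do across nodes.
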
What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`
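
To pin down what "functional equivalent" means here, a minimal Python sketch of the same result, assuming the input fits in memory. `sorted_counts` is a hypothetical name, and it uses a hash table rather than an initial sort, so it is a different technique from the shell pipeline. Breaking ties by key is also an assumption; `sort -n` may order equal counts differently depending on locale.

```python
from collections import Counter

def sorted_counts(lines):
    """Return (count, line) pairs in ascending count order, like `sort | uniq -c | sort -n`."""
    counts = Counter(lines)  # one pass over the input, no initial sort needed
    # Only the distinct lines are sorted; ties on count fall back to the line itself.
    return sorted((n, line) for line, n in counts.items())
```

For example, `sorted_counts(["b", "a", "b", "c", "b", "a"])` gives `[(1, "c"), (2, "a"), (3, "b")]`.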