by saltcured 237 days ago
Back in the day, optimizing this would have been about parallel IO and some map-reduce processing: shard the data across a bunch of nodes, have each one effectively run `sort | uniq -c`, and then merge those sorted counts.
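For illustration, here is a minimal single-process sketch of that shard-then-merge shape in Python. The function names (`count_shard`, `merge_sorted_counts`) and the sample shards are hypothetical, and the "nodes" are just in-memory lists; a real cluster would add the parallel IO and data staging, which this does not model.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_shard(lines):
    """Per-node step: roughly `sort | uniq -c` on one shard.

    Returns a list of (line, count) pairs sorted by line.
    """
    return sorted(Counter(lines).items())


def merge_sorted_counts(partials):
    """Merge step: k-way merge of the sorted per-shard counts,
    summing the counts of lines that appear in more than one shard."""
    merged = heapq.merge(*partials, key=itemgetter(0))
    return [
        (line, sum(count for _, count in group))
        for line, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical shards standing in for data spread across three nodes.
shards = [["b", "a", "a"], ["c", "b"], ["a"]]
partials = [count_shard(shard) for shard in shards]
totals = merge_sorted_counts(partials)
```

Because every partial result is already sorted by line, the merge only has to stream through them once, which is what makes the per-node sort worth paying for in this design.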

And then there would be countless arguments about whether you have to count the time it takes to stage the data into the cluster as part of the task completion benchmark...

1 comment

I think you'd still need to go through that if you were really optimizing both `sort` and `uniq` while working within their constraints.

What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`
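To make "functional equivalent" concrete, a rough Python sketch (not the actual implementation being discussed) could count lines with a hash map and sort only the distinct lines by count, instead of sorting the whole input first. The name `count_and_rank` is hypothetical, and breaking ties by line text is an assumption; the real pipeline's tie order depends on locale and on how `uniq -c` pads its output.

```python
from collections import Counter


def count_and_rank(lines):
    """Count duplicate lines, then order them by ascending count.

    Same result shape as `sort | uniq -c | sort -n`, but the input is
    never fully sorted: only the distinct lines are.
    Ties are broken by line text (an assumption, see above).
    """
    counts = Counter(lines)
    return sorted((count, line) for line, count in counts.items())


# Hypothetical input standing in for the lines of a file.
ranked = count_and_rank(["b", "a", "b", "c", "b", "a"])
for count, line in ranked:
    print(f"{count:7d} {line}")
```

The trade-off is memory: the hash map holds every distinct line at once, whereas `sort` can spill to disk.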