Hacker News new | ask | show | jobs
by wodenokoto 238 days ago
To me, the really big win would be _not_ to have to sort at all. Have an option to keep first or last duplicate and remove all others while keeping line order is usually what I need.
2 comments

That's easy to do if you're keeping the first duplicate. It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

You could do an annotating pass for learning which of each line is the last one, and then a followup pass for printing (or otherwise echoing) only the lines that are the last of their kind. Technically still faster than sorting.

You could also keep the information on last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass sort the map by earliest last occurrence. That will get you the lines in the right order, but you had to do a sort. If the original input was mostly duplicates, this is probably a better approach.

You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.

> It becomes complex if you're keeping the last duplicate, because every time you find a duplicate you have to go back through your "output" and delete the earlier occurrence.

Can't you reverse file | keep first occurrence | reverse output?

Is reversing the file an improvement over making two passes?
I've written this kind of function so many times it's not funny. I usually want something that is fed from an iterator, removes duplicates, and yields values as soon as possible.
I've added this functionality to `hist-0.1.5` with a benchmark of other tools that do this on the CLI