Hacker News
by barrkel 2244 days ago
In order to determine the distinct items, the items need to be deduplicated. Generally that's done in one of two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table does expected O(1) work per item (O(n) overall, versus O(n log n) for the sort), but the sort is easier to parallelize without sharing mutable state and has more established algorithms to use when spilling to disk.
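For illustration only, here is a minimal in-memory Python sketch of the two approaches. The function names `distinct_hash` and `distinct_sort` are made up for this example, and a real database engine would also have to handle parallelism and spilling to disk, which this ignores.

```python
def distinct_hash(items):
    """Hash-based dedup: expected O(1) per item, keeps first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:  # skip items already seen
            seen.add(item)
            out.append(item)
    return out


def distinct_sort(items):
    """Sort-based dedup: O(n log n) sort, then one scan that skips duplicates."""
    out = []
    for item in sorted(items):
        # After sorting, duplicates are adjacent, so comparing with the
        # last emitted item is enough.
        if not out or item != out[-1]:
            out.append(item)
    return out
```

Note that the hash version returns items in first-seen order while the sort version returns them in sorted order; both contain the same set of distinct values.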
1 comment

There is a third way: keep the data pre-sorted in the database (via an index).
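As a rough sketch of why pre-sorted data helps: if the values already arrive in index order, the sort step disappears and only the duplicate-skipping scan remains. The Python below assumes a plain sorted iterable stands in for an index scan; `distinct_presorted` is a hypothetical name, not any database's API.

```python
from itertools import groupby


def distinct_presorted(sorted_items):
    """Yield each distinct value from an already-sorted iterable.

    Runs in O(n) time and O(1) extra memory, because duplicates are
    adjacent and groupby collapses each run of equal values.
    """
    for key, _run in groupby(sorted_items):
        yield key
```

Because it is a generator, this streams results without holding the input in memory, but it is only correct if the input really is sorted (or at least has equal values grouped together).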