Hacker News new | ask | show | jobs
by BeeOnRope 2357 days ago
You could use a faster sort implementation, such as radix sort which is O(n) and also probably faster in practice when well-implemented.

One option is this one [1] which I actually wrote as part of this exact task: de-duplicating a list of integers, as part of working on [2].

I wrote about the types of speedups you can expect with radix sort here [3] - it depends on the size of the input elements and their distribution (e.g., if many top bits are zero, radix sort will be much faster), but in my test case I see a ~5x speedup for moderate or large input sizes (thousands of elements or more).

Of course, the same could be said about hash tables...

[1] https://github.com/travisdowns/sort-bench/blob/master/radix7...

[2] https://lemire.me/blog/2019/05/07/almost-picking-n-distinct-...

[3] https://travisdowns.github.io/blog/2019/05/22/sorting.html

3 comments

Radix sort is not O(1), did you mean O(n)? And even then it's only true for a constant k (where k is the size of the possible key space).
Right, O(n).

O(1), hah, I wish!

Yes, there is the caveat of constant k, but in practice this is usually implied for integer keys e.g., the int32 integers used in the OP.

If the keys become arbitrarily long, hash tables also become O(n log k) since you have to hash and compare log k bits of key state.

Patricia is O(N) in the size of the input data to be sorted, even without a bound on the key size, as are several more recently discovered suffix array sorting algorithms. Some other radix sorts are not, it's true.
Patricia tries are less local than typical Radix tree implementations.

In typical usage of thousands of items, locality wins, low constant factors win. Simple lazily defragmented list can be faster than constant size hash table thanks to constant factors, faster than sorting.

Unique objects have a perfect hashing function, but it still has to be fast. Sometimes making it an intrusive structure is a major speedup due to cache locality - let the unique object track where it is.

If you cannot know the maximum size (which you will with unique objects), then get more adaptive.

I did some work on suffix arrays, but didn't come across sorting algorithms.

Would you mind linking a few papers?

They're pretty new!

I think the first linear-time suffix-sorting algorithm was the "skew algorithm" introduced in "Linear work suffix array construction," J. Kärkkäinen, P. Sanders, and S. Burkhardt. Journal of the ACM, 53(6):918–936, 2006, although I think Kärkkäinen and Sanders published it in 2003 ("Simple linear work suffix array construction", J.C.M. Baeten et al. (Eds.): ICALP 2003, LNCS 2719, pp. 943–955, 2003.)

SA-IS, from 2009, has a vulgar explanation (by Satan!) at https://zork.net/~st/jottings/sais.html; he cites the paper as “Linear Suffix Array Construction by Almost Pure Induced-Sorting” by G. Nong, S. Zhang and W. H. Chan, which seems to be http://ge-nong.googlecode.com/files/Linear%20Suffix%20Array%... (404; from https://code.google.com/archive/p/ge-nong/). Nong seems to have gone on to do related external-sorting work until at least 2014, which is of course very important if you want to use suffix arrays to index large corpuses.

In 2016 Gonzalo Navarro and a couple of other guys published https://arxiv.org/abs/1607.04346, "Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time", J. Ian Munro, Gonzalo Navarro, Yakov Nekrich

(Submitted on 15 Jul 2016 (v1), last revised 14 Nov 2016 (this version, v2)). This constructs not only the suffix array but in fact an entire compressed index similar to the FM-index. This seems to follow up 2014 work by Belazzougui.

I think there's a third totally different linear-time suffix-array construction algorithm from the mid-oughties but I can't remember it.

See https://qr.ae/TVdxq0

Paper A (Two efficient algorithms for linear time suffix array construction, https://scholar.google.com/scholar?as_sdt=0,5&btnG=&hl=en&q=...) is the main research field of Pro. Ge Nong (issng@mail.sysu.edu.cn, Sun Yat-sen University, Guangzhou, 中山大学). It is also his first paper in this field. Sequent papers relevant of him are all based on this one. Paper B is Space Efficient Linear Time Construction of Suffix Arrays(http://alumni.cs.ucr.edu/~rakthant/cs234/03_KA_Simple%20Line...). Paper A is found suspected of plagiarizing the idea of the algorithm of paper B.

Years ago, there were analyses by the researchers in the same field. See Suffix Array Construction: KA’s algorithm and Induced Sorting, which is faster?(https://yangzhe1990.wordpress.com/2012/11/04/ka-and-induced-...) It gives a very polite comment in English after analysis. And 最好写的suffix tree(suffix array)算法?(https://yangzhe1990.wordpress.com/2012/09/14/suffix-array-su...) The analysis in Chinese points out that Pro. Ge Nong's paper is suspected of plagiarism.

-----------------------------

In addition:

See https://qr.ae/TV09Mt

I started my Ph.D. program at the school of data and computer science, Sun Yat-sen University(SYSU, Guanzhou), in 2012. Student ID:12110646. I am one of the first batch Ph.D. students(total 2) of Pro. Ge Nong (issng@mail.sysu.edu.cn).

In 2012, after I entered into SYSU, Pro. talked to me times implying bribe demand, no avail. Then, I had to research alone for 7 years and must be close to his research field according to the clauses of the university, no direction, no advice, no machine, no support even there is national funding provided to the supervisor for each PhD student.

In Apr. 2019, in the case of meeting the requirement of graduation, my last chance to apply for the degree, Pro. Ge Nong refused to sign relevant documents. And the local police station was asked to treat me as a concerned-object after my complaint to the department head.

Later in Apr. 2019, I complained to the President's Office and the Office of Government Ethics of the university, then my thesis was sent out for review, by the department, later than all the other applicants.

In May 2019, after the review results of all the other applicants came out, the department asked me to cooperate with their investigation otherwise they would not let me know my review results. During the inquiry, they told me that I did not pass the review and asked me to write a declaration “The case of implying bribe demand with regards to Pro. is insufficient of evidence and falls short of facts”, otherwise I can only apply for the certificate of completion, not the certificate of graduation. Yield to the pressure, I had to give in.

In Aug. 2019, my thesis was sent out for review again. On Sep. 2, I was told that I did not pass the review again and can not continue the progress of graduation. Comparing with all the other applications of this time, the review time cycle of my thesis is abnormally different. The graduate school is suspected of operating under the table. Meanwhile, I was asked to apply for the certificate of completion by the department again.

In Oct. 2019, I was rejected to apply for graduation and asked to apply for the certificate of completion or withdrawal, and leave the campus. Otherwise, I would be dropt out with an official document promulgated by the university.

In the case of meeting the requirement of graduation, the whole process of my application for graduation suffered resistance and backroom operation time and time again from the beginning, just because of not meeting the professor’s bribe demand, and about to being dropt out by the university in the end. I do not know whether all the thing is normal. I either do not know whether Pro. Ge Nong is qualified to be a teacher.

The issues were complained to the Ministry of Education of China in early Sep. 2019. I am waiting for the reply.

Appendix

1. Different from many kindred tragedies, I am an alive sample at present and suffering from institutionalized pressure and threat. Two places of names in the post are hidden for circumventing illegal persecution in the name of law.

2. Any legal aid is appreciated.

Can you also hash every key to effectively turn them into a number, and then perform radix sort?

If you hash with SHA-256 then any sorting operation seems like it’d just take O(256N)=O(N).

What am I missing? Clearly this cannot work because otherwise everyone would do this as we accept O(NlogN) as the norm!

> What am I missing? Clearly this cannot work because otherwise everyone would do this as we accept O(NlogN) as the norm!

Let's assume a perfect hash that never collides or leaves gaps, the best case for this.

For a radix sort in some base with K digits, N can only go so high before you have duplicates. Specifically, base^K >= N

Take the log of both sides. K >= logN

So with arbitrarily large numbers the time taken to sort is actually NlogN.

With fixed-size numbers it's "O(K*N)" but K is larger than logN would be.

I see. Then Radix Sort never really was faster than traditional sorts if we go by Big-O notation alone, right?

If I understand your post correctly, Radix sorts in O(log(maxsize) * N) instead of O(NlogN). So why do people say Radix Sort is O(N)?

People say that because it's true, for a bounded key size, and in practice many radix sort implementations have a large fixed key size, so the log factor just doesn't come up.

E.g., if you use a 64-bit key, with a default implementation, you support up to 2^64 elements with your O(n) sort. No one has a machine with more than 2^64 elements today, so the the sort is in practical terms very much like O(n).

OTOH the O(n log n) of comparison based sorts are very real: the log term is in n, the number of inputs, and it is very obvious when you plot sort performance.

What is interesting though is when you start optimizing radix sort in certain ways, e.g., eliminating passes based on the high bits being zero (in an LSD sort) or using an MSD sort which stops when the bucket size gets smaller than some threshold, the log n factor actually appears again: since you often need ~log n passes to complete the sort rather than say a fixed number like log 64 (base = digit bits) = 8 for 64-bit keys. The base of the logarithm is larger though (2^radix bits), so the hidden constant factor is lower than comparison sort where the base is generally 2.

One answer is that if you allow duplicates then it can handle larger amounts of items without any slowdown.

Another answer is that people are silly and constant vs. log factors aren't intuitive.

Calculating SHA256 is neither O(1) nor cheap.

Your idea is interesting and might be useful in some applications! But calculating a SHA256 takes time linear in the length of the thing you're hashing. Even if you assume fixed-size keys, SHA256 isn't fast. It's not designed to be fast.

You'd likely need an extremely high N before O(N) SHA256's is faster than O(N log N) comparisons.

> But calculating a SHA256 takes time linear in the length of the thing you're hashing.

Ahem, so does hashing.

And so does sort comparison.

The GP's idea is actually quite effective for medium-to-large objects.

As a bonus, you can cache and store the SHA256 of each object for future in-set testing and uniqueness-finding without looking at the objects themselves. (You cannot do this with non-cryptographic hashes or sorting by comparisons.)

This is basically what Git does.

Well SHA256 is a red herring here: one could suggest any suitable fast hash, non-crytopgraphic hash instead.

Yes, it takes time linear in the length of the thing you are sorting, but comparison sort is generally worse in that respect: comparison also takes up to linear time in the comprands, and you have to compare each element not just once but log n times [1].

So the time to hash all the elements once (if the hash is not already cached in the element) is going to be a small part of the sort time.

[1] There are ways around repeatedly re-comparing the entire key, but they work mostly in specialized cases.

You shouldn't accept O(n log n) as the norm: radix sort often gives you O(n) in practice (yes it is really something like O(n log k) for key size k, but k is often fixed, e.g., at 64 bits).

I think radix sorts are probably less popular since, the standard for sorting has always been to provide a "comparator" that compares two objects, and radix sort wants something different entirely: an key you can extract bits from instead. So radix sort was never a drop-in replacement e.g., for qsort in C.

Often you can provide such a key object (and indeed in trivial cases like sorting integers, the entire object is simply the key), but sometimes it would need to be longer than 32 or 64 bits, so the best interface isn't clear: especially in oldschool languages like C which want to pass function pointers as the comprand.

This was kind of cemented when e.g., C++ standard library made == and < the standard things you need to implement not just for std::sort, but also std::map and std::set and many algorithms that rely on order. It's similar in a way to how std::unordered_set/map wasn't a thing until much later when the hashing concept was introduced.

When you can use it, though, radix sort is great: sometimes an order of magnitude faster than comparison sort.

Isn't the OP's "hash-unique" technique basically a radix sort?
A little bit, yeah: it's like a bit like single-pass radix sort, i.e., where the "digit" in the radix sort is selected to be large enough that one digit encompasses the whole key. Another difference is bucket management strategy is different: using a fixed set of buckets with chaining rather than counting ahead of time the buckets you'll need and laying them out linearly.

Note that locality of reference is so important that even though "single pass" radix sort does the least work, the ideal radix turns out to be 6-10 bits usually, even if that means e.g., 5-10x more instructions executed for 64-bit keys: because sparsely writing into buckets spread across memory is just really slow.

Interesting, thanks for the explanation!
Not really, no. The only parallel is that you're putting things into buckets, but other than that they're very different algorithms.
Wouldn't the complexity, number of movements, etc also be the same? What's the difference?

Why wouldn't a radix sort approach suffer from the same disadvantages as the hash unique approach (e.g. locality of reference)?

It's worth giving the Wiki page a quick read, it's very accessible: https://en.wikipedia.org/wiki/Radix_sort

TL;DR: I guess the hash unique approach is technically a radix sort in base-64 if you hash the inputs before applying the radix sort, but a realistic radix sort will use far fewer buckets giving it much better cache locality and won't unnecessarily hash the buckets.

Also it's possible to write a radix sort that doesn't use any extra indirection, but that's a much more complex implementation and I doubt it really has any performance benefits.