| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by muzzio 2845 days ago

Extremely simple one, but my favorite is an algorithm for determining if two words are anagrams of each other:

The Fundamental Theorem of Arithmetic states: "every integer greater than 1 either is a prime number itself or can be represented as the product of prime numbers and that, moreover, this representation is unique, up to (except for) the order of the factors."[1]

So to determine that two words are anagrams of each other, you can assign each letter to a unique prime number (a = 2, b = 3, c = 5 etc.), then compute the product of those numbers, and if they're equal then those two words are anagrams.

[1] - https://en.wikipedia.org/wiki/Fundamental_theorem_of_arithme...

3 comments

amelius 2845 days ago

Nice in theory, but in practice you wouldn't implement it like that, especially if the words can be longer than your machine integer allows. Sorting and comparing is more elegant than invoking a BigNum library, imho (and has smaller footprint). This shows that theoretical elegance != implementation elegance.

LukeShu 2845 days ago

I was curious how soon overflowing a native integer would come up. The "worst case" would be all "z"s (which map to 103), so how many characters does floor(log_103(2^n-1)) get you?

    int32  zzzz      (4 characters)
    uint32 zzzz      (4 characters)
    int64  zzzzzzzzz (9 characters)
    uint64 zzzzzzzzz (9 characters)

But that's the worst case, not many real words have several "z"s in them. How often are real words affected? I did some experiments with /usr/share/dict/words:

    1-( 36123/123115) = 70%  overflow int32
    1-( 39774/123115) = 68%  overflow uint32
    1-(117909/123115) = 4.2% overflow int64
    1-(118533/123115) = 3.7% overflow uint64

mturmon 2845 days ago

Hmm, nice result. That makes me want to map letters onto primes in order of frequency. So "e" is 2, "t" is 3, etc., see: http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/...

romwell 2845 days ago

At this point, you're yak shaving, where the sane thing to do is simply to count the number of times each letter occurs.

Better complexity-wise as well, both in terms of speed and storage: O(n) and O(1), respectively. [1]

The proposed algorithm uses O(n) storage (to store the immense product), and given that integer multiplication complexity is somewhere between O(n) and O(n^2), we end up[2] with at least O(n^2) runtime complexity!

[1] ...assuming the input data is less than 15 million terabytes in size; O(n log n), O(log n) respectively for arbitrary length input. Storing the number of input bytes takes O(log n) space, and integer increment can take O(size).

[2]https://en.wikipedia.org/wiki/Computational_complexity_of_ma...

mturmon 2844 days ago

OK, on my (longer at 235886 words) /usr/share/dict/words, I find that:

  sequential encoding: 21882 words overflow 2**64 (9.3%)
  frequency encoding: 2945 words overflow 2**64 (1.2%)

Some other trivia:

  mean #bits for those words over 64 bits: 
         seq = 72.0; freq = 69.3
  largest #bits: 
         seq = 115.4; freq = 101.0
  word w/ largest #bits: 
         seq = thyroparathyroidectomize [1]
         freq = pathologicopsychological [n/a]

[1] https://www.merriam-webster.com/medical/thyroparathyroidecto...

thijsvandien 2844 days ago

Now add support for Unicode. :)

artemisyna 2845 days ago

Or, for better time complexity (with a bit of extra space) making a map of char -> frequency and comparing the results.

diegoperini 2845 days ago

That's Fourier transform!

DavidSJ 2845 days ago

Funny. :) But a Fourier transform is reversible.

coherentpony 2845 days ago

The title asked for beautiful/elegant algorithms, not efficient ones! :)

TuringTest 2845 days ago

In computing, inefficient algorithms are not beautiful.

pythux 2845 days ago

I wrote some comparison some time ago https://github.com/remusao/anagrams-bench/blob/master/README...

Unfortunately, as you suggest, the complexity of arbitrary precision multiplication kicks in and performance suffers.

brlewis 2844 days ago

One might also choose not to bring in a BigNum library if you don't need to find every anagram. Here's an exhaustive list of anagrams I found with sorting and comparing that a quick Fundamental Theorem implementation missed:

   [ 'basiparachromatin', 'Marsipobranchiata' ]
   [ 'configurationism', 'misconfiguration' ]
   [ 'constructionism', 'misconstruction' ]
   [ 'pericardiacophrenic', 'phrenicopericardiac' ]
   [ 'anatomicophysiologic', 'physiologicoanatomic' ]
   [ 'petrographically', 'pterylographical' ]

brlewis 2844 days ago

Note this is out of 20K anagrams.

Here's my code: https://github.com/brlewis/brlewis.github.io/blob/master/201...

On my machine, Fundamental Theorem of Arithmetic finished in 1.703 seconds, sorting letters in 1.954 seconds.

TN1ck 2845 days ago

This one came up a lot when I was researching the most efficient way to generate anagrams, but the numbers are growing way too fast. In the end, a frequency map and comparing the frequencies worked best. My use case is a bit special, as I'm creating anagram sentences, so a Trie or similar things wouldn't work. Check out https://anagrams.io if you're interested.

_pastel 2845 days ago

I have a pet implementation of the frequency map that I'm overly fond of, for ascii strings:

  (1) keep an array of length 127 that you re-use and set to 0 between calls
  (2) for each character in the first string, increment the array at the character's index
  (3) for each character in the second string, decrement the array at the character's index

If you end up with all 0s, it's an anagram.

idontpost 2844 days ago

Better to use a hash map instead of an array. When iterating the second string, as soon as you see a character without an existing value it's not an anagram. If you finish the second string, check that all values are zero.

This way you only need to check the character values you actually use, and not all 127. It also generalizes trivially to larger character sets.

mark-r 2844 days ago

An array is simply an optimized specialization of a hash map.

panda88888 2844 days ago

His approach is the same as yours. As soon as the algorithm sees a character from second string where the value for that char in the array is zero, the second string is not an anagram of first string and can return false immediately.

DarkWiiPlayer 2845 days ago

Doesn't that mean in the end you have to check 127 values for if they are 0?

Or 64, if you store numbers as 32-bit integers and compare them as 64-bit using a union type.

karmakaze 2844 days ago

You could also keep a counter of the number of non-zero entries and update on zero/non-zero transitions.

romwell 2845 days ago

Given the constraints of the problem, it's the sanest thing to do.