Hacker News new | ask | show | jobs
by hnfong 1145 days ago
You don't _need_ to hash anything.

The simplest way to do it, is to convert all the words to (normal_form, orig_word) pairs, write the list to a file, then sort it.

It will be trivial to find the words with common normal form after the sort.

(Of course, you wouldn't catch me trying to implement that with C if perl is an option...)

2 comments

But the normal form as defined by the article is an hash, even if you then use sorting instead of bucketing
The point is that you don’t need a hash table implementation. And the normalization that is used as the sorting key doesn’t otherwise have good hash-like qualities (like constant size and a good distribution) and thus doesn’t especially deserve to be called a hash.
How do you extract the last (largest) entry for each normalized key from the sorted list? What is the command line function?
It can be a simple ~20 line C program that checks whether the previous line has the same normalized key as the current line. It doesn't require hashing. I didn't say you could do it all with standard unix programs.
You pipe the sorted list into awk (for example) and append the second field to a list as long as the value of the first field remains the same. Whenever the value of the first field changes, and in the END block, you output the list (which contains the matching anagrams) and reset it to empty.

No hash table needed, just splitting the line into the two fields, equality comparison, and appending values to a list.

Go try your method and spot the bug. If you can’t, then respond and I’ll tell you. hint: you need one more manipulation to find the answer before or after the sort, there’s a command line version of it but it uses .. a hash.