Hacker News new | ask | show | jobs
by ChancyChance 1134 days ago
What’s nostalgic to me is that this is such a classic Perl pattern: hashing manipulated strings to find relationships. Prior to Perl doing this was painful. Boost wasn’t a thing. Python was in its cradle and Java was still struggling with beans. Perl removed the barriers between complex coding ideas and an implementation that C wasn’t ready for. The time from thought to prototype was near instant compared to current compiled languages.
3 comments

Ummm, the author used awk for the first version in the 1990s.

> This was easy to do, even at the time, when the word list itself, at 2.5 megabytes, was a file of significant size. Perl and its cousins were not yet common; in those days I used Awk. But the task is not very different in any reasonable language:

I know. I’m talking about Perl. Try to keep up.
Do pay attention.

You wrote "Prior to Perl doing this was painful.".

I highlighted how you contradicted the author's claim that it was easy to do using awk. Awk existed long before Perl.

As a concrete example, while my awk skills are extremely rusty, here's a program which will normalize the input line, create a table mapping the normalized name to the matching original lines, then at the end it only displays the ones with at least 5 anagrams.

I tried to avoid modern awk features, like asort, to be something that would have worked in the 1980s:

  {
      # Convert to normal form:
      #  1. Fold to lower case
      #  2. Bin the letters to get frequency counts
      #  3. Only consider lower case ASCII letters
      split(tolower($0), letters, "");
  
      # Ignore asort() in modern awks and do a bin sort instead.
      for (i in letters) {
          c = letters[i];
          repeats[c] = repeats[c] c;
      }
  
      normal_form = "";
      for (i=97; i<=122; i++) {
          c = sprintf("%c", i); # no chr() in a 1980s awk
          normal_form = normal_form repeats[c];
      }
  
      table[normal_form] = table[normal_form] "," $0
      delete repeats;
  }
  END {
      # Only show the ones with at least 5 matches
      for (i in table) {
          match_str = table[i];
          split(match_str, matches, ",");
          num_matches = length(matches)-1
          if (num_matches >= 5) {
              # print the number of matches, then the match string
              printf("%d%s\n", num_matches, match_str);
          }
      }
  }
When I try it on a word list I have handy, here are the most common words:

  % awk -f anagram.awk < words_alpha.txt | sort -n -t, | tail -5
  13,elaps,lapse,leaps,lepas,pales,peals,pleas,salep,saple,sepal,slape,spale,speal
  14,anestri,antsier,asterin,eranist,nastier,ratines,resiant,restain,retains,retinas,retsina,stainer,starnie,stearin
  14,apers,apres,asper,pares,parse,pears,prase,presa,rapes,reaps,repas,spaer,spare,spear
  14,arest,aster,astre,rates,reast,resat,serta,stare,strae,tares,tarse,tears,teras,treas
  15,alerts,alters,artels,estral,laster,lastre,rastle,ratels,relast,resalt,salter,slater,staler,stelar,talers
Certainly Perl is more succinct, though note that even up to Perl 4 in the early 1990s you would need to use the string concatenation method to store the list of matches in the table.

But, "painful"? No. Not to someone who knew how to use awk.

What’s the word for when snark backfires?
snark backfires = barf earns kicks
“Sorry”?
krans
You don't _need_ to hash anything.

The simplest way to do it, is to convert all the words to (normal_form, orig_word) pairs, write the list to a file, then sort it.

It will be trivial to find the words with common normal form after the sort.

(Of course, you wouldn't catch me trying to implement that with C if perl is an option...)

But the normal form as defined by the article is an hash, even if you then use sorting instead of bucketing
The point is that you don’t need a hash table implementation. And the normalization that is used as the sorting key doesn’t otherwise have good hash-like qualities (like constant size and a good distribution) and thus doesn’t especially deserve to be called a hash.
How do you extract the last (largest) entry for each normalized key from the sorted list? What is the command line function?
It can be a simple ~20 line C program that checks whether the previous line has the same normalized key as the current line. It doesn't require hashing. I didn't say you could do it all with standard unix programs.
You pipe the sorted list into awk (for example) and append the second field to a list as long as the value of the first field remains the same. Whenever the value of the first field changes, and in the END block, you output the list (which contains the matching anagrams) and reset it to empty.

No hash table needed, just splitting the line into the two fields, equality comparison, and appending values to a list.

Go try your method and spot the bug. If you can’t, then respond and I’ll tell you. hint: you need one more manipulation to find the answer before or after the sort, there’s a command line version of it but it uses .. a hash.
Even in C it’s not that hard to make a basic hash table. It’s probably like 20 lines of code or something.