by a_e_k 1755 days ago
I got curious about how all the other plays ranked. For example, which plays used those words the least? So I downloaded the text files from the Folger Shakespeare Library (https://shakespeare.folger.edu/downloads/txt/shakespeares-wo...) and ran this command to get a rough count of "the" and "th'" vs. the total words:

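    for f in *.txt; do
      echo "$f"
      # Count "the"/"th'": one token per line, skip text before the first ACT,
      # drop all-caps tokens (speaker names) and tokens without letters.
      tr " \r" "\n\n" < "$f" | grep -A100000 ACT | grep -Ev "^[A-Z]+$" | grep -i "[A-Z]" | grep -Ei "^th['e]$" | wc -l
      # Total words, using the same filters.
      tr " \r" "\n\n" < "$f" | grep -A100000 ACT | grep -Ev "^[A-Z]+$" | grep -i "[A-Z]" | wc -l
      echo
    done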
Here were the results that gave me, formatted into a table and sorted by descending frequency:

    | Rank | Play                        | The or Th’ | Words | Per 10000 |
    |------+-----------------------------+------------+-------+-----------|
    |    1 | macbeth                     |        724 | 16929 |     427.7 |
    |    2 | henry-v                     |       1065 | 25577 |     416.4 |
    |    3 | coriolanus                  |       1126 | 27294 |     412.5 |
    |    4 | loves-labors-lost           |        855 | 21093 |     405.3 |
    |    5 | henry-viii                  |        962 | 24074 |     399.6 |
    |    6 | the-merchant-of-venice      |        834 | 20985 |     397.4 |
    |    7 | henry-iv-part-2             |       1001 | 25762 |     388.6 |
    |    8 | henry-vi-part-2             |        990 | 25597 |     386.8 |
    |    9 | hamlet                      |       1142 | 30006 |     380.6 |
    |   10 | henry-iv-part-1             |        856 | 24100 |     355.2 |
    |   11 | henry-vi-part-3             |        866 | 24491 |     353.6 |
    |   12 | antony-and-cleopatra        |        861 | 24465 |     351.9 |
    |   13 | king-lear                   |        898 | 25661 |     349.9 |
    |   14 | the-winters-tale            |        854 | 24568 |     347.6 |
    |   15 | king-john                   |        717 | 20730 |     345.9 |
    |   16 | cymbeline                   |        959 | 27738 |     345.7 |
    |   17 | a-midsummer-nights-dream    |        564 | 16377 |     344.4 |
    |   18 | richard-iii                 |        985 | 28914 |     340.7 |
    |   19 | richard-ii                  |        753 | 22224 |     338.8 |
    |   20 | henry-vi-part-1             |        715 | 21575 |     331.4 |
    |   21 | pericles                    |        605 | 18282 |     330.9 |
    |   22 | troilus-and-cressida        |        837 | 25810 |     324.3 |
    |   23 | titus-andronicus            |        659 | 20621 |     319.6 |
    |   24 | alls-well-that-ends-well    |        724 | 22683 |     319.2 |
    |   25 | measure-for-measure         |        693 | 21858 |     317.0 |
    |   26 | the-tempest                 |        518 | 16489 |     314.1 |
    |   27 | the-comedy-of-errors        |        455 | 14552 |     312.7 |
    |   28 | the-two-noble-kinsmen       |        735 | 23751 |     309.5 |
    |   29 | julius-caesar               |        592 | 19251 |     307.5 |
    |   30 | as-you-like-it              |        664 | 21692 |     306.1 |
    |   31 | twelfth-night               |        573 | 19675 |     291.2 |
    |   32 | othello                     |        737 | 25670 |     287.1 |
    |   33 | much-ado-about-nothing      |        591 | 20843 |     283.5 |
    |   34 | romeo-and-juliet            |        677 | 23948 |     282.7 |
    |   35 | the-merry-wives-of-windsor  |        604 | 21603 |     279.6 |
    |   36 | timon-of-athens             |        504 | 18262 |     276.0 |
    |   37 | the-taming-of-the-shrew     |        449 | 18709 |     240.0 |
    |   38 | the-two-gentlemen-of-verona |        404 | 17010 |     237.5 |
Ignoring the whole log-likelihood stuff and just looking at the simple frequencies, I'm not completely sure that I buy the article's argument. Macbeth does come out on top by my analysis. But some of the other plays seem to use "the" or "th'" nearly as frequently without being particularly creepy. In terms of ratios of the frequencies, Henry V, a history, is only 2.6% lower than Macbeth. And the first comedy, Love's Labor's Lost, is just 5.2% lower.
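
The ratios quoted above follow directly from the table. A minimal Python sketch of the arithmetic, using three rows copied from the table (the dictionary and function names are illustrative and not part of the original command):

```python
# (count of "the"/"th'", total words), copied from the table above
plays = {
    "macbeth": (724, 16929),
    "henry-v": (1065, 25577),
    "loves-labors-lost": (855, 21093),
}

def per_10000(count, words):
    """Occurrences per 10,000 words."""
    return 10000 * count / words

base = per_10000(*plays["macbeth"])
for name, (count, words) in plays.items():
    rate = per_10000(count, words)
    # Relative gap to Macbeth, as a percentage of Macbeth's rate
    gap = 100 * (1 - rate / base)
    print(f"{name:20s} {rate:6.1f} per 10000, {gap:4.1f}% below macbeth")
```

The gap is taken relative to Macbeth's rate, which is how the 2.6% and 5.2% figures come out.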

What would be the correct way of going about assessing statistical significance of these frequencies?

Like if we assumed that all English language is generated from a weighted distribution of all words and “the” is 3.5%, is a 4.3% occurrence rate even significant? (And what even would be the base occurrence rate?)
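
One rough way to put numbers on that question is a z-test on proportions. The sketch below is not the log-likelihood method from the article. It assumes every word is an independent draw (the simplification proposed above), which real text violates, so the p-values will overstate significance. The 3.5% baseline is the hypothetical figure from the question, and the counts come from the first table.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_sample_z(count, words, p0):
    """z statistic for an observed rate against a fixed baseline rate p0."""
    p_hat = count / words
    se = math.sqrt(p0 * (1 - p0) / words)
    return (p_hat - p0) / se

def two_sample_z(c1, n1, c2, n2):
    """z statistic for the difference between two observed rates (pooled)."""
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (c1 / n1 - c2 / n2) / se

# Macbeth against the hypothetical 3.5% baseline
z_base = one_sample_z(724, 16929, 0.035)
# Macbeth against Henry V
z_pair = two_sample_z(724, 16929, 1065, 25577)
print(f"vs 3.5% baseline: z = {z_base:.2f}, p = {two_sided_p(z_base):.2g}")
print(f"vs henry-v:       z = {z_pair:.2f}, p = {two_sided_p(z_pair):.2g}")
```

Under these assumptions Macbeth sits roughly 5.5 standard errors above a 3.5% baseline, but well under one standard error above Henry V, so the answer depends heavily on what the base rate is taken to be.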

It’s quite something that the Scottish play comes out on top, however. It would work with a hypothesis that, for example, Shakespeare was using this pattern subconsciously, whenever the situation called for an eerie mood.

I’d also be interested in seeing whether the 2:1 difference is larger than it is for other authors.

It might be worth noting that all the other plays that scored anywhere close to the Scottish play (427) are much longer. You have to go down to rank 17 (344) to get to a shorter play; only ranks 6 (397) and 15 (345) approach it. If we scale by length twice (count/length^2, shown here multiplied by 10^8), the contrast becomes starker (retaining the original order):

    |    1 | macbeth                     |  724 | 16929 | 252 |
    |    2 | henry-v                     | 1065 | 25577 | 162 |
    |    3 | coriolanus                  | 1126 | 27294 | 151 |
    |    4 | loves-labors-lost           |  855 | 21093 | 192 |
    |    5 | henry-viii                  |  962 | 24074 | 165 |
    |    6 | the-merchant-of-venice      |  834 | 20985 | 189 |
    |    7 | henry-iv-part-2             | 1001 | 25762 | 150 |
    |    8 | henry-vi-part-2             |  990 | 25597 | 151 |
    |    9 | hamlet                      | 1142 | 30006 | 126 |
    |   10 | henry-iv-part-1             |  856 | 24100 | 147 |
    |   11 | henry-vi-part-3             |  866 | 24491 | 144 |
    |   12 | antony-and-cleopatra        |  861 | 24465 | 143 |
    |   13 | king-lear                   |  898 | 25661 | 136 |
    |   14 | the-winters-tale            |  854 | 24568 | 141 |
    |   15 | king-john                   |  717 | 20730 | 166 |
    |   16 | cymbeline                   |  959 | 27738 | 124 |
    |   17 | a-midsummer-nights-dream    |  564 | 16377 | 210 |
    |   18 | richard-iii                 |  985 | 28914 | 117 |
    |   19 | richard-ii                  |  753 | 22224 | 152 |
    |   20 | henry-vi-part-1             |  715 | 21575 | 153 |
    |   21 | pericles                    |  605 | 18282 | 181 |
    |   22 | troilus-and-cressida        |  837 | 25810 | 125 |
    |   23 | titus-andronicus            |  659 | 20621 | 154 |
    |   24 | alls-well-that-ends-well    |  724 | 22683 | 140 |
    |   25 | measure-for-measure         |  693 | 21858 | 145 |
    |   26 | the-tempest                 |  518 | 16489 | 190 |
    |   27 | the-comedy-of-errors        |  455 | 14552 | 214 |
    |   28 | the-two-noble-kinsmen       |  735 | 23751 | 130 |
    |   29 | julius-caesar               |  592 | 19251 | 159 |
    |   30 | as-you-like-it              |  664 | 21692 | 141 |
    |   31 | twelfth-night               |  573 | 19675 | 148 |
    |   32 | othello                     |  737 | 25670 | 111 |
    |   33 | much-ado-about-nothing      |  591 | 20843 | 136 |
    |   34 | romeo-and-juliet            |  677 | 23948 | 118 |
    |   35 | the-merry-wives-of-windsor  |  604 | 21603 | 129 |
    |   36 | timon-of-athens             |  504 | 18262 | 151 |
    |   37 | the-taming-of-the-shrew     |  449 | 18709 | 128 |
    |   38 | the-two-gentlemen-of-verona |  404 | 17010 | 139 |
with only 17 and 27 breaking 200, and still well shy of 252.
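
The last column can be reproduced as follows. This is a sketch: the factor of 10^8 and the truncation to an integer are inferred from the table values rather than stated in the comment, and the names are illustrative.

```python
# (count, total words) for a few rows of the table above
rows = {
    "macbeth": (724, 16929),
    "henry-v": (1065, 25577),
    "a-midsummer-nights-dream": (564, 16377),
    "the-comedy-of-errors": (455, 14552),
}

def scaled(count, words):
    """count / words**2, multiplied by 10**8 and truncated to an integer."""
    return count * 10**8 // words**2

for name, (count, words) in rows.items():
    print(name, scaled(count, words))
```

Note that count/length^2 is the per-word rate divided by the length again, so a shorter play is boosted even when its rate is unchanged.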

But the real point of the article is that the oddity of "the" in the frequency table attracted their attention to that word, and led them to identify an actual peculiarity in its usage. To say henry-v demonstrates anything similar, you would need to check if usage in that play is similarly peculiar (which I have not done either).

It seems odd to suggest (as some commenters have done) that the difference was subconscious. My null hypothesis is that peculiarities in usage by a professional wordsmith are deliberate. I expect to see actual evidence that the author didn't know what he was up to.

I’m not sure I understand why it’s not only valid but more precise to square the length of the play.

If we assume that length of a play has an influence on the frequency of stop words, shouldn’t we compare samples of each play? (First x pages or y randomly sampled words)
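
Both sampling ideas are easy to sketch in Python. The code below assumes each play has already been split into a list of word tokens. The function names and the toy word list are hypothetical stand-ins, not data from the plays.

```python
import random

THE_FORMS = {"the", "th'", "th\u2019"}  # includes the curly-apostrophe spelling

def the_rate(words):
    """Fraction of tokens that are "the" or "th'" (case-insensitive)."""
    return sum(w.lower() in THE_FORMS for w in words) / len(words)

def rate_first_n(words, n):
    """Rate over the first n tokens of a play."""
    return the_rate(words[:n])

def rate_random_sample(words, n, seed=0):
    """Rate over n tokens drawn without replacement (n must not exceed len(words))."""
    rng = random.Random(seed)
    return the_rate(rng.sample(words, n))

# Toy stand-in for a tokenized play; real input would be the filtered word list.
toy_play = ["the", "cat", "sat", "on", "th'", "mat"] * 100
print(rate_first_n(toy_play, 60))
print(rate_random_sample(toy_play, 300, seed=1))
```

Using the same n for every play removes length from the comparison, and repeating the random draw with different seeds gives a feel for how much the rate varies by chance alone.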

It is an observed phenomenon. That doesn't make it a theory, it makes it a generator of hypotheses. It is another chore to figure out ways to test the hypotheses, and more chores testing them.

After one passes several different tests, it might be worth publishing, along with the list of rejected hypotheses. Then somebody else might identify a test that it fails, and another that could be tested, and might publish that.

Or, more likely, nothing comes of it, and you move on to other phenomena and other hypotheses for them. That's science. It always starts with, "that's odd, I wonder what it means." And, most often, it seems to just mean "huh."