Hacker News
by ncmncm 1757 days ago
It might be worth noting that all the other plays that scored anywhere close to the Scottish play (427, i.e. count/length scaled by 10^4) are much longer. You have to go down to #17 (344) to get to a shorter play; only #6 (397) and #15 (345) approach it. If we scale by length twice (count/length^2), the contrast becomes more stark (retaining original order; the columns are rank, play, count, length, and count/length^2 scaled by 10^8):

    |    1 | macbeth                     |  724 | 16929 | 252 |
    |    2 | henry-v                     | 1065 | 25577 | 162 |
    |    3 | coriolanus                  | 1126 | 27294 | 151 |
    |    4 | loves-labors-lost           |  855 | 21093 | 192 |
    |    5 | henry-viii                  |  962 | 24074 | 165 |
    |    6 | the-merchant-of-venice      |  834 | 20985 | 189 |
    |    7 | henry-iv-part-2             | 1001 | 25762 | 150 |
    |    8 | henry-vi-part-2             |  990 | 25597 | 151 |
    |    9 | hamlet                      | 1142 | 30006 | 126 |
    |   10 | henry-iv-part-1             |  856 | 24100 | 147 |
    |   11 | henry-vi-part-3             |  866 | 24491 | 144 |
    |   12 | antony-and-cleopatra        |  861 | 24465 | 143 |
    |   13 | king-lear                   |  898 | 25661 | 136 |
    |   14 | the-winters-tale            |  854 | 24568 | 141 |
    |   15 | king-john                   |  717 | 20730 | 166 |
    |   16 | cymbeline                   |  959 | 27738 | 124 |
    |   17 | a-midsummer-nights-dream    |  564 | 16377 | 210 |
    |   18 | richard-iii                 |  985 | 28914 | 117 |
    |   19 | richard-ii                  |  753 | 22224 | 152 |
    |   20 | henry-vi-part-1             |  715 | 21575 | 153 |
    |   21 | pericles                    |  605 | 18282 | 181 |
    |   22 | troilus-and-cressida        |  837 | 25810 | 125 |
    |   23 | titus-andronicus            |  659 | 20621 | 154 |
    |   24 | alls-well-that-ends-well    |  724 | 22683 | 140 |
    |   25 | measure-for-measure         |  693 | 21858 | 145 |
    |   26 | the-tempest                 |  518 | 16489 | 190 |
    |   27 | the-comedy-of-errors        |  455 | 14552 | 214 |
    |   28 | the-two-noble-kinsmen       |  735 | 23751 | 130 |
    |   29 | julius-caesar               |  592 | 19251 | 159 |
    |   30 | as-you-like-it              |  664 | 21692 | 141 |
    |   31 | twelfth-night               |  573 | 19675 | 148 |
    |   32 | othello                     |  737 | 25670 | 111 |
    |   33 | much-ado-about-nothing      |  591 | 20843 | 136 |
    |   34 | romeo-and-juliet            |  677 | 23948 | 118 |
    |   35 | the-merry-wives-of-windsor  |  604 | 21603 | 129 |
    |   36 | timon-of-athens             |  504 | 18262 | 151 |
    |   37 | the-taming-of-the-shrew     |  449 | 18709 | 128 |
    |   38 | the-two-gentlemen-of-verona |  404 | 17010 | 139 |
with only 17 and 27 breaking 200, and still well shy of 252.
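
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes both scores for a few rows copied from the table. The function names are made up for illustration, and the scale factors (10^4 and 10^8) and the truncation to an integer are inferred from the numbers shown, not stated anywhere in the comment.

```python
# (rank, play, count, length) copied from a few rows of the table above.
rows = [
    (1, "macbeth", 724, 16929),
    (2, "henry-v", 1065, 25577),
    (17, "a-midsummer-nights-dream", 564, 16377),
    (27, "the-comedy-of-errors", 455, 14552),
]


def per_length(count, length):
    """count/length scaled by 10^4, truncated (assumed scale)."""
    return count * 10**4 // length


def per_length_squared(count, length):
    """count/length^2 scaled by 10^8, truncated (assumed scale)."""
    return count * 10**8 // length**2


for rank, play, count, length in rows:
    print(rank, play, per_length(count, length), per_length_squared(count, length))
```

For these four rows the second score comes out as 252, 162, 210 and 214, matching the last column of the table.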

But the real point of the article is that the oddity of "the" in the frequency table attracted their attention to that word, and led them to identify an actual peculiarity in its usage. To say henry-v demonstrates anything similar, you would need to check if usage in that play is similarly peculiar (which I have not done either).

It seems odd to suggest (as some commenters have done) that the difference was subconscious. My null hypothesis is that peculiarities in usage by a professional wordsmith are deliberate. I expect to see actual evidence that the author didn't know what he was up to.

1 comment

I’m not sure I understand why it’s not only valid but more precise to square the length of the play.

If we assume that length of a play has an influence on the frequency of stop words, shouldn’t we compare samples of each play? (First x pages or y randomly sampled words)
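
A rough Python sketch of what such a comparison could look like, assuming each play is already available as a list of lowercase word tokens. The helper names and the toy token list are hypothetical; nothing here comes from the article's own code.

```python
import random


def rate_in_prefix(tokens, word, n):
    """Fraction of the first n tokens that equal `word`."""
    sample = tokens[:n]
    return sum(t == word for t in sample) / len(sample)


def rate_in_random_sample(tokens, word, n, seed=0):
    """Fraction of n tokens drawn without replacement that equal `word`."""
    n = min(n, len(tokens))  # cannot draw more tokens than the play has
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(tokens, n)
    return sum(t == word for t in sample) / n


# Toy stand-in for a tokenized play (not real data).
toy = "the cat sat on the mat and the dog ran".split()
print(rate_in_prefix(toy, "the", 5))
print(rate_in_random_sample(toy, "the", 5))
```

Using the same n for every play would take length out of the comparison, at the cost of ignoring most of the longer plays.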

It is an observed phenomenon. That doesn't make it a theory; it makes it a generator of hypotheses. It is another chore to figure out ways to test the hypotheses, and more chores to test them.

After a hypothesis passes several different tests, it might be worth publishing, along with the list of rejected hypotheses. Then somebody else might identify a test that it fails, and another hypothesis that could be tested, and might publish that.

Or, more likely, nothing comes of it, and you move on to other phenomena and other hypotheses for them. That's science. It always starts with, "that's odd, I wonder what it means." And, most usually, it seems to just mean "huh."