by a_e_k
1755 days ago
I got curious about how all the other plays ranked. For example, which plays used those words the least? So I downloaded the text files from the Folger Shakespeare Library (https://shakespeare.folger.edu/downloads/txt/shakespeares-wo...) and ran this command to get a rough count of "the" and "th'" vs. the total words:

  for f (*.txt); do
    echo $f
    tr " \r" "\n\n" < $f | grep -A100000 ACT | egrep -v "^[A-Z]+$" | grep -i [A-Z] | egrep -i "^th['e]$" | wc -l
    tr " \r" "\n\n" < $f | grep -A100000 ACT | egrep -v "^[A-Z]+$" | grep -i [A-Z] | wc -l
    echo
  done
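For anyone who finds the pipeline hard to read, here is a rough Python sketch of what I understand it to be doing. The function names `count_the` and `per_10000` are made up for illustration, and this is my reading of the one-liner rather than a verified drop-in replacement: it splits on spaces and line breaks, starts at the first token containing "ACT", drops all-caps tokens (speaker names and headings, but also words like "I" and "O"), keeps only tokens containing a letter, and counts exact matches of "the" or "th'". Tokens with attached punctuation such as "the," are not counted, just as in the shell version.

```python
import re


def count_the(text):
    """Return (hits, total_words), mirroring the shell pipeline's filters."""
    # tr " \r" "\n\n": one token per line, split on spaces, CRs and newlines.
    tokens = re.split(r"[ \r\n]", text)
    # grep -A100000 ACT: effectively everything from the first "ACT" onward.
    start = next((i for i, t in enumerate(tokens) if "ACT" in t), None)
    if start is None:
        return 0, 0
    body = tokens[start:]
    # egrep -v "^[A-Z]+$" drops all-caps tokens; grep -i [A-Z] keeps tokens
    # that contain at least one letter.
    words = [
        t for t in body
        if re.search(r"[A-Za-z]", t) and not re.fullmatch(r"[A-Z]+", t)
    ]
    # egrep -i "^th['e]$": exact, case-insensitive match of "the" or "th'".
    hits = sum(1 for w in words if re.fullmatch(r"th['e]", w, re.IGNORECASE))
    return hits, len(words)


def per_10000(hits, total):
    """Occurrences per 10,000 words, rounded to one decimal place."""
    return round(10000 * hits / total, 1)
```

Calling `count_the(open(path).read())` on each downloaded file and passing the pair to `per_10000` should give numbers in the same spirit as the shell version, though small differences are possible.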
Here are the results the shell command gave me, formatted into a table and sorted by descending frequency:

| Rank | Play | The or Th’ | Words | Per 10000 |
|------+-----------------------------+------------+-------+-----------|
| 1 | macbeth | 724 | 16929 | 427.7 |
| 2 | henry-v | 1065 | 25577 | 416.4 |
| 3 | coriolanus | 1126 | 27294 | 412.5 |
| 4 | loves-labors-lost | 855 | 21093 | 405.3 |
| 5 | henry-viii | 962 | 24074 | 399.6 |
| 6 | the-merchant-of-venice | 834 | 20985 | 397.4 |
| 7 | henry-iv-part-2 | 1001 | 25762 | 388.6 |
| 8 | henry-vi-part-2 | 990 | 25597 | 386.8 |
| 9 | hamlet | 1142 | 30006 | 380.6 |
| 10 | henry-iv-part-1 | 856 | 24100 | 355.2 |
| 11 | henry-vi-part-3 | 866 | 24491 | 353.6 |
| 12 | antony-and-cleopatra | 861 | 24465 | 351.9 |
| 13 | king-lear | 898 | 25661 | 349.9 |
| 14 | the-winters-tale | 854 | 24568 | 347.6 |
| 15 | king-john | 717 | 20730 | 345.9 |
| 16 | cymbeline | 959 | 27738 | 345.7 |
| 17 | a-midsummer-nights-dream | 564 | 16377 | 344.4 |
| 18 | richard-iii | 985 | 28914 | 340.7 |
| 19 | richard-ii | 753 | 22224 | 338.8 |
| 20 | henry-vi-part-1 | 715 | 21575 | 331.4 |
| 21 | pericles | 605 | 18282 | 330.9 |
| 22 | troilus-and-cressida | 837 | 25810 | 324.3 |
| 23 | titus-andronicus | 659 | 20621 | 319.6 |
| 24 | alls-well-that-ends-well | 724 | 22683 | 319.2 |
| 25 | measure-for-measure | 693 | 21858 | 317.0 |
| 26 | the-tempest | 518 | 16489 | 314.1 |
| 27 | the-comedy-of-errors | 455 | 14552 | 312.7 |
| 28 | the-two-noble-kinsmen | 735 | 23751 | 309.5 |
| 29 | julius-caesar | 592 | 19251 | 307.5 |
| 30 | as-you-like-it | 664 | 21692 | 306.1 |
| 31 | twelfth-night | 573 | 19675 | 291.2 |
| 32 | othello | 737 | 25670 | 287.1 |
| 33 | much-ado-about-nothing | 591 | 20843 | 283.5 |
| 34 | romeo-and-juliet | 677 | 23948 | 282.7 |
| 35 | the-merry-wives-of-windsor | 604 | 21603 | 279.6 |
| 36 | timon-of-athens | 504 | 18262 | 276.0 |
| 37 | the-taming-of-the-shrew | 449 | 18709 | 240.0 |
| 38 | the-two-gentlemen-of-verona | 404 | 17010 | 237.5 |
Ignoring the whole log-likelihood stuff and just looking at the simple frequencies, I'm not completely sure that I buy the article's argument. Macbeth does come out on top by my analysis. But some of the other plays use "the" or "th'" nearly as frequently without being particularly creepy. In terms of ratios of the frequencies, Henry V, a history, is only 2.6% lower than Macbeth. And the highest-ranked comedy, Love's Labor's Lost, is just 5.2% lower.
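The percentages are relative differences between the "Per 10000" rates in the table. A small sketch of that arithmetic, where `relative_drop` is just a name I picked for illustration:

```python
def relative_drop(reference_rate, other_rate):
    """Percent by which other_rate falls below reference_rate."""
    return round(100 * (1 - other_rate / reference_rate), 1)


# Per-10000 rates copied from the table above.
macbeth = 427.7
henry_v = 416.4
loves_labors_lost = 405.3

print(relative_drop(macbeth, henry_v))            # Henry V vs. Macbeth
print(relative_drop(macbeth, loves_labors_lost))  # Love's Labor's Lost vs. Macbeth
```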
Like, if we assume that all English text is generated from a weighted distribution over all words, and "the" makes up 3.5% of it, is a 4.3% occurrence rate even significant? (And what would the base occurrence rate even be?)
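One way to put a number on that question is a one-sample z-test for a proportion. This is only a sketch under strong assumptions: the 3.5% baseline is the hypothetical figure from this comment, not a measured rate, and the test treats every word as an independent draw, which real text certainly is not. The counts are Macbeth's from the table upthread, and `proportion_z_test` is a name made up for illustration.

```python
import math


def proportion_z_test(hits, total, baseline):
    """Return (z, two-sided p-value) for observed hits/total vs. a baseline rate.

    Assumes independent draws (a binomial model), which is a simplification
    for natural-language text.
    """
    observed = hits / total
    standard_error = math.sqrt(baseline * (1 - baseline) / total)
    z = (observed - baseline) / standard_error
    # Two-sided p-value from the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Macbeth: 724 hits in 16929 words; 0.035 is the hypothetical baseline.
z, p = proportion_z_test(724, 16929, 0.035)
print(f"z = {z:.2f}, p = {p:.2e}")
```

Under those assumptions the gap would look statistically significant, simply because a play has so many words. That says nothing about whether 3.5% is the right baseline, and by the same test most of the other plays in the table would probably look "significant" against it too, which is sort of the point.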