by joshmillard
4003 days ago
Your wager is good on both counts. The current corpus is based on the first 14 months of strips, which comes out to about 10K total words for Calvin (who has more lines than the rest of the cast combined in that chunk of strips). That's not enough to generate much variety on anything other than very common word combinations. Strings of prepositions and articles are the most likely inflection points, where you'll commonly see two distinct phrases glued together. If I get the whole strip run into the corpus, that'll kick Calvin up to something more like 80-90K words, which will help with the variety a good bit, and the other characters will have a better shot at the kind of variety Calvin has now, but it's still a relatively small training set. By comparison, I've done some Markov model experiments based on multi-million-word corpora, and that gets a lot further into the territory of regularly producing satisfyingly weird disjunctures.