Hacker News new | ask | show | jobs
A kinda okay text generator (blog.ftql.pw)
77 points by pboyd2 2818 days ago
13 comments

POS tagging is a nice touch, but I've long wondered if there's a way to combine Markov sentence generation with actual grammatical rules -- that if there's an opening parenthesis there has to be a closing one, dependent clauses, etc. It would need to both gramatically parse the inputs as well as produce grammar trees for the outputs that would then be filled in by Markov chains that fit the trees...

Heck, even do it at a level larger than sentences too, so questions are followed by answers, a line of dialog is followed by a response, paragraphs follow a realistic distribution of lengths...

Hmm. A Markov chain where you actually do some kind of search? You build up a chain then backtrack and retry as necessary until you get something meeting the requirements.
See [0] for a method combining branch and prune with Markovian probabilities. I did a hacky version of my interpretation of this work at [1].

[0] https://link.springer.com/article/10.1007/s10601-010-9101-4

[1] https://github.com/kastnerkyle/pachet_experiments

Sounds like getting some sort of context free grammar out of it. There is a LR grammar algorithm that is very fast, you could use a Markov chain to generate and then a CFG to verify
Or a CFG to generate at a high level and a Markov chain to fill in the details
Excellent idea. Then it would actually read like language! But uh, how to choose sentence length and which grammars you want?
Children's poetry makes for fun source texts for these projects. In this case dumping a bunch of old poems generated mostly gibberish but I found the following examples amusing.

While here on my deathbed I try to relate My many misfortunes and miseries great. Poor thoughtless young thing! If I recollect right, I began life in March, on a clear frosty night; And before I could see or was half a week old, I nearly had perished, the barn was so cold. But this chilly spring I got pretty well over, But there wasn't a tree for miles around, They were too frightened to stay on the ground, And moused in the stable, or played in the clover, Or till I was weary, which seldom occurred, Ran after my tail, which I took for a bird

The wind did blow, the cloak did fly, Like streamer long and gay, Till loop and button failing both, At last it flew away. Then might all people well discern The bottles he had slung, A bottle swinging at each side, As hath been said or sung. The dogs did bark, the children screamed, Up flew the windows all, And ev'ry soul cried out, Well done!

When Betty screaming came down stairs, The wine is left behind! Good lack! Quoth he yet bring it me, My leathern belt likewise, In which I bear my trusty sword When I do exercise. Now Mistress Gilpin, careful soul!

There as the mother sits all day, On business from their houses, And late at night returning home, To cheer their babes and spouses; While you and I have oft-times heard How men are killed and undone, By overturns from carriages, By thieves, and fires in London. We know what risks these landsmen run, From noblemen to tailors.

Very nice! Especially "till loop and button failing both at last it flew away"

Feels like it's missing a sentence at the end that rhymes with tailor (eg "and equally assaults us the great big sleep, from skilled oarsmen to new sailors)

Lewis Carrol is actually really fun in just a pure markov chain; usually the diction is sufficient to put someone in the mood for Carrol, which means nobody really expects it to make sense anyways.
One of my favorite parts of Gödel, Escher, Bach was the description of translating Lewis Carroll's Jabberwocky into various languages -- particularly troublesome since the original poem contains few English words to begin with.
I think that just highlights how hard poetry is to translate in general. Carroll picked those words with thoughts to how they sound no less than other poets that restrict themselves to English words.

Translating poetry well where both meaning and sounds are vital can be harder than writing the poem in the first place was.

You’re going to want to look at https://en.wikipedia.org/wiki/Kneser–Ney_smoothing for further improvements on an ngram based approach.
Or even just Stupid Backoff[1]: If you can't find anything of length N, try N-1 but lower the probability by a fixed ratio (0.4).

[1] Page 2, equation 5 of https://www.aclweb.org/anthology/D07-1090.pdf#page=2

I will. Thanks for the suggestion.
Markov chains are cool and text generation is fun for experiments. I would definitely recommend looking at deep learning language models, for a more fruitful investment of time to improve further. (courses on http://www.fast.ai/ are a good starting place)
I too played around with War and Peace...but in the russian.

Shameless plug: My password generator that takes text inputs and generates wordish results: https://cmroanirgo.github.io/wordish/

(It's much simpler and less correct than OP's... but that's the point anyhow)

The site is pre configured to use a few different source text excerpts:

- The Happy Prince, Oscar Wilde

- Matthew 5, King James Bible

- War and Peace, Tolstoy (russian)

- klingon, anonymous web source

- Lorem Ipsum, lorem ipsim generator

A nice experiment! I am always a bit surprised to see that CNNs & company produce much less coherent and natural sounding text than the comparably simple Markov Chain generators.

That said, I markov-generated some text with Nietzsche's "Also sprach Zarathustra" as a source some time ago and it made more sense and was more coherent than the original.

I don't think this is necessarily true. CNNs can produce incredible results, see for example [0]. A snippet from the paper (their code is also available):

""" Prompt: The Mage, the Warrior, and the Priest

Story: A light breeze swept the ground, and carried with it still the distant scents of dust and time-worn stone. The Warrior led the way, heaving her mass of armour and muscle over the uneven terrain. She soon crested the last of the low embankments, which still bore the unmistakable fingerprints of haste and fear. She lifted herself up onto the top the rise, and looked out at the scene before her. [...] """

RNN/LSTM type models can also produce really coherent results on the scales shown in the blog.

Markov chains can be quite useful when you can manually create conditioning, but some of the key power of neural methods is learning to process conditioning and blend in ways that may not be obvious (or require significant domain expertise to grok) when handcrafting. YeGoblynQueenne below discusses a lot of the core issues for Markov chain based approaches.

In particular, plagiaristic sequences are incredibly common in Markovian generation, and though there are some papers on how to deal with this [1] it is not straightforward to decide what plagiarism really means in many contexts. This problem also arises in neural models, but isn't nearly as extreme due to the nature of both learning and the sampling process.

One huge bonus of Markov chains is that controlling exactly what you want or using hard rules is pretty easy, which is definitely NOT the case with neural generation...

This blog post is a really nice run through of how to get decent results from Markov chains, but don't discount neural methods either - they are getting better all the time, if you are willing to deal with the headaches.

[0] Hierarchical Neural Story Generation, https://arxiv.org/abs/1805.04833

[1] Max Order http://www.flow-machines.com/maxorder/

One problem I see in the article above is that the result is evaluated only by eyballing- which is actually a perfectly legitimate evaluation method; humans' language faculty, is, after all, the only process we know of that can correctly recognise and generate natural language expressions (and other, automatic, methods like BLEU scores, often simply automate the eyballing process).

However, if you're working with statistical language modelling, you should probably try to check the statistical properties of the resulting model! For example- measure perplexity, or mutual entropy, etc. This will give you a more concrete, and faster, way to evaluate your model than just generating some text and looking at it, trying to figure out where it came from. I'd suggest the author give that a try- they seem to have dipped a toe into statistical language modelling but be unwilling to go all the way with more advanced approaches (e.g. the cited paper, by Kassarnig, uses "n-grams, Justeson & Katz POS tag filter, recurrent neural networks, and latent Dirichlet allocation" - but the article author doesn't seem to have tried any of that on its own, let alone a combination, other than n-grams). The more advanced material can be daunting, but it's not actually such a huge leap from Markov chains. And the results can sometimes be worth the effort.

. . .

Btw- it sounds to me like the approach taken in the "N-Grams, take 2" section basically boils down to a PCFG (a Probabilistic CFG). It sounds like the probability of the next token is calculated conditional upon the probabilities of all preceding strings of tokens in a sentence, which is what a PCFG does (which is to say, PCFGs are not Markov). With an N high enough, that will, indeed, copy your text verbatim :)

. . .

The author should keep in mind that no matter what you do, Markov chains are always going to either look completely incoherent, or produce exact copies of their training text.

In fact, that's a bit of a problem with most statistical language modellng techniques. Even state of the art systems, that can produce very grammatical text, fail badly when it comes to producing coherent text (e.g. a perfectly reasonably structured recipe, where the listed ingredients are not used in the preparation instructions, etc). And that's something that no statistical measure of fit can capture, unfortunately. So you're left again with eyballing the end results and hoping for the best.

Edit: sorry, I'm babbling a bit. The way to use statistical evaluation metrics, like perplexity, is to first automatically evaluate your generated text, then eyball it to see how the metric matches your own "sense" of the text's grammaticality. So then, if you figure out that a low perplexity score produces more grammatical results, you favour models that produce low perplexity etc.

> they seem to have dipped a toe into statistical language modelling but be unwilling to go all the way with more advanced approaches

That's fair.

Thanks for your suggestions. I clearly have some studying to do.

I feel for fooling a quick skim, getting interpunction/capitalization correct/believable was the biggest improvement.
Ha.. yeah, I left that bit out. It was several dead ends and I finally ended up with something that was stupid simple. It seemed like more trouble to explain than it was worth.
What does "cweep" mean?
If it helps, here's the full paragraph:

  “We’ll send the infantwy down by the swamps,” Denísov continued.
  “They’ll cweep up to the garden; you’ll wide up fwom there with the
  Cossacks”—he pointed to a spot in the forest beyond the village—“and I
  with my hussars fwom here. And at the signal shot...”
What are "sramps"?
Wewease Bwian!
Your prose is too prolix.
Was fun to read. Thankd
Very impressed and delighted with your example results
This is going to be used for building better link farms in 3, 2, 1...