by frisco 3938 days ago
Fun hack. If anything, it highlights how compelling deep learning and RNNs are: no messing with NLP, no messing with building other features or adding up classifiers, etc. The manual feature engineering means this approach might work better on a smaller dataset, but even then probably not.

For comparison, training Andrej Karpathy's RNN code (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) on the "HarryPotter(xxlarge).txt" (76K) file using the default hyperparameters and a batch size of 25 gets me:

  > But Atfa the loom proset! No contarin — mibll,’s just pucking to live
  > note left them hard and fitther, clooked of course little happered to
  > trige on the fistpened. Their knew Harry mear from the shind-beas
  > eveided, at Uncle Vernon’s thepped to spept were pelled and beadn
  > Harry, distine dy use. Harry had in a amalout, into the fish sfary door.
The difference here is tokenizing on words vs letters: the RNN code is trying to learn the structure of English from scratch, whereas the code here gets to work with well-formed words from the beginning. But otherwise, the results in the linked post are about as silly semantically:

  > Input: "Harry don't look"
  > Output: "Harry don't look , incredibly that a year for been parents in .
  >   followers , Harry , and Potter was been curse . Harry was up a year ,
  >   Harry was been curse "
EDIT: Updated the RNN output text. Was sampling from a checkpoint file for a different input corpus. Got confused by the long similar-looking filenames. Doesn't change the overall point though.
2 comments

I just can't agree that a simple, linear-time operation like "tokenizing words and basic n-gram models out of them" is a tedious problem like you seem to be implying, nor do I feel a solution to this very-solved problem is "compelling". Word tokenization and n-gram models are simple, unreasonably effective, and very fast. If character-based RNNs do better (albeit far more slowly during training), great, but nothing to see here, let's move along.

As I've posted here before, people have been training character n-gram models, without using neural networks, and getting language modeling performance comparable to that of word-based models for at least a decade. That it works with RNNs is no surprise because it worked just fine with the much more constrained predecessor technology.
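For readers who haven't built one, here is a minimal sketch of the kind of n-gram (Markov chain) model being discussed. It is not the code from the linked post; the function names `build_ngram_model` and `generate` are made up for illustration, and the "tokenizer" is assumed to be a plain whitespace split. The same functions work on words or characters, depending only on how the text is split into tokens.

```python
import random
from collections import defaultdict

def build_ngram_model(tokens, n=2):
    """Map each (n-1)-token context to the list of tokens observed after it.

    Hypothetical helper for illustration; repeated followers are kept so that
    sampling from the list reproduces the observed frequencies.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, seed, n=2, length=20, rng=None):
    """Extend `seed` by sampling one follower at a time from the model.

    `seed` must contain at least n-1 tokens. Stops early if the current
    context was never seen during training.
    """
    rng = rng or random.Random(0)
    out = list(seed)
    for _ in range(length):
        followers = model.get(tuple(out[-(n - 1):]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return out
```

Word-level use would look like `generate(build_ngram_model(text.split(), 3), ["Harry", "was"], n=3)`, and character-level use like `generate(build_ngram_model(list(text), 5), list("Harr"), n=5)`. Training is a single linear pass over the tokens, which is the "simple and very fast" part of the argument above.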

My problem isn't that the feature engineering is expensive or tedious, it's that it's privileging a lot of information that NNs learn from the data. Yeah ok, Markov models (n-grams) are simple and fast and produce good results for generating representative text.

Deep RNNs are simple and produce good results for a huge, diverse range of problems with no new domain information. As Andrej Karpathy wrote:

> Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times.

N-grams don't have nearly the power (e.g., longer-than-N-range structure like grammar) and don't generalize nearly as well, making them a lot less surprising.
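The longer-than-N-range point can be illustrated with a toy sketch. The bracket strings and the helper name `char_ngram_followers` are assumptions chosen for illustration, not anything from the linked post. A character trigram model only ever sees the previous two characters, so it cannot know which closing bracket matches the opener; the context has to grow to cover the whole span before the ambiguity goes away.

```python
from collections import defaultdict

def char_ngram_followers(samples, n):
    """For each (n-1)-character context, collect the set of next characters."""
    followers = defaultdict(set)
    for s in samples:
        for i in range(len(s) - n + 1):
            followers[s[i:i + n - 1]].add(s[i + n - 1])
    return followers

# Toy corpus (hypothetical): the first character determines the last one.
samples = ["(aaaa)", "[aaaa]"]

# Trigram model: after "aa" it has seen "a", ")" and "]", with no way
# to tell which closer belongs to which opener.
trigram = char_ngram_followers(samples, 3)
ambiguous = trigram["aa"]

# Only a context wide enough to include the opener resolves it.
wide = char_ngram_followers(samples, 6)
resolved_round = wide["(aaaa"]
resolved_square = wide["[aaaa"]
```

In this toy case the fix is simply a larger N, but the required N grows with the distance between the dependent characters, and the number of distinct contexts grows with it.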

I'd argue that Karpathy's is much better.

It is learning phrases like "so that you need the" and "that's the one" from scratch. This version looks better at first glance because it uses correctly spelled words, but it repeatedly makes syntactic errors like "for been" and "was been".