Hacker News
Show HN: Neural network that impersonates writers (github.com)
48 points by jacob_plaster 3938 days ago
9 comments

I haven't looked at the code, but glancing at the results leaves me thinking it might need more work.

The output seems to me around the level a Markov chain might produce. Karpathy's RNN code produces much, much better results[1].

I wonder if manually extracting features and training the RNN on that is a mistake? RNNs tend to work well on text because they encode understanding of the parse tree themselves.

[1] https://github.com/karpathy/char-rnn
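
For reference, the Markov chain baseline mentioned above can be sketched in a few lines. This is a minimal word-level version, not code from the linked repo or from char-rnn; the helper names `build_chain` and `generate` and the toy corpus are made up for illustration.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, seed_state, length=20, rng=None):
    """Walk the chain from `seed_state`, stopping early at a dead end."""
    rng = rng or random.Random()
    out = list(seed_state)
    state = tuple(seed_state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # this state was never followed by anything in the corpus
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Toy corpus, invented for this sketch.
corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
sample = generate(chain, ("the",), length=8, rng=random.Random(0))
print(sample)
```

Every adjacent word pair in the output was seen in the corpus, which is why such text looks locally plausible and globally aimless.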

this doesn't look like a neural net to me. from NeuralNetwork.py

  from sklearn.neighbors import KNeighborsClassifier
  # Create a separate neural network for each identifier
  for index in range(0, len(NaturalLanguageObject._Identifiers)):
      nn = KNeighborsClassifier()
      self._Networks.append(nn)
So then it seems like the author of the code doesn't understand that "NN" means "Nearest Neighbors" and not "Neural Network"?
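
To make the distinction concrete, here is a rough pure-Python sketch of what a k-nearest-neighbors classifier does. It is not sklearn's implementation; the class name `TinyKNN` and the toy points are hypothetical. The thing to notice is that "fitting" only stores the samples, so there are no layers or weights being trained.

```python
import math
from collections import Counter

class TinyKNN:
    """Hypothetical stand-in for a k-nearest-neighbors classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only memorizes the data; no weights are learned.
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        # Sort stored samples by Euclidean distance, then vote among the k closest.
        nearest = sorted(zip(self.X, self.y), key=lambda pair: math.dist(pair[0], x))
        votes = Counter(label for _, label in nearest[:self.k])
        return votes.most_common(1)[0][0]

# Toy 2-D points, invented for this sketch.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
knn = TinyKNN(k=3).fit(X, y)
near_origin = knn.predict_one((0.5, 0.5))
near_far = knn.predict_one((5.5, 5.5))
print(near_origin, near_far)
```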

He mentions that he used sklearn's Neural Network libraries in his blog post, but sklearn doesn't have any aside from RBM.

It is almost the perfect example of the "Danger Zone" in http://drewconway.com/zia/2013/3/26/the-data-science-venn-di...
I was confused with the difference between an SVM and a neural network. Easy mistake I guess. The whole goal of this project was for educational purposes, so I'm still happy with the outcome.
OMG, how embarrassing for OP. This is what scares me about technical blogging. Messing up unknowingly in an area I'm not experienced in and getting scathing critiques from my fellow hackers. Keep your chin up OP and next time remember to do your homework. +1 for the effort anyways.
if you want to learn neural nets check out Karpathy's class (cs231n.github.io) and do the assignments. making a github repo and HN post about using neural networks is false self-advertising and illegitimatizes those of us who know what we are talking about.
"Easy mistake I guess"

What the hell.

Sorry to beat this issue to death, but this is not an easy mistake at all, even given only a cursory understanding of the field.
Should I be impressed that, even though the author was not using the tool they thought they were, they still got it to spit out something?

it must be a robust tool.

I've glanced at the code now. You are absolutely correct.

Wow.

There are some good reasons to think this approach won't work at all. If I understand it correctly, I think it is attempting to predict part of speech using previously observed values.

That's an interesting idea, and might be somewhat valuable as a feature to use in a text generator, but on its own won't be enough to ever generate sentences that make sense (because some specific sequences just don't make sense).
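
A rough sketch of that failure mode, assuming the approach boils down to "predict the next part-of-speech tag, then pick a word with that tag". This is a guess at the general idea, not the repo's code; the hand-tagged sentences and the `generate` helper are invented for illustration.

```python
import random
from collections import defaultdict

# Hand-tagged toy sentences as (word, tag) pairs; invented for this sketch.
tagged = [
    [("Harry", "NOUN"), ("opened", "VERB"), ("the", "DET"), ("door", "NOUN")],
    [("the", "DET"), ("owl", "NOUN"), ("ate", "VERB"), ("the", "DET"), ("toast", "NOUN")],
]

next_tags = defaultdict(list)  # tag -> tags observed right after it
vocab = defaultdict(list)      # tag -> words observed with that tag
for sent in tagged:
    for word, tag in sent:
        vocab[tag].append(word)
    for (_, a), (_, b) in zip(sent, sent[1:]):
        next_tags[a].append(b)

def generate(start_tag, length, rng):
    """Pick a tag sequence first, then fill each slot with any word of that tag."""
    tags = [start_tag]
    while len(tags) < length and next_tags[tags[-1]]:
        tags.append(rng.choice(next_tags[tags[-1]]))
    return [(rng.choice(vocab[t]), t) for t in tags]

sentence = generate("NOUN", 4, random.Random(1))
print(" ".join(word for word, _ in sentence))
```

The tag sequence always comes out as a well-formed NOUN VERB DET NOUN, but nothing prevents the word slots from being filled as "toast opened the owl": the model fits at the tag level and is unconstrained at the word level.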

I am afraid this author has no idea what he is doing - and is loosely throwing around terms he does not understand. What the hell was his normalization procedure? This is dangerous to readers who do not know a lot and will get confused while reading.
I ran a Markov chain text generator on Finnegans Wake once. It came out looking much the same. :-)
Fun hack. If anything, it highlights how compelling deep learning and RNNs are: no messing with NLP, no messing with building other features or adding up classifiers, etc. The manual feature engineering means it might work better on a smaller dataset, but even then probably not.

For comparison with Andrej Karpathy's RNN code (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) training on the "HarryPotter(xxlarge).txt" (76K) file using the default hyperparameters and a batch size of 25 gets me:

  > But Atfa the loom proset! No contarin — mibll,’s just pucking to live
  > note left them hard and fitther, clooked of course little happered to
  > trige on the fistpened. Their knew Harry mear from the shind-beas
  > eveided, at Uncle Vernon’s thepped to spept were pelled and beadn
  > Harry, distine dy use. Harry had in a amalout, into the fish sfary door.
The difference here is tokenizing on words vs letters: the RNN code is trying to learn the structure of English from completely zero whereas the code here gets to work with well-formed words from the beginning. But otherwise, the results in the linked post are about as silly semantically:

  > Input: "Harry don't look"
  > Output: "Harry don't look , incredibly that a year for been parents in .
  >   followers , Harry , and Potter was been curse . Harry was up a year ,
  >   Harry was been curse "
EDIT: Updated the RNN output text. Was sampling from a checkpoint file for a different input corpus. Got confused by the long similar-looking filenames. Doesn't change the overall point though.
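
The word-versus-letter difference described above can be illustrated with a small sketch. The sample string is adapted from the quoted output, and the whitespace split is a simplifying assumption rather than what either project actually does.

```python
text = "Harry don't look, Harry was up a year."

# Character-level: every symbol, including spaces and punctuation, is a token.
char_tokens = list(text)

# Word-level: naive whitespace split (a real tokenizer would also split off punctuation).
word_tokens = text.split()

print(len(char_tokens), "character tokens, e.g.", char_tokens[:5])
print(len(word_tokens), "word tokens, e.g.", word_tokens[:3])
```

A character model has to get five predictions in a row right just to emit "Harry", whereas a word model can only ever emit words it has already seen, so its output is spelled correctly by construction.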
I just can't agree that a simple, linear-time operation like "tokenizing words and basic n-gram models out of them" is a tedious problem like you seem to be implying, nor do I feel a solution to this very-solved problem is "compelling". Word tokenization and n-gram models are simple, unreasonably effective, and very fast. If character-based RNNs do better (albeit far more slowly during training), great, but nothing to see here, let's move along.

As I've posted here before, people have been training character n-gram models and getting language modeling performances comparable to those from word-based models---without using neural networks---for at least a decade. That it works with RNNs is no surprise because it worked just fine with the much more constrained predecessor technology.

My problem isn't that the feature engineering is expensive or tedious, it's that it's privileging a lot of information that NNs learn from the data. Yeah ok, Markov models (n-grams) are simple and fast and produce good results for generating representative text.

Deep RNNs are simple and produce good results for a huge, diverse range of problems with no new domain information. As Andrej Karpathy wrote:

> Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times.

N-grams don't have nearly the power (eg longer-than-N-range structure like grammar) and don't generalize nearly as well, making them a lot less surprising.
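
The longer-than-N limitation can be shown on a toy case. This is a minimal sketch with a hypothetical helper `train_ngram` and made-up strings: the correct closing bracket depends on an opening bracket that falls outside a trigram's two-character context.

```python
from collections import Counter, defaultdict

def train_ngram(samples, n):
    """Count which character follows each (n-1)-character context."""
    counts = defaultdict(Counter)
    for s in samples:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n - 1]][s[i + n - 1]] += 1
    return counts

# Toy strings, invented for this sketch.
samples = ["(aaaa)", "[aaaa]"]

# A trigram model only sees "aa" and cannot tell which bracket was opened.
trigram = train_ngram(samples, 3)
closers_seen_by_trigram = set(trigram["aa"]) - {"a"}

# A 6-gram model sees the opening bracket and is no longer ambiguous.
sixgram = train_ngram(samples, 6)
closers_after_paren = set(sixgram["(aaaa"])

print(closers_seen_by_trigram, closers_after_paren)
```

Widening the window fixes this particular case, but the number of distinct contexts grows quickly with N and most of them are never observed, while a recurrent model's state is not tied to a fixed window.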

I'd argue that Karpathy's is much better.

It is learning phrases like "so that you need the" and "that's the one" from scratch. This version looks better at first glance because it is using correctly spelled words, but it repeatedly makes syntactic errors like "for been" / "was been".

Have you considered the copyright on the Harry Potter training data?
fair use - education
I would still be cautious. There is no need to use exactly this text.

Anyway, there are countries that don't have a fair use policy. So in these countries your repository could not legally be used.

IMHO, this is an unnecessary use of copyrighted material when there are thousands of equally well suited texts that have fallen out of copyright.

What you say is true - although I'm not sure it matters. There are heaps of art forms which are out of copyright, so why should people critique/parody new films? Because they're relevant and people understand the content.
> I decided to use scikit's machine learning libraries. [...] The writer I create uses multiple SVM engines. One large neural network for the sentence structuring and multiple small networks for the algorithm which selects words from a vocabulary.

This person has no idea what they're talking about. sklearn has no neural network code whatsoever.

EDIT: this feels like a testament to sklearn's greatness, honestly.

I'd be interested to know if this could be turned into a tool that lets you know how well your writing (or coding) matches the "house style". (Mostly for technical documentation, requirements specs etc...)

I'd be even more interested if it could be turned into a sublime text plugin that highlights words / phrases that deviate most strongly from the house style.
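
One very simple way such a checker could start, sketched under heavy assumptions: treat the "house style" as a reference corpus and flag draft words that never appear in it. The function names, the `min_count` threshold, and both example texts are hypothetical, and this is not something the linked project does.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def flag_deviations(house_text, draft, min_count=1):
    """Return draft words seen fewer than `min_count` times in the house corpus."""
    house_counts = Counter(tokenize(house_text))
    return [w for w in tokenize(draft) if house_counts[w] < min_count]

# Example texts, invented for this sketch.
house = "The system shall log each request. The system shall reject invalid input."
draft = "The system should probably log each request."
flagged = flag_deviations(house, draft)
print(flagged)
```

An editor plugin would map the flagged words back to their positions in the buffer; scoring phrases rather than single words would need n-gram counts instead of a plain word count.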

Good idea. I guess that was my main goal really, learning the structure of some text. The vocabulary generator was a pretty recent add-on, which is why it is quite inaccurate.
This is brilliant! I tried it out. Waiting for a larger data set! +1