I haven't looked at the code, but glancing at the results leaves me thinking it might need more work.
The output seems to me around the level a Markov chain might produce. Karpathy's RNN code produces much, much better results[1].
I wonder if manually extracting features and training the RNN on that is a mistake? RNNs tend to work well on text because they encode understanding of the parse tree themselves.
This doesn't look like a neural net to me. From NeuralNetwork.py:
    from sklearn.neighbors import KNeighborsClassifier
    # Create a separate neural network for each identifier
    for index in range(0, len(NaturalLanguageObject._Identifiers)):
        nn = KNeighborsClassifier()
        self._Networks.append(nn)
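For anyone unsure why a k-nearest-neighbours classifier isn't a neural net, here is a minimal standard-library sketch of the idea. It is not scikit-learn's implementation, and the function name `knn_predict` and the toy data are made up for illustration. The point is that nothing is trained: the "model" is the stored data plus a majority vote.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs. There are no weights
    and no training loop: prediction just searches the stored examples.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters in 2-D.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
print(knn_predict(train, (0.15, 0.15)))
print(knn_predict(train, (0.95, 1.0)))
```

A neural network, by contrast, fits weights by gradient descent and throws the training data away at prediction time.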
I was confused about the difference between an SVM and a neural network. Easy mistake, I guess. The whole goal of this project was educational, so I'm still happy with the outcome.
OMG, how embarrassing for OP. This is what scares me about technical blogging. Messing up unknowingly in an area I'm not experienced in and getting scathing critiques from my fellow hackers. Keep your chin up OP and next time remember to do your homework. +1 for the effort anyways.
If you want to learn neural nets, check out Karpathy's class (cs231n.github.io) and do the assignments. Making a GitHub repo and an HN post about using neural networks is false self-advertising and delegitimizes those of us who know what we are talking about.
I've glanced at the code now. You are absolutely correct.
Wow.
There are some good reasons to think this approach won't work at all. If I understand it correctly, it is attempting to predict part of speech using previously observed values.
That's an interesting idea, and might be somewhat valuable as a feature to use in a text generator, but on its own won't be enough to ever generate sentences that make sense (because some specific sequences just don't make sense).
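To make that concrete, here is a rough sketch of what "predict the next part of speech from previously observed values" could look like as a plain bigram count over tags. This is my guess at the general idea, not the author's code; the tag names, toy sequences and function names are all hypothetical.

```python
from collections import Counter, defaultdict

def train_tag_bigrams(tag_sequences):
    """Count how often each tag follows each other tag."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next_tag(counts, prev):
    """Return the most frequent tag seen after `prev`, or None if unseen."""
    following = counts.get(prev)
    if not following:
        return None
    return following.most_common(1)[0][0]

# Hand-tagged toy sentences (made up for illustration).
tagged = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ["PRON", "VERB", "DET", "NOUN"],
]
counts = train_tag_bigrams(tagged)
print(predict_next_tag(counts, "DET"))
print(predict_next_tag(counts, "VERB"))
```

A model like this can produce a plausible tag skeleton such as DET NOUN VERB DET NOUN, but any noun fits a NOUN slot, so nothing stops it from emitting a grammatical sentence that means nothing.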
I am afraid this author has no idea what he is doing and is loosely throwing around terms he does not understand. What the hell was his normalization procedure? This is dangerous to readers who do not know a lot and will get confused while reading.
Fun hack. If anything, it highlights how compelling deep learning and RNNs are: no messing with NLP, no messing with building other features or adding up classifiers, etc. The manual feature engineering means it might work better on a smaller dataset, but even then probably not.
For comparison with Andrej Karpathy's RNN code (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) training on the "HarryPotter(xxlarge).txt" (76K) file using the default hyperparameters and a batch size of 25 gets me:
> But Atfa the loom proset! No contarin — mibll,’s just pucking to live
> note left them hard and fitther, clooked of course little happered to
> trige on the fistpened. Their knew Harry mear from the shind-beas
> eveided, at Uncle Vernon’s thepped to spept were pelled and beadn
> Harry, distine dy use. Harry had in a amalout, into the fish sfary door.
The difference here is tokenizing on words vs letters: the RNN code is trying to learn the structure of English from completely zero whereas the code here gets to work with well-formed words from the beginning. But otherwise, the results in the linked post are about as silly semantically:
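A quick sketch of the two tokenizations, in case the distinction isn't obvious. The regex and function names are my own and are not taken from either project.

```python
import re

def word_tokens(text):
    """Split into words (keeping apostrophes) and standalone punctuation."""
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)

def char_tokens(text):
    """Every character, including spaces, is its own token."""
    return list(text)

sample = "Harry don't look"
print(word_tokens(sample))  # ['Harry', "don't", 'look']
print(char_tokens(sample))
```

With word tokens the model can never misspell anything, because it only ever picks from a vocabulary of whole words. With character tokens the vocabulary is tiny, but spelling, spacing and punctuation all have to be learned.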
> Input: "Harry don't look"
> Output: "Harry don't look , incredibly that a year for been parents in .
> followers , Harry , and Potter was been curse . Harry was up a year ,
> Harry was been curse "
EDIT: Updated the RNN output text. Was sampling from a checkpoint file for a different input corpus. Got confused by the long similar-looking filenames. Doesn't change the overall point though.
I just can't agree that a simple, linear-time operation like "tokenizing words and basic n-gram models out of them" is a tedious problem like you seem to be implying, nor do I feel a solution to this very-solved problem is "compelling". Word tokenization and n-gram models are simple, unreasonably effective, and very fast. If character-based RNNs do better (albeit far more slowly during training), great, but nothing to see here, let's move along.
As I've posted here before, people have been training character n-gram models and getting language modeling performances comparable to those from word-based models---without using neural networks---for at least a decade. That it works with RNNs is no surprise because it worked just fine with the much more constrained predecessor technology.
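For reference, a character n-gram generator fits in a couple of dozen lines. This is only a minimal sketch with no smoothing or backoff, so it is far from the models I'm referring to; the toy corpus and function names are made up, and the seed is assumed to be exactly n-1 characters long.

```python
import random
from collections import Counter, defaultdict

def train_char_ngrams(text, n=3):
    """Count which character follows each (n-1)-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, seed, length=40, rng=None):
    """Extend `seed` by sampling followers in proportion to their counts.

    Assumes len(seed) == n - 1 for the n used in training.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:
            break  # context never seen in training; stop early
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# Toy corpus (hypothetical); a real model needs far more text.
corpus = "harry looked at the door. the door looked at harry. "
model = train_char_ngrams(corpus, n=3)
print(generate(model, "ha"))
```

Every trigram in the output was seen in training, which is why the text looks locally plausible and why anything with a range longer than n is out of reach.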
My problem isn't that the feature engineering is expensive or tedious, it's that it's privileging a lot of information that NNs learn from the data. Yeah ok, Markov models (n-grams) are simple and fast and produce good results for generating representative text.
Deep RNNs are simple and produce good results for a huge, diverse range of problems with no new domain information. As Andrej Karpathy wrote:
> Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times.
N-grams don't have nearly the power (e.g. longer-than-N-range structure like grammar) and don't generalize nearly as well, making them a lot less surprising.
The char-RNN is learning phrases like "so that you need the" and "that's the one" from scratch. This version looks better at first glance because it is using correctly spelled words, but it repeatedly makes syntactic errors like "for been" and "was been".
What you say is true, although I'm not sure it matters. There are heaps of art forms which are out of copyright, so why should people critique or parody new films? Because they're relevant and people understand the content.
> I decided to use scikit's machine learning libraries. [...] The writer I create uses multiple SVM engines. One large neural network for the sentence structuring and multiple small networks for the algorithm which selects words from a vocabulary.
This person has no idea what they're talking about. sklearn has no neural network code whatsoever.
EDIT: this feels like a testament to sklearn's greatness, honestly.
I'd be interested to know if this could be turned into a tool that lets you know how well your writing (or coding) matches the "house style". (Mostly for technical documentation, requirements specs etc...)
I'd be even more interested if it could be turned into a sublime text plugin that highlights words / phrases that deviate most strongly from the house style.
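Here is roughly what I have in mind, as a sketch only: build a word bigram model from existing house-style documents, then rank the word pairs in a new draft by how surprising they are. The class name `StyleModel`, the add-one smoothing and the sample sentences are my own assumptions and have nothing to do with the linked project.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())

class StyleModel:
    """Add-one-smoothed word bigram model built from house-style text."""

    def __init__(self, house_text):
        tokens = tokenize(house_text)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen words

    def surprise(self, prev, word):
        """Negative log-probability of `word` following `prev`."""
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab_size
        return -math.log(num / den)

    def flag(self, text, top=3):
        """Return the `top` most surprising (score, prev, word) pairs."""
        tokens = tokenize(text)
        scored = [(self.surprise(p, w), p, w) for p, w in zip(tokens, tokens[1:])]
        return sorted(scored, reverse=True)[:top]

# Hypothetical house-style text and a draft to check against it.
house = "The system shall log every request. The system shall reject invalid input."
style = StyleModel(house)
for score, prev, word in style.flag("The system kinda logs stuff"):
    print(f"{score:.2f}  {prev} {word}")
```

An editor plugin would then only need to map the flagged pairs back to character offsets and highlight them.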
Good idea. I guess that was my main goal really: learning the structure of some text. The vocabulary generator was a pretty recent add-in, which is why it is quite inaccurate.
[1] https://github.com/karpathy/char-rnn