I'm glad you found it valuable! Both are good questions and I haven't gone far enough mapping the code to Elman's architecture to know the answer to the second.

For your first question, using three hidden layers makes it a little clearer what the network does. Each layer performs one step of the calculation. The first layer collects what is known from the current token and what we knew after the calculation for the previous token. The second layer decides whether the current token looks like program code, by checking if it satisfies the decision rule. The third layer compares the decision with what we decided for previous tokens.
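To make the three steps concrete, here is a rough sketch in plain Python. It is not the code under discussion: the two token features (`has_brace`, `has_semicolon`), all the weights, and the way the third layer blends the new decision with the previous one are assumptions I picked purely for illustration.

```python
def relu(values):
    """Element-wise ReLU over a list of numbers."""
    return [max(0.0, v) for v in values]


def linear(weights, bias, inputs):
    """Dense layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def step(token, prev_state):
    """Process one token. `token` is [has_brace, has_semicolon] (hypothetical features)."""
    # Layer 1: collect the current token's features and the previous state.
    h1 = relu(linear([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, 0],
                     token + [prev_state]))
    # Layer 2: decision rule (assumed): the token looks like code only if
    # both features are present. The previous state is passed through unchanged.
    h2 = relu(linear([[1, 1, 0], [0, 0, 1]], [-1, 0], h1))
    # Layer 3: compare with earlier decisions (assumed): average the new
    # decision with the state carried over from previous tokens.
    h3 = relu(linear([[0.5, 0.5]], [0.0], h2))
    return h3[0]


def run(tokens):
    """Feed tokens one at a time, carrying the state forward as in a recurrent net."""
    state, states = 0.0, []
    for token in tokens:
        state = step(token, state)
        states.append(state)
    return states
```

With these made-up weights, `run([[1, 1], [1, 1], [0, 0]])` gives `[0.5, 0.75, 0.375]`: the state climbs while tokens look like code and decays once they stop.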

I think this could be compressed into a single hidden layer, too. A ReLU should capture the non-linearity well enough for that to work.