Stupid question: We are trying to train the network how to calculate modulus of division. Of a binary encoding. Why should we expect this to be learnable in only two layers?
You could encode the input with one-hot decimal digits. Then checking divisibility by 5 is just seeing whether input 0 or 5 of the least significant digit is high. Divisibility by 3 requires checking whether the digit sum is divisible by 3 which you could easily do by connecting a neuron to all inputs, weighing the inputs proportional to the digit they represent and activate if the sum is divisible by 3. For the three digit numbers from 0 to 999 this would require 30 inputs, one neuron to check divisibility by 5, twenty neurons to check divisibility by 3 and four neurons in the output layer to create the desired outputs. You could reduce the number of neurons a bit if you added another layer and did the divisibility by 3 in one additional step, I think twelve would do if you subtracted 10 or 20 with four neurons and just checked for 0, 3, 6 and 9 with eight neurons. You could also choose the input encoding to be one-hot base 15 digits which would make the net trivial and offload the hard part into the input encoding. Whether you could easily arrive at that design by training is of course a whole different question.
I suspect that there is no way to predict in advance the smallest model which can represent any given function. However, we should be able to at least develop an intuition for depth and breadth that is sufficient on any given problem. Classically, division circuits are large and complex, so I naively expect a NN built out of add, mul, and Heaviside(x)*x to also be large and complex.
As with many models, I suspect that this network really learned some other property that has almost-but-not-quite the same pattern as divisible-by-N.
Technically, it's not two layers. It's only two layers of parameters. A RNN runs the two layers repeatedly at each sequential input. So if my input/output is like '1000~>fizz', it can be seen as a feedforward NN with at least 8 layers. (This is how it's possible to train a RNN in the first place: it gets 'unrolled' over time and turned into a big NN.)
The bit string representation actually isn't that bad for modulo computations: 2^n mod 3 is just 1 + (n % 2), and 2^n mod 5 is just {1,2,4,3}[n % 4].
Perfectly learnable to add them all up with the right weights? Given we're doing mod N, negative weights fit in naturally. So, pretty good seems plausible.