You could encode the input with one-hot decimal digits. Then checking divisibility by 5 is just seeing whether input 0 or 5 of the least significant digit is high. Divisibility by 3 requires checking whether the digit sum is divisible by 3 which you could easily do by connecting a neuron to all inputs, weighing the inputs proportional to the digit they represent and activate if the sum is divisible by 3. For the three digit numbers from 0 to 999 this would require 30 inputs, one neuron to check divisibility by 5, twenty neurons to check divisibility by 3 and four neurons in the output layer to create the desired outputs. You could reduce the number of neurons a bit if you added another layer and did the divisibility by 3 in one additional step, I think twelve would do if you subtracted 10 or 20 with four neurons and just checked for 0, 3, 6 and 9 with eight neurons. You could also choose the input encoding to be one-hot base 15 digits which would make the net trivial and offload the hard part into the input encoding. Whether you could easily arrive at that design by training is of course a whole different question.
I suspect that there is no way to predict in advance the smallest model which can represent any given function. However, we should be able to at least develop an intuition for depth and breadth that is sufficient on any given problem. Classically, division circuits are large and complex, so I naively expect a NN built out of add, mul, and Heaviside(x)*x to also be large and complex.
As with many models, I suspect that this network really learned some other property that has almost-but-not-quite the same pattern as divisible-by-N.