Cool work! I'm very interested in this topic. Just wondering, how well does it generalize beyond your training data, rather than just memorizing a strict input-output mapping?
The bootstrap version generalizes with 97% accuracy on a new image. Because the vocabulary is limited, you can train the model overnight. To make the model generalize to all HTML/CSS markup, you need significantly more compute.