
think of it this way: across the entire corpus of github, how often do numeric identifiers appear near a term like "id" and then get reused elsewhere, again near "id" or near terms that commonly show up alongside "id"?

don't get me wrong, it's cool, but these models operate character by character with sequence context. if they can learn to match pairs of parens and quotes in certain contexts, it seems they could certainly learn to extract long strings of digits.

now, what would be cool is if they could generate regular expressions for the rules they're learning.
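to make that concrete, here's a rough sketch of what one such rule might look like if you wrote it by hand. this is not output from any model: the pattern, the 4-digit minimum, and the names `ID_THEN_DIGITS` and `extract_ids` are all made up for illustration.

```python
import re

# hypothetical hand-written rule (an assumption, not something a model produced):
# the token "id", not embedded in a longer word, then up to 3 punctuation or
# space characters, then a run of 4 or more digits.
ID_THEN_DIGITS = re.compile(r"(?<![a-z])id(?![a-z])\W{0,3}(\d{4,})", re.IGNORECASE)


def extract_ids(text):
    """Return every digit string that appears right after an 'id' token."""
    return ID_THEN_DIGITS.findall(text)


sample = 'user_id = 1234567; fetch("/users?id=1234567")'
print(extract_ids(sample))
```

on the sample above it picks up the same identifier at both the assignment and the later use, which is the kind of co-occurrence the model would be seeing over and over in real code.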