Hacker News
by arawde 1690 days ago
This is something I've seen with Copilot when working with market data.

I was writing a unit test in a Go codebase and had dumped the JSON I was going to decode at the top of the file. When I started writing the assertions, Copilot was very quick to use the data from the JSON, with quite high accuracy, based solely on me typing which ticker I was going to assert against.
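
For anyone who hasn't tried this, here is a rough sketch of the setup described above, written in Python rather than Go for brevity. The fixture values, `QUOTES_JSON`, and `load_quotes` are all made up for illustration and are not from the actual codebase.

```python
import json

# Hypothetical fixture dumped at the top of the test file; values are made up.
QUOTES_JSON = """
[
  {"ticker": "AAPL", "price": 150.25, "volume": 1200},
  {"ticker": "MSFT", "price": 301.10, "volume": 800}
]
"""


def load_quotes(raw):
    """Decode the JSON fixture into a dict keyed by ticker."""
    return {quote["ticker"]: quote for quote in json.loads(raw)}


def test_decode_quotes():
    quotes = load_quotes(QUOTES_JSON)
    # Typing the ticker is the cue; the expected values mirror the fixture.
    assert quotes["AAPL"]["price"] == 150.25
    assert quotes["AAPL"]["volume"] == 1200
    assert quotes["MSFT"]["price"] == 301.10


test_decode_quotes()
```

The expected values in the assertions are exactly the ones already sitting in the fixture a few lines up, which is the kind of completion being described.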


I have seen that as well and it is impressive. But this is more than that. The code contained a string variable named accessToken with a value like "218276172612672-jash127hg27128h" (random data here, not the actual id), where 218276172612672 was the user id. When I was testing a function that required a user id, it not only suggested 218276172612672 but also did so in the full context: participant.follow({twitterUserId:"218276172612672"})
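
To make the relationship concrete, here is a minimal Python sketch of the extraction implied by the token above. The function name `user_id_from_token` is hypothetical, and the `<user id>-<random part>` layout is assumed from this one example, not taken from any API documentation.

```python
def user_id_from_token(access_token: str) -> str:
    """Return the part before the first hyphen (assumed to be the user id)."""
    user_id, sep, _rest = access_token.partition("-")
    if not sep or not user_id.isdigit():
        raise ValueError("token does not look like '<digits>-<random>'")
    return user_id


# Placeholder token from the comment above, not a real credential.
access_token = "218276172612672-jash127hg27128h"
print(user_id_from_token(access_token))
```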
Seems reasonable. What's your question? GPT-3 is just that good.
Seems totally reasonable to me too. It has probably just seen the pattern:

```
str = "{{SOME_ID_HERE}}-jash127hg27128h"

participant.follow({twitterUserId:"{{SOME_ID_HERE}}"})
```

That doesn't seem likely. No one would be generating it this way, as the access token is issued after OAuth, and I am unaware of any method to get the second half of the token without the first half. And since the same response that contains the access token also passes the user id, there is no need to extract it from the token.
hm. how many large integer literals are there in your code? it could just be learning that user ids are long strings of digits and is making a guess as to which long string of digits (based on some context, like sharing a line with "id" in it) might be the right one...
But that's the thing: if it had taken the whole string it would be fine, but it extracted the exact right substring. There were 5 others in the same file, none of which I would have been able to distinguish from each other without context.
think of it this way, in the entire corpus of github, how often do you think that there are numeric identifiers that appear near terms like "id" where the numeric part is then used elsewhere with terms like "id" or terms that are frequently found near terms like "id"?

don't get me wrong, it's cool, but these models operate on a token by token basis with sequence context. if they can learn things like matching pairs of parens and quotes in certain contexts, it seems they could certainly learn things like extracting long strings of digits.

now what would be cool would be if they could generate regular expressions for the rules they're learning.
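
as a rough illustration of what such a rule could look like if written out by hand, here is a Python sketch of a regular expression that pulls the leading run of digits out of a token shaped like the one upthread. the pattern and the names `LEADING_ID` and `extract_leading_id` are assumptions for illustration; nothing suggests the model represents the rule this way.

```python
import re

# Hypothetical rule: a run of digits at the start, followed by a hyphen.
LEADING_ID = re.compile(r"^(\d+)-")


def extract_leading_id(token):
    """Return the leading digit run before the first hyphen, or None."""
    match = LEADING_ID.match(token)
    return match.group(1) if match else None


print(extract_leading_id("218276172612672-jash127hg27128h"))
print(extract_leading_id("no-digits-here"))
```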

are you sure that "big number that starts with 2" wasn't just the greatest 32 bit 2s complement signed integer, which is often used for sentinel/testing values?
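
for reference, a quick Python check of the magnitudes involved, using the placeholder id quoted upthread (the real id was not posted, so this says nothing about the actual value):

```python
INT32_MAX = 2**31 - 1  # largest 32-bit two's complement signed integer

placeholder_id = 218276172612672  # placeholder quoted upthread, not the real id

print(INT32_MAX)                   # 2147483647
print(placeholder_id > INT32_MAX)  # True
print(len(str(placeholder_id)))    # 15
```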
No, it was an OAuth access token issued for a test user I created. Nothing special about the token.