| HN Mirror


	by arawde 1690 days ago
	This is something I've seen with Copilot and market data. I was writing a unit test in a Go codebase and had dumped the JSON I was going to decode at the top of the file. When I started writing the assertions, Copilot was very quick to use the data from the JSON, with quite high accuracy, based solely on my typing which ticker I was going to assert against.

harshitaneja 1690 days ago

I have seen that as well, and it is impressive. But this is more than that. The code contained a string variable named accessToken with a value like "218276172612672-jash127hg27128h" (random data here, not the actual id), where 218276172612672 was the user id. When I was testing a function that required a user id, it not only suggested 218276172612672 but did so with the full context: participant.follow({twitterUserId:"218276172612672"})
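
For concreteness, here is a rough JavaScript sketch of the relationship described above, assuming the token always has the shape `<userId>-<secret>`. The token value is the placeholder from this comment, not a real credential, and `extractUserId` is a hypothetical helper name, not something from the original code.

```javascript
// Hypothetical helper: assumes the token looks like "<userId>-<secret>".
function extractUserId(accessToken) {
  const dash = accessToken.indexOf("-");
  if (dash <= 0) {
    throw new Error("token does not have the expected <userId>-<secret> shape");
  }
  const prefix = accessToken.slice(0, dash);
  // Assumption: user ids are purely numeric.
  if (!/^\d+$/.test(prefix)) {
    throw new Error("user id prefix is not numeric");
  }
  return prefix;
}

// Placeholder value from the comment, not a real token.
const accessToken = "218276172612672-jash127hg27128h";
const twitterUserId = extractUserId(accessToken);
console.log(twitterUserId); // 218276172612672
```

The id is kept as a string because that is how it appears in the suggested call, `participant.follow({twitterUserId:"218276172612672"})`.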

woah 1690 days ago

Seems reasonable. What's your question? GPT-3 is just that good.

kevinsundar 1690 days ago

Seems totally reasonable to me too. It probably has just seen the pattern

```
str = "{{SOME_ID_HERE}}-jash127hg27128h"

participant.follow({twitterUserId:"{{SOME_ID_HERE}}"})
```

harshitaneja 1690 days ago

This doesn't seem likely. No one would generate it this way, since the access token is issued after OAuth and I am unaware of any method to get the second half of the token without the first half. And the same response that contains the access token also passes the user id, so there is no need to extract it from the token.

a-dub 1690 days ago

hm. how many large integer literals are there in your code? it could just be learning that user ids are long strings of digits and is making a guess as to which long string of digits (based on some context, like sharing a line with "id" in it) might be the right one...

harshitaneja 1690 days ago

But that's the thing: if it had taken the whole string it would be fine, but it extracted the exact right substring. There were 5 others in the same file, none of which I would have been able to distinguish from each other without context.

a-dub 1690 days ago

think of it this way, in the entire corpus of github, how often do you think that there are numeric identifiers that appear near terms like "id" where the numeric part is then used elsewhere with terms like "id" or terms that are frequently found near terms like "id"?

don't get me wrong, it's cool, but these models operate on a character by character basis with sequence context. if they can learn things like matching pairs of parens and quotes in certain contexts, it seems they could certainly learn things like extracting long strings of digits.

now what would be cool would be if they could generate regular expressions for the rules they're learning.
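
as a rough illustration of what such a rule might look like if written out by hand (not anything the model actually produces), here is a JavaScript sketch. the 10-digit threshold, the names `LONG_DIGIT_RUN` and `findIdCandidates`, and the second token in the sample are all made up for the example.

```javascript
// Assumed rule: a "long" numeric id is a run of 10 or more digits.
const LONG_DIGIT_RUN = /\d{10,}/g;

// Hypothetical helper: returns every long digit run found in a piece of source text.
function findIdCandidates(source) {
  return source.match(LONG_DIGIT_RUN) ?? [];
}

// The first token is the placeholder from this thread; the second is invented.
const source = `
const accessToken = "218276172612672-jash127hg27128h";
const otherToken = "918273645546372-kq81hs72jd9a";
`;

// Finds two candidates: 218276172612672 and 918273645546372.
console.log(findIdCandidates(source));
```

note that a rule like this only narrows things down to candidates. picking the right one out of several still needs the surrounding context.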

a-dub 1690 days ago

are you sure that "big number that starts with 2" wasn't just the greatest 32 bit 2s complement signed integer, which is often used for sentinel/testing values?
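
for reference, a quick JavaScript check of that sentinel value against the placeholder id quoted upthread (which is stated to be random data, not the real id). the variable names are only for illustration.

```javascript
// Greatest 32-bit two's complement signed integer.
const INT32_MAX = 2 ** 31 - 1;
console.log(INT32_MAX); // 2147483647

// Placeholder id from the thread, not the real one.
const placeholderId = 218276172612672;

console.log(placeholderId > INT32_MAX); // true
console.log(Number.isSafeInteger(placeholderId)); // true
```

so the placeholder at least is far larger than the 32-bit maximum, though it still fits in a JavaScript number without losing precision.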

harshitaneja 1690 days ago

No, it was an OAuth access token issued for a test user I created. Nothing special about the token.
