	by ghaff 1101 days ago
	I noticed that when I read the paper. I know it's hard to scale, but I'd want to see competent TAs doing the grading. I also found the distribution of courses a bit odd. Some of that might just be individual samples, but intro courses I'd expect to be pretty cookie-cutter (for GPT) were fairly far down the list, and things I'd expect to be really challenging had relatively good results.


raunakchowdhuri 1101 days ago

I can attest that the distribution in the test set we sampled is odd.

We've already run the zero-shot GPT model on all of the data points in the provided test set. We're now going through the process of grading them manually (our whole fraternity is chipping in!) and should have the results out relatively soon.

I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.


mquander 1101 days ago

Since you're here: when I was reading the paper, I wondered what the "zero-shot solve rates" actually mean. Are they basically running the same experiment code, just without the prompts that call `few_shot_response`? That is, are they still trying each question with every expert prefix and every critique? It wasn't clear to me at a glance.