Hacker News
by nsagent 3 hours ago
Do you also think LLM leaderboards accurately reflect the capabilities of the models being tested? If you do, then I can easily point you to numerous academic papers pointing out the various flaws in many leaderboards (from poorly designed benchmarks like bAbI and the original SQuAD, to data contamination, and more).

In that same way, any test, including the SAT and GRE, has flaws. They can be gamed in ways similar to LLM leaderboards: test prep makes you better at them. That's one of the main reasons universities moved away from the SAT; they were afraid that it disenfranchised lower socioeconomic status students (and it does to some degree). The issue is that the test is positively correlated with success in an undergraduate program, so they threw out the baby with the bathwater. The real issue is that the SAT is not able to distinguish the capabilities among students to the degree it purports to.

And if you want an anecdote to match all yours, the first time I took a GRE practice test, I got a 3 on the writing. Not because I'm poor at writing, but because I didn't really know what they were looking for. After reading a test prep book, I got a 4.5 on my next practice test and a 5 on my final practice test. When I finally took the actual GRE, I got a 6 on the analytical writing. Trust me, nothing changed in my writing ability over that time. In fact, I didn't even practice the skill except through those three practice tests. Clearly the test was not capable of determining my real ability to make an argument; it merely tested my ability to adapt my writing to what was supposedly being tested.

Interestingly, the vast majority of universities that got rid of the GRE requirements for PhD programs are not going back on that. Turns out that the students with the highest GRE scores are the ones most likely to drop out of their STEM PhD. [1]

[1]: https://journals.plos.org/plosone/article?id=10.1371/journal...

2 comments

I took the GREs, and I don't recall a writing section.

Anyhow, the questions were all about freshman engineering knowledge.

There are three major parts of the modern GRE: Verbal, Quantitative, and Analytical Writing. You could easily look that up, or ask if you didn't know.

Responding off the cuff without any reflection on the comment you're responding to doesn't move the conversation forward in any meaningful way. It just comes across as disrespectful.

Do you think that LLM leaderboards don’t? Do you think a Llama 3 is going to beat an Opus 4.7 on any leaderboard?

The real issue is that standardized tests disenfranchise lower SES students less than any other metric.

Everyone who takes the SAT has to sit in the same room for the same amount of time answering the same questions. You can’t just pay someone else to take it for you (like essays) or select which difficulty level you take (like going to a prep school with grade inflation), or luck out in who your parents know (like recommendation letters).

Some may have better opportunities to learn the material, but, at the end of the day, you have to actually learn the material. There’s no getting around that.

As your own GRE anecdote shows: A little studying with some inexpensive books makes all the difference. Unless things have radically changed, a couple of SAT or GRE test prep books are significantly less expensive than just one college textbook.

Bluntly, SATs are better correlated with college performance than other measures for the reasons I mentioned. They strip away most of the privilege of coming from a high-SES family.