by nsagent
3 hours ago
Do you also think LLM leaderboards accurately reflect the capabilities of the models being tested? If you do, then I can easily point you to numerous academic papers pointing out the various flaws in many leaderboards (from poorly designed benchmarks like bAbI and the original SQuAD, to data contamination, and more).

In the same way, any test, including the SAT and GRE, has flaws. They can be gamed in ways similar to LLM leaderboards: test prep makes you better at them. That's one of the main reasons universities moved away from the SAT; they were afraid that it disenfranchised students of lower socioeconomic status (and it does to some degree). The issue is that the test is positively correlated with success in an undergraduate program, so they threw out the baby with the bathwater. The real issue is that the SAT is not able to distinguish the capabilities among students to the degree it purports to.

And if you want an anecdote to match all yours, the first time I took a GRE practice test, I got a 3 on the writing. Not because I'm poor at writing, but because I didn't really know what they were looking for. After reading a test prep book, I got a 4.5 on my next practice test and a 5 on my final practice test. When I finally took the actual GRE, I got a 6 on the analytical writing. Trust me, nothing changed in my writing ability over that time. In fact, I didn't even practice the skill except through those three practice tests. Clearly the test was not capable of determining my real ability to make an argument; it merely tested my ability to adapt my writing to what was supposedly being tested.

Interestingly, the vast majority of universities that got rid of the GRE requirements for PhD programs are not going back on that. Turns out that the students with the highest GRE scores are the ones most likely to drop out of their STEM PhD. [1]

[1]: https://journals.plos.org/plosone/article?id=10.1371/journal...
|
Anyhow, the questions were all about freshman engineering knowledge.