| HN Mirror

by godelski 794 days ago

This is not quite accurate, but it's complicated because measurement is hard. The things they are being tested on are almost surely within the dataset. Let's take the bar exam for instance. Sure, we don't know what's in GPT's training data, but we know it has Reddit, and we know Reddit has many similar if not exact questions on it. We know that the first GPT-4 did not have good semantic similarity matching, because they just matched 3 substrings of 50 characters each (Appendix C), and they only consider the false positive side of that. Then there's this line...

  The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
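
For anyone who wants the substring check spelled out, here is a minimal sketch of the kind of matching described above: strip both texts down to letters and digits, sample three 50-character substrings from the eval example (or take the whole thing if it is shorter), and flag a hit if any sample appears in a training document. The function names, the fixed seed, and the toy strings are assumptions made for illustration; this is not OpenAI's code, and details like case handling are guesses.

```python
import random
from typing import List


def normalize(text: str) -> str:
    """Drop spaces and symbols, keeping only letters and digits."""
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(example: str, k: int = 3, length: int = 50,
                      seed: int = 0) -> List[str]:
    """Pick k random substrings of `length` chars from the cleaned example."""
    cleaned = normalize(example)
    if not cleaned:
        return []
    if len(cleaned) <= length:
        return [cleaned]  # short examples are used whole
    rng = random.Random(seed)  # fixed seed (assumption) so runs are repeatable
    starts = [rng.randrange(len(cleaned) - length + 1) for _ in range(k)]
    return [cleaned[s:s + length] for s in starts]


def is_contaminated(eval_example: str, training_docs: List[str], k: int = 3,
                    length: int = 50, seed: int = 0) -> bool:
    """Flag the example if any sampled substring occurs in any cleaned doc."""
    probes = sample_substrings(eval_example, k, length, seed)
    cleaned_docs = [normalize(doc) for doc in training_docs]
    return any(p in doc for doc in cleaned_docs for p in probes)


# Toy inputs (hypothetical, not real training data).
question = ("Given a positive floating point number, "
            "return the decimal part of the number.")
verbatim_doc = ("Interview question: Given a positive floating point number,\n"
                "return the decimal part of the number. Any ideas?")
paraphrase_doc = ("Take a float above zero and give back whatever is left "
                  "once the whole part is stripped off.")

print(is_contaminated(question, [verbatim_doc]))    # True
print(is_contaminated(question, [paraphrase_doc]))  # False
```

The exact copy gets flagged, but a paraphrase of the same question sails straight through, which is the false negative side of this kind of check.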

But my favorite is HumanEval. I'll just remind everyone that this was written by 60 authors, mostly from OpenAI:

  We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. ... __It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.__

The problems? Well, they're leetcode style... Can you tell me you can write leetcode style questions that aren't already on GitHub?

  Human Eval 2

  Prompt:
  def truncate_number(number: float) -> float:
      """ Given a positive floating point number, it can be decomposed into
      and integer part (largest integer smaller than given number) and decimals
      (leftover part always smaller than 1).

      Return the decimal part of the number.
      >>> truncate_number(3.5)
      0.5
      """

  Solution:
  return number % 1.0 

  Human Eval 4

  Prompt:
  from typing import List


  def mean_absolute_deviation(numbers: List[float]) -> float:
      """ For a given list of input numbers, calculate Mean Absolute Deviation
      around the mean of this dataset.
      Mean Absolute Deviation is the average absolute difference between each
      element and a centerpoint (mean in this case):
      MAD = average | x - x_mean |
      >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
      1.0
      """

  Solution:
  mean = sum(numbers) / len(numbers) 
  return sum(abs(x - mean) for x in numbers) / len(numbers)

You really want to bet that that isn't on GitHub? Because I'll bet you any dollar amount you want that there are solutions in near-exact form on GitHub from before their cutoff date. (Don't trust me, you can find them too. They're even searchable.) Hell, I've poisoned the dataset here!

LLMs are (lossy) compression systems, so they're great for information retrieval. And a lot of what we consider intelligence (and possibly even creativity) is based on information retrieval. That doesn't make these things any less impressive; it's just a note on how we should be interpreting results and understanding the limitations of our tools. Measuring intelligence is really difficult. The term isn't universally agreed upon, so people are often talking past one another, and some people conflate different definitions as if they were the same.