by throwup238 654 days ago
I don't think that Mercury Prize table is a representative example because each column has an obviously unique structure that the LLM can key in on: (year) (Single Artist/Album pair) (List of Artist/Album pairs) (image) (citation link)

I think a much better test would be something like "List of elements by atomic properties" [1], which has a lot of adjacent numbers in a similar range and overlapping first/last column types. However, the danger with that table is that the values might be easy for the LLM to infer just from the element names, since they're well-known physical constants. The table of countries by population density [2] or the list of largest cities [3] might be less predictable.

The test should be repeated with every available sorting function too, to see if that causes any new errors.

[1] https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...

[2] https://en.wikipedia.org/wiki/List_of_countries_and_dependen...

[3] https://en.wikipedia.org/wiki/List_of_largest_cities#List
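
A rough sketch of the sort-order idea, in Python. It assumes the table has already been extracted into a header plus rows of strings. The sample rows are made up, and `sorted_variants` is a hypothetical helper, not part of the original test code.

```python
def sorted_variants(header, rows):
    """Yield (label, rows) for every column, sorted ascending and descending."""
    def key_for(col):
        def key(row):
            cell = row[col].replace(",", "")
            try:
                # Numeric cells sort by value and come before text cells.
                return (0, float(cell), "")
            except ValueError:
                return (1, 0.0, cell.lower())
        return key

    for col, name in enumerate(header):
        for reverse in (False, True):
            label = f"{name} {'desc' if reverse else 'asc'}"
            yield label, sorted(rows, key=key_for(col), reverse=reverse)


# Made-up sample data, only to show the shape of the input.
header = ["City", "Country", "Population"]
rows = [
    ["Avalon", "Freedonia", "1,200,000"],
    ["Brigadoon", "Sylvania", "950,000"],
    ["Cibola", "Freedonia", "3,400,500"],
]

variants = dict(sorted_variants(header, rows))
for label, ordered in variants.items():
    print(label, [r[0] for r in ordered])
```

Each variant could then be rendered and fed through the same extraction test, so that any errors that only show up under a particular row order become visible.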


Additionally, using any Wiki page is misleading, as LLMs have seen their format many times during training, and can probably reproduce the original HTML from the stripped version fairly well.

Instead, using some random, messy, scattered-with-spam site would be a much more realistic test environment.

Also, it can get partial credit on some of these questions without feeding in any data at all.

Good points. But I feel like even with the cities article it could still ‘cheat’ by recognising what the data is supposed to be and filling in the blanks. Does it even need to be real though? What about generating a fake article to use as a test so it can’t possibly recognise the contents? You could even get GPT to generate it: just give it the ‘Largest cities’ HTML and tell it to output identical HTML but with all the names and statistics changed randomly.

> You could even get GPT to generate it

This isn't a good idea if you want a fair test. See https://gwern.net/doc/reinforcement-learning/safe/2023-krako..., specifically https://arxiv.org/abs/1712.02950.
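
One way to get a scrambled table without involving a model at all is a small script. This is only a sketch under strong assumptions: cells are flat `<td>text</td>` with no nested markup, numbers are integers with optional comma grouping, and `scramble_cells` is a hypothetical helper. The sample HTML is invented.

```python
import random
import re
import string


def fake_name(rng, length):
    """Random capitalised letter string; not meant to look like a real name."""
    n = max(length, 3)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n)).capitalize()


def fake_number(rng, text):
    """Random integer with the same digit count and comma style as the original."""
    n = len(re.sub(r"\D", "", text))
    value = rng.randrange(10 ** (n - 1), 10 ** n)
    return f"{value:,}" if "," in text else str(value)


def scramble_cells(html, seed=0):
    """Replace the text of every flat <td> cell; header cells are left alone."""
    rng = random.Random(seed)
    mapping = {}  # same original text -> same replacement, so repeats stay consistent

    def replace(match):
        cell = match.group(1).strip()
        if cell not in mapping:
            if re.fullmatch(r"\d[\d,]*", cell):
                mapping[cell] = fake_number(rng, cell)
            else:
                mapping[cell] = fake_name(rng, len(cell))
        return f"<td>{mapping[cell]}</td>"

    return re.sub(r"<td>(.*?)</td>", replace, html, flags=re.S), mapping


sample = """<table>
<tr><th>City</th><th>Country</th><th>Population</th></tr>
<tr><td>Avalon</td><td>Freedonia</td><td>1,200,000</td></tr>
<tr><td>Cibola</td><td>Freedonia</td><td>3,400,500</td></tr>
</table>"""

scrambled, mapping = scramble_cells(sample, seed=42)
print(scrambled)
```

Keeping the `mapping` means the answer key for any test question can be derived from the original table, and a fixed seed makes the run repeatable.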

Thanks a lot for the feedback! You're right, this is much better input data. I'll re-run the code with these tables!

Also, is there a chance GPT is relying on its training data for some questions? I.e., you might not even need to give it the table.

To be sure, shouldn't you be asking questions based on data that is guaranteed not to be in its training set?
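
A cheap way to check for this is a closed-book baseline: ask the same questions with and without the table and compare the scores. A minimal sketch follows. `ask_model` stands for whatever function calls the LLM (here replaced by a stub), the questions and table are made up, and substring matching is a deliberately crude scoring rule.

```python
def accuracy(ask_model, questions, context=None):
    """Fraction of questions whose expected answer appears in the model's reply."""
    correct = 0
    total = 0
    for question, expected in questions:
        prompt = question if context is None else f"{context}\n\n{question}"
        answer = ask_model(prompt)
        correct += expected.lower() in answer.lower()
        total += 1
    return correct / total if total else 0.0


# Stub standing in for a real LLM call: it "remembers" one fact from training
# and can otherwise only answer when the table is in the prompt.
memorised = {"Which country is Cibola in?": "Freedonia"}


def stub_model(prompt):
    for question, answer in memorised.items():
        if question in prompt:
            return answer
    if "Avalon" in prompt and "1,200,000" in prompt:
        return "1,200,000"
    return "I don't know"


table = "Avalon | Freedonia | 1,200,000\nCibola | Freedonia | 3,400,500"
questions = [
    ("What is the population of Avalon?", "1,200,000"),
    ("Which country is Cibola in?", "Freedonia"),
]

closed_book = accuracy(stub_model, questions)               # no table supplied
open_book = accuracy(stub_model, questions, context=table)  # table supplied
print(f"closed-book: {closed_book:.2f}, open-book: {open_book:.2f}")
```

If the closed-book score is well above zero, part of the open-book score is coming from memory rather than from reading the table.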

LLMs are trained on Wikipedia (and, since it's high-quality, openly licensed data, probably repeatedly), so this test is contaminated.