This thread reminds me of a competition I once joined where we were supposed to fine-tune an LLM to fill in trivia answers, and we were expressly disallowed from training on the validation set.
However: we were allowed to pick any base model in a given repo. All of the teams that “won” did so for the same reason: they had all picked the same base model (whereas a majority of teams picked the given default), presumably the one that had at some point been trained on the most favorable data for this particular challenge.
It was quite silly. Had everyone used the same base model, we’d have had a more interesting problem (more about NLP and alignment than about picking the ‘best’ model).
Well, in this case we're literally asking whether the model can remember new facts, not whether it can generalize, so it seems like a legit first-level test. A second-level test might be whether it can answer a broader question that incorporates that specific knowledge.