by krackers
178 days ago
Hm, I did notice this bit: "list of up to the last 50 targets you've done (so you don't get duplicate targets too frequently)", which seems to invalidate some of the methodology. If the target is never among the last 50 you've done, that skews the sample space a bit. The fact that this needs to be done also seems to imply the set of images is not that large... And this is made worse by the fact that the LLM-based auto-scoring explicitly uses the last 10 as decoy targets:

>When you submit a session, the system collects your last 10 targets (including the current target) to create a pool of possible matches. A multimodal AI agent is presented with your complete session (including all drawings, text, and data) along with all 10 targets from the pool. The agent is instructed to analyze and rank the targets based on how well they match the session content.

The protocol otherwise seems good, but the specific carve-outs here would seem to bias results.

The source for the judging is at https://github.com/Social-RV/comparative-judging, which is the part that would need to be studied carefully. At first glance, it exposes raw filenames to the LLM, which might bias things. The ranking logic also seems a bit sketchy: it does some tournament-style elimination thing which I haven't analyzed thoroughly, but if decoys are eliminated in an earlier round it could bias things compared to just asking the LLM to order the 10 images by similarity in a single pass, which is obviously unbiased.
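To make the elimination-order worry concrete, here is a minimal sketch. It is not the logic in the linked repo, which I haven't traced. It assumes a hypothetical single-elimination bracket where adjacent slots are paired and an odd entry gets a bye, and a judge with no real signal (every comparison is a fair coin flip). The function name `win_probabilities` is made up for illustration.

```python
from fractions import Fraction


def win_probabilities(n):
    """Exact chance that each starting slot wins a single-elimination bracket
    when every pairwise comparison is a fair coin flip (a judge with no signal).

    Assumed bracket shape (hypothetical, not taken from the real judging code):
    adjacent entries are paired each round, and an odd entry out gets a bye.
    """
    # Each bracket entry maps: starting slot -> probability that this slot
    # is the one currently occupying the entry.
    entries = [{slot: Fraction(1)} for slot in range(n)]
    while len(entries) > 1:
        next_round = []
        for k in range(0, len(entries) - 1, 2):
            left, right = entries[k], entries[k + 1]
            # Whoever occupies either side advances with probability 1/2.
            merged = {slot: p / 2 for slot, p in left.items()}
            merged.update({slot: p / 2 for slot, p in right.items()})
            next_round.append(merged)
        if len(entries) % 2 == 1:
            next_round.append(entries[-1])  # odd entry advances on a bye
        entries = next_round
    return [entries[0][slot] for slot in range(n)]


probs = win_probabilities(10)
for slot, p in enumerate(probs):
    print(f"slot {slot}: {p}")
```

Under these assumptions the last two slots each win with probability 1/4 and the other eight with 1/16, so a target that always sits in a fixed slot would not get a uniform 1/10 chance even when the judge knows nothing. With 8 entries (a power of two) the same bracket is uniform.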
I think to counter this, you'd need to model your null hypothesis as the distribution that results when you have the LLM score a deliberately incorrect image against your target + decoys.
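A rough sketch of what that empirical null could look like, under one reading of the idea: swap a known-wrong image into the slot the real target would occupy and record where the judge ranks it. Everything here is hypothetical: `rank_fn` stands in for the LLM judge (it takes a session and a pool and returns the pool ordered best-first), and the stub judge and data are made up purely to exercise the function.

```python
from collections import Counter


def null_rank_distribution(sessions, decoy_pools, wrong_targets, rank_fn):
    """Estimate the rank distribution of a deliberately incorrect 'target'.

    For each session, the wrong image is appended to the decoys (taking the
    slot the real target would occupy), the judge orders the pool, and the
    1-based rank of the wrong image is tallied.
    """
    counts = Counter()
    pool_size = None
    for session, decoys, wrong in zip(sessions, decoy_pools, wrong_targets):
        pool = list(decoys) + [wrong]
        pool_size = len(pool)
        ordering = rank_fn(session, pool)  # best match first
        counts[ordering.index(wrong) + 1] += 1
    total = sum(counts.values())
    return {rank: counts[rank] / total for rank in range(1, pool_size + 1)}


def last_slot_biased_judge(session, pool):
    """Stub judge (not a real LLM call) that always favors the last slot."""
    return [pool[-1]] + pool[:-1]


# Made-up data: 3 sessions, 9 decoys each, plus one wrong image per session.
sessions = ["s0", "s1", "s2"]
decoy_pools = [[f"decoy{i}_{j}" for j in range(9)] for i in range(3)]
wrong_targets = ["wrong0", "wrong1", "wrong2"]

null = null_rank_distribution(
    sessions, decoy_pools, wrong_targets, last_slot_biased_judge
)
print(null)
```

With this deliberately biased stub, the wrong image lands at rank 1 every time, which is the point: real hit rates would be compared against whatever distribution the actual judge produces here, rather than against an assumed uniform 1-in-10.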