I have a paper that got denied but it was about using 2AFC sorting to do this instead of elo. It has a defined end unlike elo scores. The code is on my github and focuses on humans sorting images but basically if you have a python sort function, you put your comparison as the key instead of assigning the comparison a numeric score. Then the algorithm does the rest
It was actually done to counter Elo based approaches so there's some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author looks to have diverged a bit. Haven't checked out his code. https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, not sure. This work was done for the FDA's medical imaging device evaluation division
I was going to mention this approach as well. The problem with the OP is that it has assumption bias and the entire chain is based on that assumption. It’s novel. But the original idea was to more evenly distribute scores so you can find real relevance and I think 2AFC is better. But I don’t have time to verify and post a paper about it.
It's probably because that's what we used, but nAFC has been my go-to since I first learned about it. Literally any time there's a ranking, even for dumb stuff like tier list videos on YouTube, they're too arbitrary. Ok you ranked this snack an 8/10. Based on what? And then they go back and say "actually I'm going to move that to a 7". AFC fixes all of that.