There really is a lot of cherry picking, etc. going on in this area. Papers released without code and weights or even data make reproduction and validation nearly impossible.
Yeah, it's stunning to me that people can apparently run experiments with code, produce results with code, write a paper about that code, and then release only garbled, poorly written prose without also releasing that code (or at the very least a video demonstrating the results).
There's also the problem that most complex neural networks are highly sensitive to initial weights. My friends and I have frequently tried to reproduce famous papers, and it's remarkable how often getting the initial settings almost exactly right is the key to hitting the reported benchmark.
This is a problem because cherry picking is essentially built into the framework.
If I were building a ranking algorithm and just kept picking a random seed to arbitrarily sort a list of numbers until it was correct, most people would consider that obviously cheating. However, if I did the same thing but stuck 3 dense matrices between the seed and the list to be ranked, it would be considered AI.
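To make the seed-picking point concrete, here's a rough sketch in plain Python. It only covers the first half of the analogy (no dense matrices), and the function names, the five-item list, and the seed ranges are all made up for illustration. It searches for a seed whose shuffle happens to come out sorted, then computes the number you'd have to report if you were being honest: how often a seed actually works.

```python
import random


def shuffle_with_seed(items, seed):
    """Return a copy of items shuffled by a generator seeded with `seed`."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


def cherry_pick_seed(items, max_seeds=100_000):
    """Try seeds 0, 1, 2, ... and return the first one whose shuffle is sorted."""
    target = sorted(items)
    for seed in range(max_seeds):
        if shuffle_with_seed(items, seed) == target:
            return seed
    return None  # no lucky seed found within the budget


def success_rate(items, n_seeds=20_000):
    """Fraction of the first n_seeds seeds that happen to sort the list."""
    target = sorted(items)
    hits = sum(shuffle_with_seed(items, s) == target for s in range(n_seeds))
    return hits / n_seeds


items = [3, 1, 4, 2, 5]  # hypothetical list to "rank"

# The cherry-picked result: one seed that makes the "algorithm" look perfect.
seed = cherry_pick_seed(items)
print("lucky seed:", seed, "->", shuffle_with_seed(items, seed))

# The honest result: with 5 items there are 120 orderings, so only
# roughly 1 seed in 120 should work.
rate = success_rate(items)
print("fraction of seeds that work:", rate)
```

Reporting only the lucky seed makes the method look flawless, while the rate across seeds shows it's pure chance. The same habit applies to real models: report results across several seeds, not just the best one.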