It is interesting because you can't really A/B test the response time with the different text, since the response time includes the human processing time of the chess problem.
Sure you can. You're just adding two (probably normal-ish) distributions: the distribution of human response times for processing the chess move, and the distribution of human response times for parsing the instructions.
You are right, of course, and I knew I was going to regret the use of even the toned-down term "probably normal-ish". The question is whether one can meaningfully ask which text is faster to parse, despite the "noise" that comes from the time to actually perform the chess move. I suspect the answer will be 'yes'.
I think the answer will be 'no'. Solving a chess problem is a learned skill, not as innate as reading and understanding text, and the difference in comprehension time between two similar text passages is much smaller than (<<) the spread in time to solve the chess problem between users of low and high skill.
Why would the distribution of user skill differ across test groups? There's no reason it should. The expected time to solve the puzzle itself would be the same across both groups.
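To make the argument above concrete, here is a rough simulation sketch. Everything in it is an assumption made up for illustration, not data from any real site: the log-normal shape and parameters for puzzle-solving time, the 3-second baseline parse time, the 0.5-second difference between the two texts, and the function name `simulate_ab`. The point is only that when both groups draw puzzle times from the same distribution, the difference in group means estimates the parse-time difference, and the wide skill spread shows up as a larger standard error rather than as bias.

```python
import random
import statistics


def simulate_ab(n_per_group=50_000, parse_delta=0.5, seed=0):
    """Simulate an A/B test where response time = puzzle time + parse time.

    All distributions and parameters are illustrative assumptions.
    Returns (difference in mean response time B - A, its standard error).
    """
    rng = random.Random(seed)

    def total_time(parse_mean):
        # Puzzle-solving time in seconds: wide, skewed spread across skill
        # levels, drawn from the SAME distribution in both groups.
        puzzle = rng.lognormvariate(3.0, 0.8)
        # Time to parse the instruction text, clipped at zero.
        parse = max(0.0, rng.gauss(parse_mean, 0.5))
        return puzzle + parse

    group_a = [total_time(3.0) for _ in range(n_per_group)]
    group_b = [total_time(3.0 + parse_delta) for _ in range(n_per_group)]

    diff = statistics.fmean(group_b) - statistics.fmean(group_a)
    # Standard error of a difference between two independent sample means.
    se = (statistics.variance(group_a) / n_per_group
          + statistics.variance(group_b) / n_per_group) ** 0.5
    return diff, se


if __name__ == "__main__":
    diff, se = simulate_ab()
    print(f"estimated parse-time difference: {diff:.3f} s (SE {se:.3f} s)")
```

Under these assumed numbers the puzzle-time noise is far larger than the effect, so the standard error only becomes small relative to the 0.5-second difference with tens of thousands of users per group. Whether a real site could detect a difference therefore comes down to sample size, not to whether the skill distribution differs between groups.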
Are you the creator of lichess? Just wanted to congratulate you on the great piece of software you've got there. IMHO it is one of the best multi-player chess platforms out there. It is so straightforward and the UX is amazing. The "analyze game" link at the end of the game is also very good.
Even though I am only a ~1200 Elo player, I play almost daily and enjoy your platform a lot :)