In reality, the authors of the papers probably understood that the OpenAI team had artificially readjusted these biases through RLHF and that there is nothing left to find there. The one exception is that the probing still works when the words are written with typos, because no manual examples of “redressing biases” were provided with such typos. For example:
football -> fotbal
cousin -> cosin
teacher -> teachr
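
If you want to try this yourself, here is a minimal sketch that swaps the three misspellings above into a prompt before you send it to a model. The `misspell` helper and the sample sentence are my own illustration, not something taken from the papers, and the model call itself is left out.

```python
import re

# Misspellings copied from the three examples above.
TYPOS = {"football": "fotbal", "cousin": "cosin", "teacher": "teachr"}

def misspell(prompt: str, typos: dict[str, str] = TYPOS) -> str:
    """Replace each whole-word occurrence of a listed word with its misspelling."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, typos)) + r")\b",
        re.IGNORECASE,
    )
    # Look up the lowercase form so "Teacher" and "teacher" both match.
    return pattern.sub(lambda m: typos[m.group(0).lower()], prompt)

# Hypothetical probe sentence, for illustration only.
print(misspell("My cousin is a teacher who plays football."))
```

Matching on whole words only means longer words such as "footballer" are left alone, so the misspelled prompt differs from the original in exactly the listed words.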
You can’t fit everything in 160 chars.