This can be heavily skewed by the small N. If the people in both groups happen to closely match in their distributions of accuracy for this self-assessment, then this would be correct. A larger N would make both groups much more likely to match the real distribution, which would in turn make them more likely to match each other.
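To put a rough number on the small-N point, here is a minimal simulation sketch. It is not based on the study's data: it assumes each person's self-assessment error is drawn from the same standard normal distribution in both groups, and the group sizes (10 and 200), the trial count, and the function name `mean_gap` are all made up for illustration.

```python
import random
import statistics


def mean_gap(n, trials=2000, seed=0):
    """Average absolute difference between the means of two groups of size n.

    Both groups are drawn from the SAME assumed distribution (standard normal),
    so any gap between them is pure sampling noise.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        gaps.append(abs(statistics.fmean(group_a) - statistics.fmean(group_b)))
    return statistics.fmean(gaps)


# Hypothetical group sizes, chosen only to contrast "small" and "large".
small = mean_gap(10)
large = mean_gap(200)
print(f"typical gap between groups, n=10:  {small:.3f}")
print(f"typical gap between groups, n=200: {large:.3f}")
```

Under these assumptions the typical gap between two identical-in-expectation groups shrinks roughly with the square root of N, so with small groups a mismatch in self-assessment accuracy can show up by chance alone.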
Added to that, creatine could influence how close you feel to failure.
I'd like to see some data for participants training until failure. That would still be somewhat subjective, but I'd argue less so.