Spelling challenges are always going to be inherently difficult for a token-based LM. It doesn't actually "see" letters, only tokens that usually cover several characters at once. That makes them a poor test of performance (unless this is actually the kind of question you're going to ask it regularly).
I've found it's more reliable to ask it to write some JavaScript that returns how many letters are in a word. Works even with Llama 7B with some nudging.
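For reference, the kind of function I mean looks roughly like this. It's a minimal sketch, not any particular model's output: the name `countLetters` is just my own pick, and I'm assuming "letters" means Unicode letters only, so apostrophes, digits, and spaces aren't counted.

```javascript
// Hypothetical helper: counts the letters in a word.
// Assumption: only Unicode letters count; punctuation, digits, and spaces are skipped.
function countLetters(word) {
  // \p{L} matches any Unicode letter; the "u" flag enables property escapes,
  // and the "g" flag makes match() return every occurrence.
  const letters = word.match(/\p{L}/gu);
  // match() returns null when nothing matches (e.g. an empty string).
  return letters === null ? 0 : letters.length;
}

console.log(countLetters("strawberry")); // 10
```

The point is that the model only has to produce code it has seen thousands of variations of, and the actual counting is done by the interpreter instead of by the model.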
Falcon fails. GPT-3.5 also fails this test. GPT-4 gets it right. I suspect that GPT-4 is just large enough to have developed a concept of counting, whereas the others are not. Alternatively, it's possible that GPT-4 has memorized the answer from its more extensive training set.