Once the context gets long, given the limited token window (context memory) we currently have, it seems to go insane rather fast. I'd like to test on the 32k model to see how the same prompting behaves there.
I thought I was using GPT-4, but it appears that's only available to paying customers so far? The question I kept getting wrong answers to was "Can you make up a palindrome that starts with 'Dude'?" If you want to try that with GPT-4, I'd be interested to see whether it can do it now, or at least knows to say "I can't".
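For anyone who does try it, here's a rough sketch of how you could check a reply mechanically instead of eyeballing it. The function names are my own, and I'm assuming the usual convention that a palindrome ignores case, spaces, and punctuation.

```python
import re


def is_palindrome(text: str) -> bool:
    # Assumption: only letters and digits count; case, spaces,
    # and punctuation are ignored.
    letters = re.sub(r"[^a-z0-9]", "", text.lower())
    return letters == letters[::-1]


def check_answer(answer: str, prefix: str = "Dude") -> bool:
    # The reply has to start with the requested word AND read the
    # same in both directions.
    starts_ok = answer.lower().startswith(prefix.lower())
    return starts_ok and is_palindrome(answer)


if __name__ == "__main__":
    for reply in ["Dude, edud", "Dude, where's my car?"]:
        print(repr(reply), check_answer(reply))
```

Note this only checks the letters; something like "Dude, edud" passes even though it isn't meaningful English, so whether an answer counts as a "real" palindrome is still a judgment call.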