This is particularly impressive for Elixir, which isn't a language GPT-4 is especially focused on. I imagine the accuracy for Python is extremely good, maybe near perfect for this kind of benchmark if the model is allowed to see error messages.
It's also possible GPT-4 is better at writing Elixir because there are fewer beginners/students writing Elixir code and polluting the training data with bad practices or faulty code.