I don't recall the link, but there was a GitHub repo with comparisons of CFFI implementations in different languages, and from what I remember Python was about 3 orders of magnitude slower than, say, Lua or OCaml.
I'm not sure which specific benchmark you are referring to, but it appears it was measuring the speed of Python looping and not the speed of the FFI.
As far as I can tell, the FFI call itself is not expensive as long as the underlying type does not have to be converted. But if you expect to call it millions of times a second, the per-call overhead adds up and you're going to have trouble. The solution is to move the loop inside the C code.
For example, suppose you want to FFT a bunch of signals. You don't repeatedly call FFT for each of the signals - you pass the entire data structure to the C code.
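To make the "move the loop inside the C code" point concrete, here is a rough sketch using only the standard library. It is not an FFT: it uses `ctypes.memset` as a stand-in C function, and the buffer size `N` and the helper names `fill_per_call`, `fill_bulk`, and `timed` are made up for illustration. The first helper crosses into C once per byte; the second hands over the whole buffer in a single call.

```python
import ctypes
import time

N = 200_000  # arbitrary buffer size, chosen only for illustration


def fill_per_call(buf, value):
    """Cross the Python/C boundary once per byte (the loop runs in Python)."""
    base = ctypes.addressof(buf)
    for i in range(len(buf)):
        ctypes.memset(base + i, value, 1)


def fill_bulk(buf, value):
    """Cross the boundary once; the loop runs inside C's memset."""
    ctypes.memset(buf, value, len(buf))


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


buf_loop = ctypes.create_string_buffer(N)
buf_bulk = ctypes.create_string_buffer(N)

t_loop = timed(fill_per_call, buf_loop, 7)
t_bulk = timed(fill_bulk, buf_bulk, 7)

print(f"per-call: {t_loop:.4f} s, bulk: {t_bulk:.6f} s")
```

Both buffers end up with identical contents. The exact timings depend on the machine, but the per-call version pays the Python loop and call overhead `N` times while the bulk version pays it once.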
If you want to convert to Python lists, it is going to take time. Not sure about Python arrays.
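On the arrays question, a small sketch that may help, assuming "Python arrays" means the standard `array` module and that the C side is reached through `ctypes` (the variable names are made up). An `array.array` exposes its memory through the buffer protocol, so `ctypes` can view it in place without copying, whereas `list(...)` has to build a separate Python float object for every element.

```python
import array
import ctypes

samples = array.array("d", [1.0, 2.0, 3.0, 4.0])

# View the array's memory as a C double[4]; no data is copied.
c_view = (ctypes.c_double * len(samples)).from_buffer(samples)

c_view[0] = 42.0         # a write through the C view...
shared = samples[0]      # ...shows up in the array, since it is the same memory

as_list = list(samples)  # builds a new list with one Python float per element
```

So the cost is mostly in the list conversion, not in handing the array to C. One caveat: while `c_view` is alive, the array cannot be resized.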