The Python function is implemented in C and uses a faster algorithm [1], and this particular factorial is so small that its result comes straight from a lookup table [2]. That makes it a strange and very lopsided choice for a demo.
My guess is that the small fixed overhead of calling into Mojo caused this speed discrepancy, and that with a larger factorial (one still within the overflow limits) the overhead would become negligible, as the second example suggests. It is similar to how JAX code is slower than NumPy code for small operations on CPUs but can be much faster for larger ones.
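For anyone who wants to check the Python side of this themselves, here is a rough sketch that times `math.factorial` against a naive pure-Python loop at a few sizes. The helper names `factorial_loop` and `time_per_call` are made up for illustration, the sizes are arbitrary, and it does not touch Mojo at all, so it only shows how much of a small-input benchmark is fixed per-call cost rather than actual multiplication work.

```python
import math
import timeit


def factorial_loop(n: int) -> int:
    """Naive pure-Python factorial (hypothetical stand-in for a hand-written demo loop)."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def time_per_call(func, n: int, number: int = 2000) -> float:
    """Average wall-clock seconds per call of func(n), measured with timeit."""
    return timeit.timeit(lambda: func(n), number=number) / number


# 20! is the largest factorial that fits in a signed 64-bit integer;
# 200 is an arbitrary larger size that Python's big ints handle fine.
for n in (10, 20, 200):
    t_builtin = time_per_call(math.factorial, n)
    t_loop = time_per_call(factorial_loop, n)
    print(f"n={n:>3}  math.factorial: {t_builtin:.2e} s  loop: {t_loop:.2e} s")
```

The absolute numbers will vary by machine and Python version, so treat the output as a relative comparison only. For tiny `n` both timings are mostly call overhead, which is the same reason a small factorial says little about how fast either language is.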
[1] https://github.com/python/cpython/blob/0d9d48959e050b66cb37a...
[2] https://github.com/python/cpython/blob/0d9d48959e050b66cb37a...