Any language that arranges data in arrays or large structs does well on modern machines, especially with vector and SIMD extensions. To be fair to Forth, there exist machines that are well suited to running "threaded code"[1]; they are just not machines that are commonly available today.
I don't buy this. I understand your use of the term 'threaded', but these jumps are unconditional and can therefore be incorporated into the instruction pipeline with little or no overhead. Here's a very old SO post; CPUs won't have got worse since then:
"But in general, on modern processors, there is minimal cost for an unconditional jump. It's basically pretty much free apart from a very small amount of instruction cache overhead. It will probably get executed in parallel with neighbouring instructions so might not even cost you a clock cycle. "
[1] "threaded" has nothing to do with multithreaded: https://en.wikipedia.org/wiki/Threaded_code
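
For anyone unfamiliar with the term, here is a rough sketch of what a "thread" looks like: a list of addresses of primitive routines, walked by a tiny inner loop. This is only an illustration, not how any particular Forth does it. The names (VM, Prim, run_thread, demo, the op_ routines) are all made up for the example, and it uses function-pointer calls so that it stays portable C, where a real Forth would typically jump from one primitive to the next via NEXT.

```c
/* Hypothetical minimal VM: a data stack plus an instruction pointer into a
   "thread", i.e. an array holding the addresses of primitive routines. */
typedef struct VM VM;
typedef void (*Prim)(VM *);

struct VM {
    long stack[16];   /* data stack, small fixed size for the sketch */
    int sp;           /* index of the next free stack slot */
    const Prim *ip;   /* points at the next entry in the thread */
    int running;      /* cleared by op_halt to stop the inner loop */
};

static void push(VM *vm, long v) { vm->stack[vm->sp++] = v; }
static long pop(VM *vm) { return vm->stack[--vm->sp]; }

/* Primitives: each one does a small piece of work on the stack. */
static void op_lit2(VM *vm) { push(vm, 2); }
static void op_lit3(VM *vm) { push(vm, 3); }
static void op_add(VM *vm)  { long b = pop(vm); long a = pop(vm); push(vm, a + b); }
static void op_mul(VM *vm)  { long b = pop(vm); long a = pop(vm); push(vm, a * b); }
static void op_halt(VM *vm) { vm->running = 0; }

/* Inner interpreter: fetch the next address from the thread and transfer
   control to it. The target comes from memory, not from the instruction. */
long run_thread(const Prim *thread) {
    VM vm = { .sp = 0, .ip = thread, .running = 1 };
    while (vm.running) {
        Prim p = *vm.ip++;
        p(&vm);
    }
    return pop(&vm);
}

/* The thread for (2 + 3) * 3, written the way Forth would: 2 3 + 3 * */
long demo(void) {
    static const Prim thread[] = {
        op_lit2, op_lit3, op_add, op_lit3, op_mul, op_halt
    };
    return run_thread(thread);
}
```

Calling demo() runs the thread and returns 15. The "program" here is just data (the array), and every step goes through the fetch-and-transfer in run_thread.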