For the same reason that C/C++ allow inline assembly? Languages come in roughly three speeds: slow (e.g. Python/R), mostly not slow (e.g. Java/Go), and not slow (e.g. C/Rust). If you want actually fast code (e.g. the speed of BLAS/FFTW etc.), you need the combination of a not-slow language, code generation, and often hand-coded assembly for the most performance-critical parts.
I noticed you didn't mention Julia explicitly this time, because once you outline the abstractions like this it seems silly to claim that something about Julia magically settles where these layers sit and what they are for. I can write a PySpark job based on a tutorial that would run circles around a single-core Julia process that was designed with contradictory requirements. I just don't see how Julia gets away with claiming it solves all of this on the first page of its documentation without a ton of qualifiers... except to say that this is Julia, and that's what they do: they make bold claims that obfuscate what performance is and where it comes from.
To be explicit about where Julia fits in here: Julia is a "not slow" language (you could make an argument for it being on the faster end of "mostly not slow" due to GC) that also has enough high-level features (higher-order functions, macros, memory management, general ease of use) to work as a high-level language. You absolutely can write a distributed Python codebase that runs faster than single-core Julia, but doing so will likely be harder than writing the distributed/threaded Julia code that is way faster than PySpark.
Yeah, citation needed on that one, though it was a dumb hypothetical on my part to illustrate the problem. Another hypothetical: how do you get junior people to support your inline ASM? If the language makes this easy to do, it makes the technical debt that much more rampant.
You seem really hung up on this inline asm thing. It's not like most Julia code is just inline assembly. It's something that, at a rough estimate, 0.5% of packages use to squeeze out the final drops of performance, and it then gets wrapped in an API that looks like normal Julia code. This isn't any different from C/C++, where some low-level codebases also contain compiler directives or inline assembly.
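To make the "wrapped in an API" point concrete, here is a minimal sketch of that pattern on the C side, assuming GCC or Clang. The names popcount64 and popcount_portable are made up for this example and are not from any real library. The assembly path is only compiled on x86-64 builds where the compiler reports POPCNT support (for instance with -mpopcnt); everywhere else the plain C loop is used, and callers cannot tell the difference.

```c
#include <stdint.h>

/* Hypothetical portable fallback: count set bits one at a time. */
static unsigned popcount_portable(uint64_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical public API: an ordinary function from the caller's side. */
static unsigned popcount64(uint64_t x) {
#if defined(__GNUC__) && defined(__x86_64__) && defined(__POPCNT__)
    /* Fast path: a single POPCNT instruction via GCC-style extended asm. */
    uint64_t result;
    __asm__("popcnt %1, %0" : "=r"(result) : "r"(x) : "cc");
    return (unsigned)result;
#else
    /* No POPCNT guarantee at compile time: use the plain C version. */
    return popcount_portable(x);
#endif
}
```

Code that calls popcount64(mask) never touches the assembly, so whoever maintains the calling code only has to understand a normal function signature. In practice you would usually reach for a compiler builtin before hand-written asm; the sketch only shows the shape of the wrapping.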
The way it gets brought up in arguments (if what you're doing in Julia is slow, you can just put in the ASM directly) leaves me with the impression that it is nonetheless a core part of the "faster than <x>" claims at least. And that's a cop-out.