Unfortunately all experience shows that both programmers and compilers are much worse at parallelizing code than CPUs are. There have been many attempts at architectures that allowed compilers or programmers to express code in more parallel ways (VLIW architectures, Intel's Itanium, IBM's Cell used in the PS3) and they all categorically failed.
Successful processors offer a sequential execution model, and handle the parallelism internally. Even CUDA is designed somewhat like this: you express your code in a largely linear fashion, and rely on the NVIDIA-created compiler and on the GPU itself to run it in parallel.
CUDA is quite different. In CUDA you do not express parallel code in a linear fashion, a.k.a. sequential fashion, hoping that CUDA will determine the dependence chains and extract them from the sequential code and execute them in parallel.
The main feature of CUDA is that in order to describe an algorithm that is applied to an array, you just write the code that applies the algorithm to an element of that array, like writing only the body of a "for" loop.
Then the CUDA run-time will take care of creating an appropriate number of execution threads, taking into account the physical configuration of the available GPU, e.g. how many array elements can be processed by a single instruction, how many execution threads can share a single processing core, how many processing cores exist in the GPU, and so on. When there are more array elements than the GPU resources can process at once, CUDA will add some appropriate looping construct, to ensure the processing of the entire array.
The CUDA programmer writes the code that processes a single element array, informing thus the CUDA run-time that this code is independent of its replicas that process any other array element, except when the programmer references explicitly other array elements, which is normally avoided.
The task of a CPU able to do non-sequential instruction execution (a.k.a. out-of-order execution) and simultaneous initiation of multiple instructions (a.k.a. superscalar instruction execution) is quite different.
The main problem for such a CPU is the determination of the dependence relationships between instructions, based on examining the register numbers encoded in the instructions. Based on the detected dependencies and on the availability of the operands, the CPU can schedule the execution of the instructions, in parallel over multiple execution units.
There exists an open research problem, whether there is any better way to pass the information about the dependencies between instructions from the compiler, which already knows them, to the CPU that runs the machine code, otherwise than by using fake register numbers, whose purpose is only to express the dependencies, and which must be replaced for execution with the real numbers of the physical registers, by the renaming mechanism of the CPU.
Agreed, CUDA is very different. But still, it's another example where programmers just write sequential single flow of execution code, and it gets executed in parallel according to "external" rules.
Out-of-order execution doesn't imply parallelism, it was designed from the beginning to work around the input data availability problem.
In speculative execution and branch predictors, prefetch may seem just as a nice bonus, but given that nowadays CPU performance is largely bottlenecked by memory access, prefetch resulting from these techniques will often come out as the dominant performance factor.
It's still a form of parallelism, that could in principle be written into the program instead of being automatically implemented by the processor.
For example, in hand crafted assembly programs, it's sometimes common to know how long a fetch operation lasts, and manually schedule operations such that they can be executed in parallel with the fetch operation.
Theoretically a high level language could also be designed to expose this kind of logic to the programmer. A program in such a language would be expressed as a set of very very short threads that can be interleaved by the compiler given precise knowledge of instruction timers.
OoO came about after multi-issue architectures were starved for instructions to execute due to on-chip blockers like execution unit availability, data hazards, pipeline bubbles, register availability, branching, cache ports. You can call those input data availability problems but it's not availability from offchip memory. So in actual history, yes it was for parallelism (keeping multiple execution units busy).
OoO did have the side benefit from possibly executing past a few memory stalls but those were secondary. OoO reodering resources were sized for addressing the stalls from on-chip timescale things. Today the resources are bigger, but even bigger is the relative memory latency (how many insns you could execute in the time it takes to service a main memory fetch that your pipeline depends on).
Successful processors offer a sequential execution model, and handle the parallelism internally. Even CUDA is designed somewhat like this: you express your code in a largely linear fashion, and rely on the NVIDIA-created compiler and on the GPU itself to run it in parallel.