The gettimeofday(2) vDSO is pure-userspace code. Why not, then, a futex(2) vDSO that does the while loop + compare-and-swap in userspace, and only then makes a real call to the futex(2) syscall?
This part of the optimisation was well known, and had been for years before the futex was invented, so there's no need to provide it as a vDSO or do any other special work: userspace can (and was ready to) do that part anyway.
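To make the split concrete, here is a minimal sketch of a lock whose uncontended path is a single compare-and-swap in userspace and whose contended path is the only place that enters the kernel. It assumes Linux and glibc's generic syscall(2); the names `sys_futex`, `futex_mutex` and `run_counter_demo` are made up for illustration, and the three-state scheme is one common way to do this, not the only one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper over the raw futex syscall. */
static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* 0 = unlocked, 1 = locked, 2 = locked and there may be waiters. */
typedef struct { _Atomic uint32_t state; } futex_mutex;

static void futex_mutex_lock(futex_mutex *m) {
    uint32_t expected = 0;
    /* Fast path: one CAS, no kernel involvement at all. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;
    /* Slow path: mark the lock contended; sleep only while it stays 2. */
    while (atomic_exchange(&m->state, 2) != 0)
        sys_futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
}

static void futex_mutex_unlock(futex_mutex *m) {
    uint32_t prev = atomic_exchange(&m->state, 0);
    assert(prev != 0); /* unlocking an unlocked mutex is a caller bug */
    if (prev == 2)     /* someone may be asleep: wake one waiter */
        sys_futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}

static futex_mutex demo_lock;
static long demo_counter;

static void *demo_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        futex_mutex_lock(&demo_lock);
        demo_counter++; /* plain increment, protected by the lock */
        futex_mutex_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs four workers and returns the final counter value. */
long run_counter_demo(void) {
    pthread_t t[4];
    demo_counter = 0;
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&t[i], NULL, demo_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

Nothing in the fast path needs help from the kernel or a vDSO; the syscall only shows up once the CAS has already failed.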
The futex is clever because previously the heavyweight locking was in the kernel, when all you actually needed was this very flimsy wait feature. BeOS, for example, has a "Benaphore" construction where you initialise an expensive heavyweight lock but try the userspace trick before actually needing to take it. So that looks superficially just as good as a futex...
... and then you realise, oh, the underlying locks need kernel RAM to store their state, so the OS can only allow each process so many of them, and thus you can't have that many Benaphores; they were expensive after all. A futex, by contrast, doesn't use any kernel resources when you aren't calling the OS. Linux doesn't mind if your program uses a billion futexes: any particular thread can only be waiting on one of them, and Linux is only tracking the waiting.
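A rough sketch of the Benaphore shape, for comparison. This is an approximation rather than BeOS code: a POSIX `sem_t` stands in for the kernel semaphore (on Linux `sem_t` is itself built on futexes, so it only illustrates the structure), and the `benaphore_*` names are made up here.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdatomic.h>

typedef struct {
    atomic_int count; /* threads currently holding or wanting the lock */
    sem_t sem;        /* stand-in for the heavyweight kernel semaphore */
} benaphore;

/* The semaphore has to exist up front, contended or not.
 * That per-lock allocation is the cost described above. */
static int benaphore_init(benaphore *b) {
    atomic_init(&b->count, 0);
    return sem_init(&b->sem, 0, 0);
}

static void benaphore_acquire(benaphore *b) {
    /* Userspace trick: only touch the semaphore if someone was there first. */
    if (atomic_fetch_add(&b->count, 1) > 0) {
        while (sem_wait(&b->sem) != 0) {
            /* retry if interrupted by a signal */
        }
    }
}

static void benaphore_release(benaphore *b) {
    /* If anyone queued up behind us, hand the lock over via the semaphore. */
    if (atomic_fetch_sub(&b->count, 1) > 1) {
        int rc = sem_post(&b->sem);
        assert(rc == 0);
        (void)rc;
    }
}

static int benaphore_destroy(benaphore *b) {
    return sem_destroy(&b->sem);
}
```

The uncontended acquire and release never reach the semaphore, which is the same fast path a futex-based lock has. The difference is that every Benaphore carries its semaphore for its whole lifetime, whereas a futex word is just four bytes of ordinary memory until somebody actually waits on it.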
> This part of the optimisation was well known, and had been for years before the futex was invented, so there's no need to provide it as a vDSO or do any other special work: userspace can (and was ready to) do that part anyway.
I guess, to me, the semantics of futex(2) on their own seem ill-formed without the while loop + cmpxchg being part of them. It feels like the user shouldn't have access to raw "low-level" calls to futex(..., FUTEX_WAIT), since there's only one useful way to use those calls, and that is in combination with other code which it is theoretically possible to get wrong. It's an API that doesn't cleanly encapsulate its own concerns.
I suppose I'm used to thinking with "building reusable crypto"-tinted glasses: with crypto libraries, you never want to expose a primitive that needs to be used a certain way to be sound, because people are inevitably going to use it wrong, and that's inevitably going to result in exploitation. Instead, you can just expose a higher-level primitive that inherently can only be used that one correct way, with no means to ask it to do anything else.
Of course, there's nothing inherently dangerous about calling futex(..., FUTEX_WAIT) without the associated code. (You just get a race condition. But a race condition in userspace doesn't corrupt the kernel or anything.) So I suppose this kind of thinking is meaningless here.
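For what it's worth, the kernel side of FUTEX_WAIT does carry part of the check itself: it only puts the caller to sleep if the word still holds the value the caller passed in, and otherwise fails with EAGAIN. A small sketch of that behaviour, assuming Linux and glibc's generic syscall(2); the function name is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Calls FUTEX_WAIT with an expected value that no longer matches the word,
 * as if another thread had changed it between our check and our syscall.
 * Returns the resulting errno. */
int wait_with_stale_value(void) {
    uint32_t word = 1; /* the word has "already changed" to 1 */
    errno = 0;
    long rc = syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE,
                      0 /* stale expected value */, NULL, NULL, 0);
    assert(rc == -1); /* a mismatched wait cannot succeed */
    (void)rc;
    return errno;     /* the kernel refuses to sleep on a stale value */
}
```

So the "check then sleep" step is atomic with respect to wakers. What stays in userspace is deciding what the word means and what to do after waking up, and that surrounding loop is the part callers can still get wrong.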
For a long time the user in fact did not have (easy) access to it. glibc started to provide a syscall wrapper only very recently; before that, you had to use syscall(2) directly.
The reason the CAS is not part of the interface, but is left to the implementor, is that the specific logic is very much application-specific, and in fact there is more than one way to use it. As discussed elsethread, mutexes are only one of the many applications of futexes.
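As one illustration of a non-mutex use, here is a sketch of a one-shot event: waiters sleep until a flag flips from 0 to 1, and there is no CAS anywhere, only a load and a store. It assumes Linux and glibc's generic syscall(2); `sys_futex`, `futex_event` and `run_event_demo` are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper over the raw futex syscall. */
static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

typedef struct { _Atomic uint32_t fired; } futex_event;

static void event_wait(futex_event *e) {
    /* Loop because FUTEX_WAIT may return spuriously or with EAGAIN. */
    while (atomic_load(&e->fired) == 0)
        sys_futex(&e->fired, FUTEX_WAIT_PRIVATE, 0);
}

static void event_set(futex_event *e) {
    atomic_store(&e->fired, 1);
    sys_futex(&e->fired, FUTEX_WAKE_PRIVATE, INT_MAX); /* wake every waiter */
}

static futex_event ready;
static int payload;

static void *consumer(void *out) {
    event_wait(&ready);
    *(int *)out = payload; /* ordered after the producer's write by the atomic flag */
    return NULL;
}

/* Starts a consumer, publishes a value, and returns what the consumer saw. */
int run_event_demo(void) {
    pthread_t t;
    int seen = -1;
    atomic_store(&ready.fired, 0);
    int rc = pthread_create(&t, NULL, consumer, &seen);
    assert(rc == 0);
    (void)rc;
    payload = 42;
    event_set(&ready);
    pthread_join(t, NULL);
    return seen;
}
```

The userspace half here looks nothing like a lock's, which is roughly why baking one particular CAS loop into the syscall interface would not fit every caller.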
What you described is exactly what pthread_mutex_lock() does, and it does it in user space. Application programmers don't deal with futexes directly; they call pthread_mutex_lock()/pthread_mutex_unlock().
Futex puts the thread to sleep and wakes it up. Accessing the OS scheduler requires kernel access anyway. The memory access of the flag can be done in user or kernel space.