I think you only need one queue per actor, and then one worker per CPU core? I believe that's how Erlang does it, and it handles millions of actors without any issues...
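
To make that concrete, here's a rough Python sketch of the "one mailbox per actor, one worker per core" idea. It is not how the Erlang VM is actually implemented: `Actor`, `Scheduler`, `send`, `drain` and `stop` are names made up for illustration, and a single shared run queue stands in for whatever the real runtime does. The point is only that an actor with pending messages is handed to some worker, and no actor is ever run by two workers at once.

```python
import os
import queue
import threading
from collections import deque


class Actor:
    """Hypothetical actor: a private mailbox plus a message handler."""

    def __init__(self, handler):
        self.handler = handler
        self.mailbox = deque()        # one queue per actor
        self.lock = threading.Lock()  # guards mailbox and scheduled
        self.scheduled = False        # True while queued to run or running


class Scheduler:
    """Hypothetical scheduler: a fixed pool of workers, by default one per core."""

    def __init__(self, workers=None):
        self.run_queue = queue.Queue()  # actors that have pending messages
        n = workers or os.cpu_count() or 1
        self.threads = [
            threading.Thread(target=self._work, daemon=True) for _ in range(n)
        ]
        for t in self.threads:
            t.start()

    def send(self, actor, msg):
        with actor.lock:
            actor.mailbox.append(msg)
            if actor.scheduled:       # already queued or running: nothing more to do
                return
            actor.scheduled = True
        self.run_queue.put(actor)

    def _work(self):
        while True:
            actor = self.run_queue.get()
            if actor is None:         # shutdown sentinel
                break
            with actor.lock:
                msg = actor.mailbox.popleft()
            actor.handler(msg)        # one message per turn, so no actor hogs a worker
            with actor.lock:
                more = bool(actor.mailbox)
                actor.scheduled = more
            if more:
                self.run_queue.put(actor)  # go to the back of the line
            self.run_queue.task_done()

    def drain(self):
        """Block until every message sent so far has been handled."""
        self.run_queue.join()

    def stop(self):
        for _ in self.threads:
            self.run_queue.put(None)
        for t in self.threads:
            t.join()


if __name__ == "__main__":
    totals = [0] * 1000
    sched = Scheduler()

    def make_handler(i):
        def handle(msg):
            totals[i] += msg          # safe: each actor is run by one worker at a time
        return handle

    actors = [Actor(make_handler(i)) for i in range(1000)]
    for a in actors:
        for _ in range(10):
            sched.send(a, 1)
    sched.drain()
    sched.stop()
    print(sum(totals))
```

An idle actor here costs only a small object and an empty deque, which is why the actor count can be far larger than the worker count. The single shared `run_queue` is the obvious contention point in this sketch.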
The way Erlang does it is to use buckets, so it looks like a single queue to user code but is really more like multiple queues behind the scenes. It scales extremely well. It's certainly not "just moving a pointer to a piece of shared memory", though...