A tip: if you have, say, 4 cores, then running 4000 threads will most likely be slow because of all the context switching. I say "most likely" because it depends on the details, but it's a safe guess.
That is part of the problem; if you have 4 cores, your program should be using 4 OS-threads. Your programming language's runtime should take care of distributing your 4000 lightweight/green threads to the 4 actual threads.
This is what e.g. Haskell does, and Go as well, I think.
> This is what e.g. Haskell does, and Go as well, I think.
Erlang does this by default. GHC >= 6.12 does it when the program is run with `+RTS -N -RTS`. Go requires setting GOMAXPROCS explicitly: the runtime defaults to single-threaded (as far as I know, GOMAXPROCS still hasn't been retired), and there is no way to have it auto-detect the core count.