I think you could do it with just one kernel-mode thread serving all processes, using one (or more) pages of memory per process/thread. The kernel thread can read all of the pages, but each page is otherwise accessible only to the process that owns it.
It looks like this is not what the article's implementation does, but I think it would be possible.
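
To make the idea concrete, here is a rough user-space sketch, not the article's code. It assumes a Unix-like system (`os.fork`, anonymous shared `mmap`), and the parent process only stands in for the single kernel-mode thread. Everything in it (`run_demo`, `N_WORKERS`, the flag-plus-payload layout) is made up for illustration. A real kernel would enforce the isolation through page tables; here each child merely unmaps the pages it does not own.

```python
import mmap
import os
import struct

PAGE_SIZE = mmap.PAGESIZE
N_WORKERS = 3  # hypothetical number of user processes


def run_demo(n_workers=N_WORKERS):
    # One anonymous page per worker; on Unix, mmap(-1, ...) defaults to a
    # shared mapping, so writes made by a forked child are visible here.
    pages = [mmap.mmap(-1, PAGE_SIZE) for _ in range(n_workers)]
    pids = []
    for i in range(n_workers):
        pid = os.fork()
        if pid == 0:
            try:
                # Child: drop every mapping except its own page, imitating
                # the "only the owner can touch its page" rule.
                for j, page in enumerate(pages):
                    if j != i:
                        page.close()
                # Post a request: a ready flag followed by a payload.
                struct.pack_into("<II", pages[i], 0, 1, (i + 1) * 100)
            finally:
                os._exit(0)  # never fall through into the parent's code
        pids.append(pid)

    for pid in pids:
        os.waitpid(pid, 0)

    # The single service thread (the parent here) scans every page.
    results = []
    for page in pages:
        ready, payload = struct.unpack_from("<II", page, 0)
        if ready:
            results.append(payload)
        page.close()
    return results


if __name__ == "__main__":
    print(run_demo())
```

The parent waits for the children before scanning, so there is no race in this toy version. A real service thread would poll or be woken up, and would need proper memory ordering between the flag and the payload.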