by binarycrusader 3392 days ago
I prefer the way Solaris solved this problem:

1) first, by eliminating the need for a context switch for libc calls such as gettimeofday(), gethrtime(), etc. (there is no public/supported interface on Solaris for syscalls, so libc would be used)

2) by providing additional, specific interfaces with certain guarantees:

https://docs.oracle.com/cd/E53394_01/html/E54766/get-sec-fro...

This was accomplished by creating a shared page during system startup in which the kernel keeps the time updated. At process exec time, that page is mapped into every process address space.

Solaris' libc was then updated to read directly from this memory page. This is more practical on Solaris because libc and the kernel are tightly integrated and system calls are not a public interface, but it still seems greatly preferable to the VDSO mechanism.

This is precisely what the vDSO does. The clocksources mentioned explicitly list themselves as not supporting this action, hence the fallback to a regular system call.

Not quite; vdso is a general syscall-wrapper mechanism. The Solaris solution is specifically just for the gettimeofday(), gethrtime() interfaces, etc.

The difference is that on Solaris, since there is no public system call interface, there's also no need for a fallback. Every program is just faster, no matter how Solaris is virtualized, since every program is using libc.

There's also no need for an administrative interface to control clocksource; the best one is always used.

Not quite. The vDSO provides a general syscall-wrapper mechanism for certain types of system call interfaces. It also provides implementations of gettimeofday, clock_gettime, and two other system calls completely in userland, and acts precisely as you've described.

Please see this[1] for a detailed explanation. For a shorter explanation, please see the vDSO man page[2]. Thanks for reading my blog post!

[1]: https://blog.packagecloud.io/eng/2016/04/05/the-definitive-g...
[2]: http://man7.org/linux/man-pages/man7/vdso.7.html

I'm aware of the VDSO implementation at a high level, but I would still say that the Solaris implementation is more narrowly focused and as a result does not have the subtle issues / tradeoffs that VDSO does.

Also, I personally find VDSO disagreeable, as do others, although perhaps not in as dramatic terms as some:

https://mobile.twitter.com/bcantrill/status/5548101655902617...

I think Ian Lance Taylor's summary is the most balanced and thoughtful:

Basically you want the kernel to provide a mapping for a small number of magic symbols to addresses that can be called at runtime. In other words, you want to map a small number of indexes to addresses. I can think of many different ways to handle that in the kernel. I don't think the first mechanism I would reach for would be for the kernel to create an in-memory shared library. It's kind of a baroque mechanism for implementing a simple table.

It's true that dynamically linked programs can use the ELF loader. But the ELF loader needed special changes to support VDSOs. And so did gdb. And this approach doesn't help statically linked programs much. And glibc functions needed to be changed anyhow to be aware of the VDSO symbols. So as far as I can tell, all of this complexity really didn't get anything for free. It just wound up being complex.

All just my opinion, of course.

https://github.com/golang/go/issues/8197#issuecomment-660959...

> Not quite; vdso is a general syscall-wrapper mechanism.

It's not. On 32-bit x86, it sort of is, but that's just because the 32-bit x86 fast syscall mechanism isn't really compatible with inline syscalls. Linux (and presumably most other kernels) provides a wrapper function that means "do a syscall". It's only accelerated insofar as it uses a faster hardware mechanism. It has nothing to do with fast timing.

On x86_64, there is no such mechanism.

> It's true that dynamically linked programs can use the ELF loader. But the ELF loader needed special changes to support VDSOs. And so did gdb. And this approach doesn't help statically linked programs much.

That's because the glibc ELF loader is a piece of, ahem, is baroque and overcomplicated. And there's no reason whatsoever that vDSO usage needs to be integrated with the dynamic linker at all.

I wrote a CC0-licensed standalone vDSO parser here:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....

It's 269 lines of code, including lots of comments, and it works in static binaries just fine. Go's runtime (which is static!) uses a vDSO loader based on it. I agree that a static table would be slightly simpler, but the tooling for debugging the vDSO is a heck of a lot simpler with the ELF approach.

This all seems predicated on the fact that Solaris doesn't support direct system calls and the fact that they ship their kernel and libc as one unified whole (like BSDs). Solaris is free to update the layout of their shared data structures whenever they want[1].

Because Linux kernel interfaces are distinct and separate from libc, and given Linus' policy on backwards compatibility, Linux had two choices for an _interface_: 1) export a data structure to userland that could never change, or 2) export a code linking mechanism to userland that could never change. In that light the latter choice seems far more reasonable.

[1] The shared data structures for this particular feature. There are other kernel data structures that leak through the libc interface and for which Solaris is bound to maintain compatibility.

The fallback isn't there because there's a public system call interface: the fallback is there because some of the kernel-side implementations of gettimeofday() (in particular, the Xen one) currently require the process to do a proper syscall.

This is separate from the fact that the gettimeofday() system call still exists too, which is a backwards-compatibility issue. The overwhelming majority of Linux applications do their system calls through libc too, so this doesn't affect them.

For those actually curious about the implementation on solaris/illumos, here's a quick rundown (from looking at current illumos source):

- comm_page (usr/src/uts/i86pc/ml/comm_page.s) is literally a page in kernel memory with specific variables that is mapped (usr/src/uts/intel/ia32/os/comm_page_util.c) as user|read-only (to be passed to userspace, kernel mapping is normal data, AFAICT)

- the mapped comm_page is inserted into the aux vector at AT_SUN_COMMPAGE (usr/src/uts/common/exec/elf/elf.c)

- libc scans auxv for this entry, and stashes the pointer it contains (usr/src/lib/libc/port/threads/thr.c)

- When clock_gettime is called, it looks at the values in the COMMPAGE (structure is in usr/src/uts/i86pc/sys/comm_page.h, probing in usr/src/lib/commpage/common/cp_main.c) to determine if TSC can be used.

- If TSC is usable, libc uses the information there (a bunch of values) to use tsc to read time (monotonic or realtime)

Variables within comm_page are treated like normal variables and used/updated within the kernel's internal timekeeping.

Essentially, rather than having the kernel provide an entry point & have the kernel know what the (in the linux case) internal data structures look like, here libc provides the code and reads the exported data structure from the kernel.

So it isn't reading the time from this memory page, it's using TSC. In the case of CLOCK_REALTIME, corrections that are applied to TSC are read from this memory page (comm_page).

This summary only applies to Illumos. The Solaris implementation diverged significantly around build 167 (2011) long after the last OpenSolaris build Illumos was based on (build 147). It changed again significantly in 2015.

I believe Circonus contributed an alternate implementation that does some of the same things as Solaris in 2016:

https://www.circonus.com/2016/09/time-but-faster/

With that said, you are correct that whether or not it will read from a memory page instead depends on which interfaces you are using (e.g. get_hrusec()) and other subtle details.

So the only things I'm seeing in the linked circonus code that differ from illumos:

1. no use of a kernel supplied page, determines skew/etc itself in userspace
2. stores information on a per-cpu level, and tries to execute cpuid on the same cpu as rdtsc.

I'm presuming you're talking about #2 (and #1 is just due to the linked item being a library without kernel integrations)? Perhaps with some more kernel support so that the actual cpu rdtsc ran on can be reliably determined?

This still doesn't clarify the part about "shared page in which the time is updated" and is read from. This statement appears to imply TSC is not (necessarily) used (otherwise I'd categorize it under "uses values from memory page to fixup TSC", like Illumos' current implementation). I'm still not sure how that can be done reasonably.

Is there just a 1-microsecond timer running whenever a user task is being executed that is bumping the value? Wouldn't that be quite a bit of overhead? Or some HW trick? I mean, you could generate a fault on every read, and have the kernel populate the current data, but that seems just as bad as a syscall.

You just described the old method Linux used, which was vulnerable to info leaks iirc, and that is why it is now a vDSO.

The Solaris method doesn't have the problem the other implementation did.

How does solaris find the page? If it's mapped to a fixed address then it does have that problem.

The default is to map the shared page to a randomized, available address within the process space.

libc gets the address of the page by looking it up in an auxiliary vector table that belongs to the process.

Sounds like the clock resolution would be limited to the tick interrupt in this case - how does it handle high resolution timers?