Hacker News new | ask | show | jobs
by gpderetta 658 days ago
Yes, the overhead in the logging thread is what this is trying to minimize. The background thread is considered "free". This sort of async logging is a common setup for some class of applications.

And yes, it boils down to writing data to a message queue. Most of the overhead is probably the call to the hardware timestamp counter.

1 comments

In my logging code I wrote that is basically a SPSC ring buffer, I use some RDTSC assembly and at startup I calculate the frequency and epoch offset. It has a throughput of around 30 cycles.

That's already ~10 ns, so I'm not sure how they are getting their numbers. If they are timestamping the data when the background thread gets to it that would be pushing even more work to it. It guessing they do or else they could potentially be logging out of order data with multiple threads.

In a similar SPSC ring buffer setup I see something like 1.8ns per 16B entry without RDTSC (our design has a cached thread-local tsc that is updated periodically, so this is vaguely reasonable) or 7.6ns per entry with RDTSC (approximate RDTSC cost of 4.8ns). This is on a VM on some sort of Skylake CPU (maybe Skylake X).
They are likely timestamping on the logging thread. Possibly they are just running at higher frequency.
You could store the current date and time in a global variable and have the producers just read it atomically. The consumer thread would then update it periodically.

Timestamps will be somewhat inaccurate but it may help performance.

that's what the vdso version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means like 15 milliseconds of log messages will all have the same timestamp.

The fastest for nanosecond precision (bonus is this is even sub nanosecond) is just to store the return value of RDTSC and let the background thread figure it all out. You don't even need to precalcuate the freq or epoch offset. Just write a couple logging messages of the rdtsc value and CLOCK_REALTIME and let the post processing figure it out.

To cut down on I/O each log message's timestamp can just be an offset from the last even.

If you are willing to push a lot of work to the background thread and even more to the post processsing step, you really don't need to do very much.

> hat's what the vdso version of clock_gettime does. If you use one of the *_COARSE clocks it will only update periodically and be much faster, but that means like 15 milliseconds of log messages will all have the same timestamp.

Not sure it matters a lot of to have multiple messages with the same timestamp, since they were added in order you still know which one is older, the problem might arise when you send those logs to a remote place and the order of insert is discarded and the timestamp is used instead.

I assume that when you use a single thread with a queue / ring buffer the order of insertion is kept.

FWIW, I have relied on the accuracy of log timestamps well into the low microseconds.

If you have an event loop, it might be acceptable to sample the counter once per iteration, but less than that it becomes too lossy.

probably in this case it's important to use some kind of synthetic timestamping to preserve the ordering (for example, for a 5.123 [ms] timestamp one can add ~1000000 ns timestamps, so let's say are a thousand entries that need to be ordered, one can then represent them as 5.123000[n] ... and the "000" part is just a silly in-band way to give a hint to someone who will later examine the logs)
since you aren't going to be writing a message per nanosecond, you can always just do `last nanos = max(last nanos + 1, cur nanos)` and then use last nanos for the timestamp. you can even do it in rdtsc ticks and get 1/3 of nano values. Obv the clock isn't nearly that accurate, but it lets you use those fractional nanos to ensure a strictly increasing ordering.
If rdtsc is a cycle counter then this also assumes a constant clock rate.
rdtsc has been "invariant" for a long time.
Oh TIL, I always thought it was a plain cycle counter
It was at first. Until it became obvious that it was useless like that.

You had the OS down-clocking cores to avoid CPUs cooking themselves, causing time to increase an inconsistent rates. Your app queried the clock rate, but unless it kept querying it, the number of cycles was meaninglesss.

ie:

  (CPU rate changed to 1GHz)
  Query Rate, 1GHz.
  (CPU rate changed to 700MHz)
  Read start time
  Read end time
  convert delta into time, using 1GHz rate... not accurate.
or

  (CPU rate set to 1GHz)
  Read start time
  Read end time
  (CPU rate set to 700MHz)
  Query rate, 700Mhz.
  convert delta into time, using 700MHz rate... not accurate.
Next you had multicore, and the OS would down-clock or turn off some idle cores, again, perhaps they were throttled to avoid overheating. Again, the clock rate would change dynamically, but per core. Not only did time not increase at a constant pace, but if a thread jumped between cores, time could appear to go backwards!

It was a mess. Windows introduced a call to get time in a safe way, as they knew when they changed CPU frequencies. Intel fixed the hardware to use a constant clock after a just a few models, but the damage is done... It's invariant...usually, but not for certain.

> It's invariant...usually, but not for certain.

You can verify it's invariant through one of the cpuid leaves. There's a feature bit.