Hacker News new | ask | show | jobs
by jjk166 1063 days ago
> replace the low bits with the hash of them, concatenated with the value of the CPU cycle counter (`RDTSC` on x86)

you're concatenating two values and then taking the hash of the combination, ie:

hash(low bits + CPU cycle counter)

2 comments

But a hash() destroys information.

When you are trying to reduce collisions, why would you use a function that is known to introduce collisions?

Because the raw value from the timestamp will be very low entropy and have the short scale variation concentrated in just a few bits. A hash not just destroys information, it also creates entropy by mixing that information over all the bits that come out of it. And using 64 bit hash replacing a 64 bit nanosecond counter that has short term entropy of less than 16 bits, you're in fact reducing the likelihood of a collision by a factor of 2^48.
The hash, in this case, is just one deterministic way to shorten the CPU counter to few bits which can then be used to increase the entropy of the timestamp by replacing the timestamps stale bits. What's being asked here is not why use some compressed bits of the CPU counter increases the entropy of the timestamp overall. Rather, why you'd use a hash of the entropy providing information to do this since hashes allow for many:1 mappings (i.e. allow for decreases in entropy) while never providing better than 1:1 mappings (i.e. preserving entropy). At best, the hash is preserving the entropy of the timestamp and counter bits and, at worst, it is throwing some away. The question that follows: is there a better way that preserves the entropy of the inputs without the risk of throwing some away and, if so, was there some other reason to still use the hash? This is what I took amelius to be asking.

Knowing the timestamp delta is ~30ns then even a 1 THz processor would only execute 30,000 cycles before said timestamp increases and the outcome stays unique anyways. From that perspective, taking the low 16 bits of the cycle counter is already guaranteed to help produce perfectly unique timestamps, and without the risk of introducing hash collisions. It's also significantly easier to compute and now monotonic* now, whereas hashes were neither. From that perspective, I too don't see what value the hash is supposed to add to the information being fed to it in this case.

In threaded/distributed/parallel scenarios you may wish to throw the lane/core/node number in the bits as well. In the case having the same timestamp is supposed to be invalid this leaves you a simple deterministic way to tiebreak the situation as part of the creation of the timestamp itself.

*A 10 GHz CPU running for a little over 58 years could risk a 64 bit counter rollover in the middle of a time delta. If that's too close for comfort or you want the code to work on e.g. 32 bit counter systems, you'd need to eat more cycles and registers to set another bit to the outcome of whether current cycle count is < the one at the start of the current timestamp delta.

Yeah, when I was thinking about the hash, I thought of it as stuffing to fill the unused portion of the number that would look better than using zeroes.

TIMESTAMP010111101010101TSC "looks better" than TIMESTAMP000000000000000TSC even though they contain the same information.

I would drop the hash, it's deceptive.

I don't believe it breaks the monotonicity, though? I mean, it would if it weren't already broken. If you're taking the low 16 bits of the TSC, then a rollover in those 16 bits during the same timestamp will already go backwards. TIMESTAMP0...0 follows TIMESTAMPf...f.

I guess it depends which portion you look at. If you solely look at the time based portion you do indeed still have a function which never decreases but that is true even in the case of reading the raw counter whole on its own anyways. If you look at the whole value, including the hashed portion, it's no longer monotonic.

In the cycle based case looking at the whole value is the same thing as looking at a relative time stamp which has more precision that the system clock. In this way, it's "truly" monotonic across the entire value, not just monotonic on a part and unique in another.

Side topic: It also comes with an even stronger guarantee of always increasing instead of just "not changing direction". Is there a name for that?

> it also creates entropy

It's a nitpick, but it concentrates the entropy. It doesn't create any.

I do think it will make the answer more clear, as the hash concentrates the less than 64 bits of entropy on that 128 bits of original data into a usable 64 bits package.

Actually hashes do create entropy (every computation creates entropy in some form or another). What's the entropy of a 4 bit number? What's the entropy of a 4 bit number hashed by a 64 bit hash function? The act of computation does in fact create entropy, as per the 2nd law of thermodynamics, a part of which shows up in the hash.
I don't think you understand what this conversation is about. We are considering information theoretic entropy, not thermodynamic entropy from the mechanism of computation itself.

The result of applying a deterministic function on a random variable cannot have more entropy than the underlying random variable. This is a theorem, one that is trivial enough to not have a name. But you can find solution sets to homework that will prove it for you: https://my.ece.utah.edu/~rchen/courses/homework1_sol_rr.pdf

> every computation creates entropy in some form or another

Ok, what is the entropy created by this function that maps a 4-bit number to a 64 bit number:

    0 -> 0
    1 -> 1
    2 -> 1
    3 -> 1
    4 -> 1
    ...
    15 -> 1
60 bits. Yes, I know, you can compress it down very well. But consider that entropy in computation involves not just the bits you store, but also the bits that the processor touches and eventually dissipates as heat into the universe.
Why?

    low bits + CPU cycle counter
is enough. No need of the hash()
You don't know precisely at which frequency the cycle counter runs. Depending on the system load it might either run faster or slower than the lowest bits the HPTC. For what it's worth this part is more or less nondeterministic, so the sane thing to do, is spread out the information as much as possible (maximize entropy), in order to minimize the probability of collisions.
That's ok, the bits past the low bits are just there to avoid collisions, not an actual measure of high precision time beyond the low bits.

It's not worse than the hash solution, I'm just saying it's not necessary to hash it if the only objective is to reduce collisions.

In fact the hashing solution, if it is replacing the low bits with a hash of low bits plus something else, is actually destroying valuable time information.

That only works, if you know exactly, that the low bits are constant. Otherwise you may run into the issue that due to unsteady rate of RDTSC between two processes/threads that may be preemptively unscheduled between reading the HPTC and the RDTSC you might again end up with colliding time stamps. And if you took the derivative between timestamps taken in succession, you might even find is to be non-monotonic in places.

The combination of multiple counters incremented by individual unsteady clocks used to be a source for pseudo random scrambler sequences; these days we prefer LFSRs, but overall this is something that can be weird.

Hence my recommendation: Just throw xxHash32 on concatenation of the HPTC's low bits and the CPU clock cycle counter, and forgo any pretense of monotony in the low bits (because very likely you don't have it anyway).