The entropy of a single signal, say a sequence of letters like "ababababab", is the scaled "average" surprise per letter. So if they are uniformly distributed, each letter is equally likely/unlikely to come next in the sequence, whereas if instead one letter occurs only 1/1000th of the time (aaa....aaa...aa..a.z.aaaa), then when the rare beast shows up, it is a big surprise, so the total amount of surprise available in the sequence is high.
That's entropy.
The same thing would be true for a sequence of numbers.
But what if there is some relationship? What if aaabaa occurs frequently with 111211 when you line up the sequences by timestamp?
In this simple case, if you know the letters and you can spot the relationship, then there is zero surprise in the number sequence. The cross entropy "letters plus numbers" has the same entropy as "letters" or "numbers" in isolation.
And as you move away from the 1:1 correspondence, you'll see the cross entropy increase until it reaches its max at "entropy(letters) + entropy(numbers)" -- no information shared between the two systems.
To bring it home, I think of cross entropy as the amount of information shared between two signals.
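A quantity that behaves the way this comment describes (equal to the entropy of one signal under a perfect 1:1 correspondence, rising to "entropy(letters) + entropy(numbers)" when nothing is shared) is what is usually called the joint entropy H(X, Y); the shared part, H(X) + H(Y) - H(X, Y), is usually called mutual information, while "cross entropy" more commonly compares two distributions over the same alphabet. Here is a rough sketch of the two extremes, assuming empirical frequencies as probability estimates. The function names, the a -> 1 / b -> 2 mapping, the seed, and the sample size are all made up for illustration.

```python
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol, from observed frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def joint_entropy_bits(xs, ys):
    """Empirical entropy of the time-aligned pairs (x, y)."""
    return entropy_bits(list(zip(xs, ys)))

random.seed(0)  # arbitrary seed, only so the run is repeatable
letters = [random.choice("ab") for _ in range(10_000)]

# Extreme 1: numbers fully determined by letters (hypothetical mapping a -> 1, b -> 2).
coupled = [1 if ch == "a" else 2 for ch in letters]

# Extreme 2: numbers drawn independently of the letters.
independent = [random.choice([1, 2]) for _ in letters]

h_letters = entropy_bits(letters)
h_independent = entropy_bits(independent)
h_joint_coupled = joint_entropy_bits(letters, coupled)
h_joint_independent = joint_entropy_bits(letters, independent)

print(f"H(letters)                  = {h_letters:.3f} bits")
print(f"H(letters, coupled)         = {h_joint_coupled:.3f} bits")
print(f"H(letters, independent)     = {h_joint_independent:.3f} bits")
print(f"H(letters) + H(independent) = {h_letters + h_independent:.3f} bits")
```

In the coupled case the pairs carry no more surprise than the letters alone; in the independent case the joint value comes out close to the sum of the two individual entropies, so the estimated shared information is near zero.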
> if instead one letter occurs only 1/1000th of the time (aaa....aaa...aa..a.z.aaaa), then when the rare beast shows up, it is a big surprise, so the total amount of surprise available in the sequence is high
…when a Bernoulli distribution is skewed, the maximum surprise is high, yes, but the average surprise (= entropy) is low. The entropy of a Bernoulli distribution is maximized when p = 0.5 and falls off to zero toward either end (p → 0 or p → 1).
For your examples, if the sequence is uniformly distributed (Bernoulli(1/2)), the entropy is ln(2) ≈ 0.693 nats (exactly 1 bit) per symbol; if instead one letter occurs 1/1000th of the time, the entropy is about 0.0079 nats (about 0.0114 bits) per symbol.
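These figures are easy to check numerically. Below is a minimal sketch, assuming a memoryless two-symbol (Bernoulli) source; `bernoulli_entropy` is a name chosen here for illustration. Note that ln(2) ≈ 0.693 is the value in nats (natural log), which corresponds to 1 bit in base 2.

```python
import math

def bernoulli_entropy(p, base=math.e):
    """Entropy of a Bernoulli(p) source: base e gives nats, base 2 gives bits."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -(p * math.log(p, base) + (1 - p) * math.log(1 - p, base))

for p in (0.5, 0.001):
    nats = bernoulli_entropy(p)
    bits = bernoulli_entropy(p, base=2)
    print(f"p = {p}: {nats:.4f} nats = {bits:.4f} bits per symbol")
```

The rare symbol is individually very surprising (about 6.9 nats when it appears), but it shows up so seldom that it contributes little to the average.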
Others might think of it slightly differently.