by kibwen 3366 days ago
A bit isn't a divisible unit, so nobody is going to be confused as to whether "mb" stands for "millibits". As for the capitalization of "b", if we're going to be pedantic then we should say "Mo" for "megaoctets", since "byte" is, as far as IEEE standards are concerned, of ambiguous length. But I think we can trust people enough to not spend too long puzzling over whether "mb" means megabytes or megabits, just as we can trust them to assume that byte implies eight bits in this context.
6 comments

"A bit isn't a divisible unit"

What? Sure it is. When you measure the information content (or the entropy) of a message, you very frequently get non-integer numbers of bits per (character/unit/message/whatever).

Written English, for instance, has 1.46 bits of information per character.

Or 146 cb.

That is expressing a ratio. A more concrete example: 0.46 of a person doesn't exist, even if that ratio is useful for expressing statistics.
I think "gaining half a bit of information" about something can correspond to stuff that lets you update your probability distribution about it?

I'm not sure.

yes, but your gain is still an integer number. the non-integer ratio is just a statistic (gain per number of units).
Why is it an integer number?

Entropy (in bits) is defined as - \sum_x (p(x) log_2 p(x))

There is no reason this has to be an integer, since probabilities are not restricted to being reciprocals of powers of 2.

Consider also that you can simply use a different logarithm base to get a different unit (e.g. use the natural logarithm to obtain the entropy in nats). It would be bizarre if the arbitrary choice of 2 as the base gave a unit that was indivisible.

I think this whole confusion comes down to the difference between a bit as a "unit of information in the sense of information theory" [divisible] and a bit as a "single physical one or zero" [not divisible]. The relationship between the two is that the entropy of a random variable is a lower bound on the average number of bits required to represent it.
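A minimal sketch of the entropy formula quoted above, to show that the result need not be an integer. The function name `entropy_bits` and the two coin distributions are illustrative assumptions, not anything from the thread.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: -sum_x p(x) * log2(p(x))."""
    # Outcomes with p == 0 contribute nothing, so skip them (log2(0) is undefined).
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]  # hypothetical distribution, chosen only for illustration

h_fair = entropy_bits(fair_coin)
h_biased = entropy_bits(biased_coin)
h_biased_nats = h_biased * math.log(2)  # same quantity expressed in nats

print(f"fair coin:   {h_fair:.3f} bits")
print(f"biased coin: {h_biased:.3f} bits = {h_biased_nats:.3f} nats")
```

The fair coin comes out at one bit per flip, while the biased coin carries a bit less than half a bit per flip, even though each flip still takes one binary digit to write down naively.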

yes, you are quite right -- i was referring to the binary digit rather than the shannon
only when you consider bits to be the final, indivisible, fundamental unit of information.

which they aren't.

if you have a data storage thingy that can store any of three values, a ternary digit, it is exactly equivalent to log2(3) = ln(3) / ln(2) ~= 1.585 bits.

kind of like how US pop-science articles like to say stuff like "a volume of 1.5 Olympic-size swimming pools" (because a megagallon is just weird), even though obviously you can never have half of such a pool or it would empty.

(ok after some consideration, you could have the bottom half)

you are obviously right, but i think that in the specific case described above -- computer code -- we have binary digits as final and indivisible units.
Filling the top half would be more fun

https://what-if.xkcd.com/6/

Or to express it another way, a variable with 3 possible states has 0.5 bits more capacity than one with 2.
0.585... bits more capacity, but yeah (bits = log2(states)).
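The bits = log2(states) relation is easy to check numerically. This is only a sketch; `capacity_bits` is a made-up helper name, and the five-trit packing is an extra illustration that is not from the thread.

```python
import math

def capacity_bits(states):
    """Information capacity, in bits, of a variable with `states` equally likely values."""
    return math.log2(states)

binary = capacity_bits(2)
ternary = capacity_bits(3)
extra = ternary - binary  # how much more a ternary digit holds than a binary one

# Five ternary digits have 3**5 = 243 states, so they fit in one 8-bit byte.
bits_for_five_trits = math.ceil(5 * ternary)

print(f"binary digit:  {binary:.3f} bits")
print(f"ternary digit: {ternary:.3f} bits (+{extra:.3f})")
print(f"five ternary digits need {bits_for_five_trits} binary digits")
```

A fractional bit shows up the moment you pack several ternary digits into binary storage: the rounding up happens once for the group, not once per digit.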
> you very frequently get non-integer numbers of bits per (character/unit/message/whatever)

That'd be an average, and it's like when you have 2.58 people per household. Presumably most people do not keep about 6/10 of a person around.

The point of a bit (in information theory) is that it is the smallest possible (read: not divisible) unit of information.

> But I think we can trust people enough to not spend too long puzzling over whether "mb" means megabytes or megabits

Please don't assume this. I have the great pleasure of working with network engineers, who have apparently globally decided that bits are a perfectly reasonable measurement of throughput and react very differently to speed in Mb/s and MB/s. I'm not trying to be pedantic or say that this is how it should be; I'm just saying that people really do use both units, it is horribly confusing, and anything you can do to not be ambiguous is appreciated.

Well, no. If we're going to be pedantic, we should say MiB (mebibytes), because file sizes on disk are expressed in multiples of powers of 2, but the SI prefixes are multiples of 10.

So a 1 megabyte file (as reported by the file system) is actually 1048576 bytes, which technically - sorry, I mean pedantically - speaking, is 1 mebibyte.

To make matters worse, disk manufacturers use the decimal prefixes, so our nice 1 terabyte drive is 931 mebibytes, but is reported by the file system as 931 megabytes (not MiB).

Finally, memory manufacturers use the binary prefix, so 1 megabyte of RAM is actually 1 mebibyte (1048576 bytes).

A bit of a mess, no?

All the above is, IMHO, a consequence of imprecision. If we get used to being loose with our terminology, we risk carrying that attitude over into our work product, with sometimes regrettable results.

So I'll continue to strive to be pedantic (translation: precise).

> Well, no. If we're going to be pedantic, we should say MiB (mebibytes), because file sizes on disk are expressed in multiples of powers of 2,

Not true anymore. OS X (and I assume iOS) reports sizes in power-of-10 units.

If you think about it, it is really user-hostile to express file sizes as powers-of-two. Who can remember that a "GiB" is 1073741824 bytes?

I didn't know OS X used the decimal prefixes, but that just means it's less true, not untrue. There are still many more systems out there that use the binary prefix. I imagine most *nix, and I'm not sure about Windows. And RAM is still power of 2.

I don't think it's terribly user hostile to express sizes as powers of two when you work with these kinds of numbers for a living, especially when it's near the bare metal (Erlang binary data type FTW!)

But I do think it's user hostile to have two different units depending on what you're looking at. If it were all decimal or all binary, it would be much easier.

You probably meant 1 TB to be 931 gibibytes, didn't you?
Ah crap, yes, I did. Got to stop this middle of the night posting...
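For anyone who wants to check the arithmetic behind that correction, here is a quick sketch. The constant names are just illustrative; the values are the standard SI and binary prefix definitions.

```python
# Decimal (SI) prefixes, as used by disk manufacturers
MB = 10**6
TB = 10**12

# Binary (IEC) prefixes
MiB = 2**20
GiB = 2**30

one_tb_in_gib = TB / GiB   # what a "1 terabyte" drive holds, in gibibytes
one_tb_in_mib = TB / MiB   # the same drive, in mebibytes

print(f"1 MiB = {MiB} bytes")
print(f"1 GiB = {GiB} bytes")
print(f"1 TB  = {one_tb_in_gib:.2f} GiB = {one_tb_in_mib:.1f} MiB")
```

So the "931" figure is gibibytes, and the gap between the decimal and binary prefix grows with each step up: about 4.9% at mega, about 7.4% at giga.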
While we're at it, one should use “Mi”, not “M”. M is still 1000 * 1000, while Mi is 1024 * 1024.

That amount of sloppiness in any other engineering discipline would just finish you off immediately.

Agreed. Fun fact: Wolfram Alpha understands mebibytes, which is useful if you want to quickly convert between networking specs (say megabits) and "real" computer units.

Then again, maybe doing more simple math by actually using one's brain wouldn't hurt either. :)

Edit: And yes, I'm aware that you'll never get the converted speed of what is written on the network device's box. But sometimes it's nice to have an upper limit you can compare to at least.

Many, many wire protocols use 5 bits of bandwidth to send 4 bits of information, for various reasons. So dividing by 10 gives you a better estimate.

Of course, when gigabit became a thing, your practical throughput was more like 75 MBps for a very long time, and being off by 25% in capacity planning is a pretty big error (one I've seen numerous engineers make, and a few make both errors, which means you're off by 40%).
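A rough sketch of where the 25% and 40% figures come from, assuming a nominal 1000 Mb/s link, a line code that spends 10 wire bits per payload byte (the 5-for-4 case above), and the 75 MBps observed figure quoted in this comment. The function name and its parameter are made up for illustration.

```python
def megabytes_per_second(megabits_per_second, wire_bits_per_byte=8):
    """Convert a link rate in Mb/s to MB/s, given how many wire bits carry one payload byte."""
    return megabits_per_second / wire_bits_per_byte

link_rate = 1000   # nominal gigabit link, in Mb/s
observed = 75      # practical throughput in MB/s, as quoted in the comment

naive = megabytes_per_second(link_rate)              # divide by 8
with_line_code = megabytes_per_second(link_rate, 10) # divide by 10

# Planning error relative to each estimate
error_with_line_code = (with_line_code - observed) / with_line_code
error_naive = (naive - observed) / naive

print(f"divide by 8:  {naive:.0f} MB/s, off by {error_naive:.0%}")
print(f"divide by 10: {with_line_code:.0f} MB/s, off by {error_with_line_code:.0%}")
```

Dividing by 10 gets you to 100 MB/s and a 25% overestimate; dividing by 8 gets you to 125 MB/s and a 40% overestimate.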

After using many Unix tools that have this convention, I'm ok with 10 M referring to 10 * 1024 * 1024 bytes (10 MiB), contrasted with 10 MB meaning 10,000,000 bytes.
macOS (we are talking about Xcode) is using the SI definition though. 10 mega should be 10 million.
Same with HD capacity info.
Ah, I didn't see your comment before I wrote mine. And I agree violently with your point of view.
This is gatekeeping, since the message is coded to be obvious to those "in the know" (of course mb means mebibyte!) but is a barrier to those who are trying to learn more (mb probably means megabits per second? it's a unit for measuring download speed? why is the 's' left off?)

This is putting the burden of collaboration in the wrong place: it shouldn't be a question of, can we expect a reasonable engineer in the industry to understand this unambiguously (with some deductions); but rather, can I hold down the shift key when typing the abbreviation for megabytes.

Obviously this depends on the actual audience; don't bother following this in team chat, where speed is more important than clarity.

It is divisible in information and coding theory, where it makes perfect sense.