by Lazare 2088 days ago
The concept of ULID is interesting, but the spec is a bit weird[1]. If you want the benefits of ULID, I'd highly suggest checking out KSUIDs:

https://github.com/segmentio/ksuid

https://segment.com/blog/a-brief-history-of-the-uuid/

Same advantages of ULIDs, but I prefer the base62 to the base32 encoding (more compact; no need to bikeshed about upper versus lower case), it's been tested at scale, and the decisions made are sensible.

[1]: Specifically, they try and guarantee absolute monotonicity. The way they do this is that if you ever try and generate more than one ULID per millisecond, you increment the least significant bit of the random component. In other words, we have a key that's basically <timestamp>-<random-int>, and if you generate more than one key per timestamp, you just increment the random number. If the random number would overflow, by the spec, you have to just throw an exception; no wraparound. There's a lot of issues here. For one thing, none of this can possibly work if you're generating your IDs in a distributed fashion; it assumes a single, central, consistent key generator. For another, our key generator now has state, since it needs to know if any keys have been generated earlier, and if so, what they were. Doable, but...potentially a lot of work depending on your environment. Also, why are we even trying to force strict monotonicity? What does that possibly gain you? Why would we want a spec that, by design, has a chance of sometimes not letting you generate a key? The whole thing feels like the result of someone really wanting an auto-incrementing primary key, hearing that UUIDs were cool, and trying to make an auto-incrementing primary key that looks like a UUID, ending up without the advantages of either. Of course, you could ignore the spec (and several implementations do), but at this point it's worth asking what you're gaining from ULID. It's a weird feature that basically only works if you don't need it (since realistically, anyone generating many keys per millisecond would of course need to generate the keys in a distributed fashion).
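
For what it's worth, here's roughly what the monotonic scheme described above looks like as code. This is a minimal Python sketch under my own assumptions, not taken from any real ULID library: `MonotonicUlidGenerator`, `encode_ulid` and the injectable `clock` parameter are names made up for illustration. Note the state it has to carry between calls.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U), in ascending ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
MAX_RANDOM = (1 << 80) - 1  # largest value of the 80-bit random component


def encode_ulid(timestamp_ms: int, randomness: int) -> str:
    """Encode a 48-bit timestamp and an 80-bit random value as 26 characters."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # take 5 bits at a time
        value >>= 5
    return "".join(reversed(chars))


class MonotonicUlidGenerator:
    """Hypothetical single-process generator; illustrative only."""

    def __init__(self, clock=lambda: time.time_ns() // 1_000_000):
        self._clock = clock      # returns the current time in milliseconds
        self._last_ms = -1       # state: timestamp of the previous ID
        self._last_random = 0    # state: random component of the previous ID

    def generate(self) -> str:
        now_ms = self._clock()
        if now_ms == self._last_ms:
            # Same millisecond: increment instead of drawing fresh randomness.
            if self._last_random == MAX_RANDOM:
                raise OverflowError("random component exhausted for this millisecond")
            self._last_random += 1
        else:
            # New millisecond: draw 80 fresh random bits.
            # (This sketch does nothing special if the clock moves backwards.)
            self._last_ms = now_ms
            self._last_random = int.from_bytes(os.urandom(10), "big")
        return encode_ulid(self._last_ms, self._last_random)
```

Two processes each running their own instance share none of this state, which is why the ordering guarantee stops at the process boundary.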


Thanks for the alternative (KSUIDs). I took a quick look at it, and from what I can tell, if you aren't generating thousands of IDs per second (or more), they achieve the exact same result. [0]

I take the argument that ULID won't work under extreme and harsh conditions as proof that it's just fine for the many of us who simply do not work on systems with that kind of load or requirements.

I appreciate seeing the weaknesses of ULID, as this helps me choose whether or not I can live with them. (which I can)

Again, thanks for the detailed reply, it was very helpful.

[0] https://github.com/segmentio/ksuid/issues/8

Yeah. ULID will work. It's just that there's a lot of small annoyances and quirks, and no real advantages. If you've already adopted it, it's probably not worth changing, but I can't think why you'd select it given a choice for greenfield development.

What are the "annoyances and quirks" you are referring to?

I hate programming myself into a corner only to figure this out much later. So I appreciate any serious feedback on this.

I described the advantages for me in my first post. Maybe they are situational?

I have no issues adopting a new UID system, but so far what everyone has described as problems don't apply to the work I am doing.

Right, your initial post said:

> ULIDs are sortable (time component), short (26 chars) and nearly human readable, and good enough entropy/randomness for everything I'd ever be working on.

1) Being roughly sortable is a valuable characteristic if used as a DB index in most DB systems, and ULIDs meet this requirement (as do others). However, they are NOT strictly time sortable except under very specific conditions. If you need that, you should avoid ULIDs (because the design is broken). But if you don't, then...well, ULIDs will work fine for you! Of course, so would other options. :)

2) 26 characters is fine. However, the use of base32 is a bit unfortunate. Some implementations generate lowercase IDs; others generate uppercase. The spec officially recommends generating uppercase IDs, but if you're not careful (and are generating them on different platforms), you may end up with a mix of both, and naive lexical sorting will yield incorrect results. You need to normalise all your IDs; doable but overhead. Conversely, if a denser encoding was chosen (like base62 or base58), you could either get the same entropy in a shorter string, or more entropy in the same string, but ALSO not need to worry about case. Win/win, from my point of view.
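
To make the case problem and the density point concrete, here's a small Python sketch. The two IDs are just illustrative values (one deliberately lowercased), and `chars_needed` is a helper name I made up.

```python
import math

# Illustrative values: the older ID was emitted in lowercase by one
# platform, the newer one in uppercase by another.
older = "01arz3ndektsv4rrffq69g5fav"
newer = "01BX5ZZKBKACTAV9WEVGEMMVRY"

# Naive lexical sort: uppercase letters sort before lowercase in ASCII,
# so the newer ID wrongly comes first.
naive = sorted([older, newer])

# Normalising the case before comparing restores time order.
normalised = sorted([older, newer], key=str.upper)


def chars_needed(bits: int, base: int) -> int:
    """Characters required to hold `bits` bits in the given base."""
    return math.ceil(bits / math.log2(base))


print(naive[0] == newer)       # naive sort puts the newer ID first
print(normalised[0] == older)  # normalised sort puts the older ID first
print(chars_needed(128, 32), chars_needed(128, 62), chars_needed(160, 62))
```

So 128 bits takes 26 characters in base32 but only 22 in base62, and 27 base62 characters is enough for 160 bits.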

3) Human readable...eh, not sure "01ARZ3NDEKTSV4RRFFQ69G5FAV" is very human readable, but okay. I'd score it no better or worse than a normal UUID (like "fe9c8a21-61c1-44dd-b552-95616d62404d") or a KSUID (like "0ujssxh0cECutqzMgbtXSGnjorm").

4) ULIDs are fine in terms of entropy. Again, this doesn't set them apart from others.

Apart from the question of being strictly sortable (if you need that, you need a different tech), ULIDs are fine. But they're also not unique. Some people re-order a v1 UUID so it starts with a timestamp, and obtain the same benefits. Others use KSUIDs and obtain the same benefits. Others roll their own "slam a timestamp and some random data together and base whatever encode it" and...obtain the same benefits.
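
As a rough illustration of the roll-your-own approach, here's a minimal Python sketch: a 4-byte timestamp in seconds, 16 random bytes, base62 encoded to a fixed width. The layout and the name `make_id` are my own assumptions for the example, not the KSUID or ULID format.

```python
import os
import time

# Base62 alphabet in ascending ASCII order, so fixed-width strings sort
# the same way as the numbers they encode.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
WIDTH = 27  # 62**27 > 2**160, so 27 characters always fit 20 bytes


def make_id(now_s=None, random_bytes=None) -> str:
    """Build a stateless, roughly time-sortable ID (illustrative layout)."""
    ts = int(time.time()) if now_s is None else now_s
    rnd = os.urandom(16) if random_bytes is None else random_bytes
    # 4-byte big-endian Unix seconds followed by 16 random bytes.
    raw = ts.to_bytes(4, "big") + rnd
    n = int.from_bytes(raw, "big")
    chars = []
    for _ in range(WIDTH):
        n, rem = divmod(n, 62)
        chars.append(BASE62[rem])
    return "".join(reversed(chars))
```

There's no state and no coordination here: IDs from different machines sort by second, and within a second the order is random.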

It's cool to have an ID which is roughly sortable by time, is safe to generate on multiple systems, and can be encoded as a relatively short alpha-numeric string, and I think a lot of people can benefit from this. Certainly I've found uses for this, which is why I've opted to use KSUID.

...that being said, ULID also works for this. It's just...odd. Like, 80 bits of randomness per millisecond is fine! But actually, if you follow the spec and generate them on a single system, only the first ULID is random; the rest just increment until you run out of room, which means you can generate a random number of ULIDs per millisecond. Anywhere between a bit over one septillion and one. Of course, the odds of getting a small amount of room are low, so as a practical matter it's fine. But it's odd that, if you ever did try and generate several ULIDs at once, there's just a random chance it might not work! Of course, you can just wait a millisecond and try again, so no big deal.
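
Back-of-the-envelope version of that arithmetic, as a Python sketch. `ids_available` and `chance_of_at_most` are names I made up, and this assumes the first random value in a millisecond is drawn uniformly from the full 80-bit range.

```python
RANDOM_BITS = 80
SPACE = 1 << RANDOM_BITS  # 2**80, a bit over 1.2 septillion


def ids_available(first_random: int) -> int:
    """IDs a spec-following generator can emit in one millisecond,
    given the first random value it happened to draw."""
    return SPACE - first_random


def chance_of_at_most(k: int) -> float:
    """Probability that a given millisecond has room for at most k IDs."""
    return k / SPACE


print(ids_available(0))          # best case: the whole 80-bit space
print(ids_available(SPACE - 1))  # worst case: a single ID
print(chance_of_at_most(1_000_000))
```

The chance of having room for at most a million IDs in a given millisecond comes out below one in 10^18, which is why this is a curiosity rather than a practical problem.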

Of course, that opens another issue. If you're following the spec and generating more than one ULID per millisecond, then once I have one of your ULIDs I know the others generated in that millisecond! For many people a major attraction of IDs like this is that they're unguessable even if you have seen a bunch, but a ULID is actually, in theory, open to enumeration attacks. That's absolutely crazy! If you managed to make a system you don't control generate a ULID for you at the same millisecond it generates one for your target, you can now trivially guess the target's ULID. If you were relying on that ULID being unguessable for security reasons (which you would be if you were using it as an API key, session identifier, or unguessable URL), then it's completely compromised. I admit, hard attack to pull off, but wow, imagine it even being possible to do that!
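
Here's how little work the guessing takes, as a Python sketch. It assumes the target generator follows the increment-on-same-millisecond rule; `decode`, `encode` and `next_candidates` are illustrative names, not any library's API.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
MAX_RANDOM = (1 << 80) - 1


def decode(ulid: str) -> int:
    """Turn a 26-character ULID string back into its 128-bit integer."""
    value = 0
    for ch in ulid.upper():
        value = (value << 5) | CROCKFORD.index(ch)
    return value


def encode(value: int) -> str:
    """Encode a 128-bit integer as 26 Crockford base32 characters."""
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def next_candidates(seen: str, count: int) -> list[str]:
    """IDs a spec-following generator would emit right after `seen`
    within the same millisecond."""
    base = decode(seen)
    room = MAX_RANDOM - (base & MAX_RANDOM)  # stop before the random part overflows
    return [encode(base + i) for i in range(1, min(count, room) + 1)]


print(next_candidates("01ARZ3NDEKTSV4RRFFQ69G5FAV", 2))
# ['01ARZ3NDEKTSV4RRFFQ69G5FAW', '01ARZ3NDEKTSV4RRFFQ69G5FAX']
```

The same trick works backwards (subtract instead of add) for IDs issued earlier in that millisecond.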

Of course, in the real world, it's fine. I don't like base32, but it's good enough. The crazy increment logic is a wtf, but whatever. ULIDs are most broken when you use them for what they're designed for (guaranteeing strict monotonicity at scale), but the design is so impractical you'll give up before you hit the landmines. And if used more "normally", then they'll be good enough. I still wouldn't choose ULIDs for a project I control, just because it's so odd. But I doubt you'd gain any benefit from switching to, eg, KSUIDs now.

Wow, thanks, this was more than I expected. Here are my considerations on your reply; they're likely not useful for you, but it is good for me to write this out.

1) DB performance has as much (or more) to do with time sorting as anything. But I am not an expert, I just rely on what I read from others. (hence, my questions to you, and I appreciate your time/effort on your responses here)

But, I do appreciate the inherent use of a time part, just makes me feel better about having a "randomish" id, but not losing what I felt I had previously with incremented integers.

2) This is one reason I _prefer_ ULID to other options. I am generating ULIDs at a low rate per second, and I want to use them in URLs. All lowercase is a trivial thing to implement. And I remove ambiguity about content (ie, 1,i,I,L, etc...)

So for human readable ULID is almost unparalleled from what I have found.

Is there another UID system that is just as human readable? I'd love to find something else to compare it to. (that I haven't already)

3) #2 dove-tails into #3 I guess, but yes, I look at data all day long, and the shorter the better. The fewer conflicts, misunderstandings, etc... the more human readable it is.

Is there another UID that is _more_ human readable?

4) I don't need my UIDs to be special, I just need them to be useful.

Why is something "odd" to you because of an arbitrary number like "80 bits"? Am I missing something about unique ids that require a specific hex/octet/base-10 value or they don't work properly?

As far as I know _all_ unique ids have the potential for conflicts, there is no getting around this.

A) I don't have any issues with making more than one ULID per millisecond, I spend way more CPU time on other things, I have no issue holding back ULID generation to avoid the "countable" error of multiple ULIDs that can be guessed.

Thank you for pointing out this flaw, it would be nice if it didn't exist, and maybe the libraries should have some kind of flag/option to enable guaranteed random values at the cost of speed (which, considering how long it takes to download a very large background video on a splash/hero banner on a site, I think solves what seems to be 100% of the criticisms you have against ULIDs).

B) Here's a discussion about ULID vs KSUID, which includes replies from the author you linked to from Segment, Rick Branson.

https://github.com/segmentio/ksuid/issues/8

Rick says this about ULID vs KSUID:

"If one is concerned about the extra 4 bytes of data or if millisecond precision is needed, then ULID is probably a better choice."

It seems that ULID has some advantages over KSUID, and it's not so cut and dried that KSUID is superior in all ways.

I don't see how a distributed system is an "extreme condition". If you don't have distributed ID generation, why not use an auto incrementing u64 and call it a day?

I will have at most 100 users generating an ID in a day (on the project I am using ULIDs with). The possibility of them having a conflict with generating the ULID is so low that I don't even consider it a concern.

So, "extreme condition" is one where someone could actually generate a conflict. A realm of production I don't work in. (maybe Twitter/Facebook would have this problem)

Therefore, I feel an easier to use/work with UID like ULID is both reasonable and preferable over other options.

Sadly they wrap around in 133 years. Maybe Segment don't plan on still existing in 2153? Also, base62 really should give way to base58 for any identifier that an ops person might have to type in a hurry.

Thus the quest for the perfect identifier continues.

NB: generating many unique IDs per millisecond for a long period may be a hallmark of a large distributed system, but even a small application may want this for a brief time e.g. importing bulk customer data.

I do not understand your criticism. ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't. The state that needs to be saved for that feature is exactly the last generated ULID (if it's the same millisecond still, increase, else regenerate). And if you are generating on the order of 2^40 ULIDs per millisecond, you'll have to have a larger id space anyway.

So, for me, your criticism boils down to "this is weird because it has a feature I don't need". Why would you care?

> ULID tries to guarantee monotonicity with any single generator, which is nice if you need it and irrelevant if you don't

The thing is, you don't need it. Nobody needs it. And I know this is mean, but...if you think people need it, you probably shouldn't be writing specs for things like this, because it suggests you haven't really thought about the problem.

Keep in mind:

1. If you want guaranteed monotonicity within a single generator, just increment an integer in a DB column. This is a solved problem!

2. ULID cannot guarantee monotonicity. Instead, the spec does something that will guarantee it if you use the library in an extremely specific way. The second you generate two ULIDs on different systems (or even in different processes on the same system), all bets are off. Which means you really shouldn't rely on that strict monotonicity!

3. But if you can't rely on it, then it serves no purpose. And if it serves no purpose, then you could remove it, and at a stroke it becomes much simpler to implement the spec. As you note, you don't need much state, but any state greatly complicates something like this.

> your criticism boils down to "this is weird because it has a feature I don't need"

It's a feature that literally nobody needs, implemented in a way that doesn't work. Its presence is so bizarre, it raises questions about the entire spec.

I disagree with the strong sentiment that nobody needs it. In fact, I can think of several situations where this would have been helpful. For example, "ids generated by each import/client/process are guaranteed to be sortable."

But in any case, this is such a simple spec and such an irrelevant feature that this sounds very much like bikeshedding to me now. As such, please accept my apologies and do choose what is best for you.