Hacker News new | ask | show | jobs
by cosmos0072 253 days ago
The math looks suspicious to me, or at least how it is presented.

If, as stated, accessing one register requires ~0.3 ns and available registers sum up to ~2560 B, while accessing RAM requires ~80 ns and available RAM is ~32 GiB, then it means that memory access time is O(N^1/3) where N is the memory size.

Thus accessing the whole N bytes of memory of a certain kind (registers, or L1/L2/L3 cache, or RAM) takes N * O(N^1/3) = O(N^4/3).

One could argue that the title "Memory access is O(N^1/3)" refers to memory access time, but that contradicts the very article's body, which explains in detail "in 2x time you can access 8x as much memory" both in text and with a diagram.

Such statement would require that accessing the whole N bytes of memory of a certain kind requires O(N^1/3) time, while the measurements themselves produce a very different estimate: accessing the whole N bytes of memory of a certain kind requires O(N^4/3) time, not O(N^1/3)

5 comments

I did not interpret the article as you did, and thought it was clear throughout that the author was talking about an individual read from memory, not reading all of a given amount of memory. "Memory access, both in theory and in practice, takes O(N^⅓) time: if your memory is 8x bigger, it will take 2x longer to do a read or write to it." Emphasis on "a read or write".

I read "in 2x time you can access 8x as much memory" as "in 2x time you can access any byte in 8x as much memory", not "in 2x time you can access the entirety of 8x as much memory". Though I agree that the wording of that line is bad.

In normal big-O notation, accessing N bytes of memory is already O(N), and I think it's clear from context that the author is not claiming that you can access N bytes of memory in less time than O(N).

Nobody has ever had this confusion about the access time of hash tables except maybe in the introductory class. What you’re describing is the same reasoning as any data structure. Which is correct. Physical memory hierarchies are a data structure. Literally.

I’m confused by GP’s confusion.

This is completely false. All regularly cited algorithm complexity classes are based on estimating a memory access as an O(1) operation. For example, if you model memory access as O(N^1/3), linear search worse case is not O(N), it is O(N^4/3): in the worse case you have to make N memory accesses and N comparisons, and if each memory access in N^1/3 time, this requires N^4/3 + N time, which is O(N^4/3).
> linear search worse case is not O(N), it is O(N^4/3)

No, these are different 'N's. The N in the article is the size of the memory pool over which your data is (presumably randomly) distributed. Many factors can influence this. Let's call this size M. Linear search is O(N) where N is the number of elements. It is not O(N^4/3), it is O(N * M^1/3).

There's a good argument to be made that M^(1/3) should be considered a constant, so the algorithm is indeed simply O(N). If you include M^(1/3), why are you not also including your CPU speed? The speed of light? The number of times the OS switches threads during your algorithm? Everyone knows that an O(N) algorithm run on the same data will take different speeds on different hardware. The point of Big-O is to have some reasonable understanding of how much worse it will get if you need to run this algorithm on 10x or 100x as much data, compared to some baseline that you simply have to benchmark because it relies on too many external factors (memory size being one).

> All regularly cited algorithm complexity classes are based on estimating a memory access as an O(1) operation

That's not even true: there are plenty of "memory-aware" algorithms that are designed to maximize the usage of caching. There are abstract memory models that are explicitly considered in modern algorithm design.

You can model things as having M be a constant - and that's what people typically do. The point is that this is a bad model, that breaks down when your data becomes huge. If you're tying to see how an algorithm will scale from a thousand items to a billion items, then sure - you don't really need to model memory access speeds (though even this is very debatable, as it leads to very wrong conclusions, such as thinking that adding items to the middle of a linked list is faster than adding them to the middle of an array, for large enough arrays - which is simply wrong on modern hardware).

However, if you want to model how your algorithm scales to petabytes of data, then the model you were using breaks down, as the cost of memory access for an array that fits in RAM is much smaller than the cost of memory access for the kind of network storage that you'll need for this level of data. So, for this problem, modeling memory access as a function of N may give you a better fit for all three cases (1K items, 1G items, and 1P items).

> That's not even true: there are plenty of "memory-aware" algorithms that are designed to maximize the usage of caching.

I know they exist, but I have yet to see any kind of popular resource use them. What are the complexities of Quicksort and Mergesort in a memory aware model? How often are they mentioned compared to how often you see O(N log N) / O(N²)?

> It is not O(N^4/3), it is O(N * M^1/3). There's a good argument to be made that M^(1/3) should be considered a constant

Math isn't mathing here. If M^1/3 >> N, like in memory-bound algorithms, then why should we consider it a constant?

> The point of Big-O is to have some reasonable understanding of how much worse it will get if you need to run this algorithm on 10x or 100x as much data

And this also isn't true and can be easily proved by contradiction. O(N) linear search over array is sometimes faster than O(1) search in a hash-map.

This is untrue, algorithms measure time complexity classes based on specific operations, for example comparison algorithms are cited as typically O(n * log(n)) but this refers to the number of comparisons irrespective of what the complexity of memory accesses is. For example it's possible that comparing two values to each other has time complexity of O(2^N) in which case sorting such a data structure would be impractical, and yet it would still be the case that the sorting algorithm itself has a time complexity of O(N log(N)) because time complexity is with respect to some given set of operations.

Another common scenario where this comes up and actually results in a great deal of confusion and misconceptions are hash maps, which are said to have a time complexity of O(1), but that does not mean that if you actually benchmark the performance of a hash map with respect to its size, that the graph will be flat or asymptotically approaches a constant value. Larger hash maps are slower to access than smaller hash maps because the O(1) isn't intended to be a claim about the overall performance of the hash map as a whole, but rather a claim about the average number of probe operations needed to lookup a value.

In fact, in the absolute purest form of time complexity analysis, where the operations involved are literally the transitions of a Turing machine, memory access is not assumed to be O(1) but rather O(n).

Time complexity of an algorithm specifically refers to the time it takes an algorithm to finish. So if you're sorting values where a single comparison of two values takes O(2^n) time, the time complexity of the sort can't be O(n log n).

Now, you very well can measure the "operation complexity" of an algorithm, where you specify how many operations of a certain kind it will do. And you're right that typically comparison sorting algorithms complexities are often not time complexities, they are the number of comparisons you'll have to make.

> hash maps, which are said to have a time complexity of O(1), but that does not mean that if you actually benchmark the performance of a hash map with respect to its size, that the graph will be flat or asymptotically approaches a constant value.

This is confused. Hash maps have an idealized O(1) average case complexity, and O(n) worse case complexity. The difficulty with pinning down the actual average case complexity is that you need to trade off memory usage vs chance of collisions, and people are usually quite sensitive to memory usage, so that they will end up having more and more collisions as n gets larger, while the idealized average case complexity analysis assumes that the hash function has the same chance of collisions regardless of n. Basically, the claim "the average case time complexity of hashtables is O(1)" is only true if you maintain a very sparse hashtable, which means its memory usage will grow steeply with size. For example, if you want to store thousands of arbitrary strings with a low chance of collisions, you'll probably need a bucket array that's a size like 2^32. Still, if you benchmark the performance of hashtable lookup with respect to its size, while using a very good hash function, and maintaining a very low load ratio (so, using a large amount of memory), the graph will indeed be flat.

> if you model memory access as O(N^1/3), linear search worse case is not O(N), it is O(N^4/3)

This would be true if we modeled memory access as Theta(N^{1/3}) but that's not the claim. One can imagine the data organized/prefetched in such a way that a linear access scan is O(1) per element but a random access is expected Theta(N^{1/3}). You see this same sort of pattern with well-known data structures like a (balanced) binary tree; random access is O(log(n)), but a linear scan is O(n).

> "in 2x time you can access 8x as much memory"

is NOT what the article says.

The article says (in three ways!):

> if your memory is 8x bigger, it will take 2x longer to do a read or write to it.

> In a three-dimensional world, you can fit 8x as much memory within 2x the distance from you.

> Double the distance, eight times the memory.

the key worda there are a, which is a single access, and distance, which is a measure of time.

N is the amount of memory, and O() is the time to access an element of memory.

The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time. Though if done right all of that latency will be occupied with computation and allow people to delude themselves into thinking it doesn’t matter.

But when something is constrained two or three ways, it drastically reduces the incentive to prioritize tackling any one of the problems with anything but small incremental improvements.

> The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time.

It doesn't. Full scans are faster than accessing each memory address in an unordered way.

Let's look at a Ryzen 2600X. You can sustain 32 bytes per second from L1, 32 bytes per second from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.

You can also imagine an infiniband network that fills 2 racks, and another one that fills 50000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it's going to take exactly O(n) time for a single thread to scan the entire memory.

You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.

I don’t think I can agree to ignore the latency problem. Intermachine Latency is the source of a cube root of N term.
Once you start receiving data in bulk, the time it takes is quantity of data divided by your connection speed. Latency doesn't factor in.

Technically you need to consider the time it takes to start receiving data. Which would mean your total time is O(n + ∛n). Not O(n * ∛n). But not all nodes are ∛n away. The closest nodes are O(1) latency. So if you start your scan on the close nodes, you will keep your data link saturated from the start, and your total time will be O(n). (And O(n + ∛n) simplifies to O(n) anyway.)

I don't think we have such a lower bound: from a “theoretical" point of view (in the sense of the post), your processor could walk on the cube of memory and collect each bit one by one. Each move+read costs O(1) (if you move correctly), so you get O(n) to read the whole cube of n bits.

If you require the full scan to be done in a specific order however, indeed, in the worse case, you have to go from one end of the cube to the other between each reads, which incurs a O(n^{1/3}) multiplicative cost. Note that this does not constitute a theoretical lower-bound: it might be possible to detect those jumps and use the time spent in a corner to store a few values that will become useful later. This does look like a fun computational problem, I don't know it's exact worst-case complexity.

I think the article glossed over a bit about how to interpret the table and the formula. The formula is only correct if you take into account the memory hierarchy, and think of N as the working set size of an algorithm. So if your working set fits into L1 cache, then you get L1 cache latency, if your working set is very large and spills into RAM, then you get RAM latency, etc.
I hate big O notation. It should be O(N) = N^(1/3)

That way O is the function and it's a function of N.

The current way it's notated, O is the effectively the inverse function.

Yes, the notation seems wrong but it is because O(...) is a set, not a function. The functions is what goes inside.

So it should be f€O(...) instead of f=...

(don't know how to write the "belongs" symbol on iOS).

Here you go: ∈ (so f ∈ (...))
That helps but how do I learn it?
O(N) is not a function, though, so the notation is doing a good job of saying this. When we say "the complexity class of an algorithm is O(log N)", we mean "the function WorseCaseComplexity(N) for that algorithm is in the class O(log N)", The function "WorseCaseComplexity(N)" measures how much time the algorithm will take for any input of size N in the worse case scenario. We can also say "the average case complexity of quicksort is O(N log N)", which means "the function AverageCaseComplexity(N) for quicksort in the class O(N log N)", where AverageCaseComplexity(N) is a function that measure how much time quicksort will need to finish for an "average" input of size N.

Saying that a function f(n) is in the class O(g(n)) means that there exists an M and a C such that f(n) < C * g(n) for any n < M. That is, it means that, past some point, f(n) is always lower than C * g(n), for some constant C. For example, we can say the function f(n) = 2n+ 1 is in the class O(n), because 2n + 1 < 7n for any n > 1. Technically, we can also say that our f(n) is in the complexity class O(2^n), because 2n + 1 < 2^n for any n >= 3, but people don't normally do this. Technically, what we typically care about is not O(f(n)), it's more of a "least upper bound", which is closer to big_theta(n).

There are a lot of misconceptions and things being confused together in your post. For one complexity classes are related to decision problems, not algorithms. An algorithm does not have a complexity class, rather it's decision problems that belong to a complexity class. 3SAT is a decision problem that belongs to the NP-complete complexity class regardless of any particular algorithm that implements a solution to it.

Secondly, it's a common misconception that O(f(n)) refers to worst case complexity. O(f(n)) is an upper bound on the asymptotic growth of f(n). This upper bound can be used to measure best case complexity, worst case complexity, average case complexity, or many other circumstances. An upper bound does not mean worst case.

For example I often perform code reviews where I will point out that a certain operation is implemented with the best case scenario having O(log(N)), but that an alternative implementation has a best case scenario of O(1) and consequently I will request an update.

I expect my engineers to know what this means without mixing up O(N) with worst-case scenarios, in particular I do not expect an improvement to the algorithm in the worst case scenario but that there exists a fast/happy path scenario that can avoid doing a lot of work.

Consequently the opposite can also happen, an operation might be implemented where the worst case scenario is o(N), note the use of little-o rather than big-o, and I will request a lower bound on the worst case scenario.

> An algorithm does not have a complexity class, rather it's decision problems that belong to a complexity class. 3SAT is a decision problem that belongs to the NP-complete complexity class regardless of any particular algorithm that implements a solution to it.

Both algorithms and problems have their own complexity. The complexity of an algorithm is the time it takes that algorithm to finish. For example, quicksort is an algorithm, and its average case complexity is in O(n log n). I called O(n log n) a complexity class, this was indeed sloppy on my part - I was just trying to find a name for this set.

> Secondly, it's a common misconception that O(f(n)) refers to worst case complexity.

I know, and my post was very explicitly avoiding this misconception. I explicitly gave the example that the average case complexity of quicksort is in O(n log n).

> Consequently the opposite can also happen, an operation might be implemented where the worst case scenario is o(N), note the use of little-o rather than big-o, and I will request a lower bound on the worst case scenario.

I'm not sure what you're trying to say in this part. o(n) is basically the set of all sublinear functions, e.g. log(n), but also 1. Any function in o(n) is also in O(n), though the converse doesn't hold. So, if you know an algorithm has worse complexity in o(n), you know more information about the algorithm than if I tell you it has worse case complexity in O(n). Knowing a lower bound is still useful in either case, of course, since a constant time algorithm and a logarithmic time algorithm are still both in both sets.

No it shouldn't. The function you're talking about is typically called T(N), for "time". The problem is that you can't write T(N) = N^(1/3) because it's not exactly N^(1/3) -- for one thing, it's approximate up to a constant factor, and for another thing, it's only an upper bound. Big-O solves both of these issues: T(N) = O(N^(1/3)) means that the function T(N) grows at most as fast as N^(1/3) (i.e.: forms a relationship between the two functions T(N) and N^(1/3)). The "T(N) =" is often silent, since it's clear when we're talking about time, so at the end you just get O(N^(1/3)).
All their numbers are immediatley suspect since they admit to using ChatGPT to get their numbers. Oh, and their wonderful conclusion that something that fits fully in the CPU's caches is faster than something sitting a few hops away in main memory.