Hacker News new | ask | show | jobs
by twawaaay 1453 days ago
> Are you claiming that it is faster in practice to insert an element at the beginning of an array with 1G items, than it is to insert it at the beginning of a linked list with 1G items?

You missed that insertion point is selected at random and must be found as part of the operation. But in practice if you know data will be inserted preferentially at the wrong end -- you can just easily reverse the data structure to make insertions prefer the better end.

> I am certain the first item will not be true

I am used to it. Everybody gets it wrong until they spend some time thinking about it, get it explained and/or run a real life test.

Hint: think about the cost of iterating over half of the linked list versus performing binary search over the array AND moving half of it.

The cost of binary search goes to negligible pretty fast and cost of moving 8 bytes of memory is always going to be lower than the cost of iterating over one entry (16 bytes) of linked list. And you have statistically equal number of both assuming you are selecting values at random with equal chance for each 64 integer to be next choice.

CPU can be moving a lot of bytes at the same time but it has to dereference the pointer before it can get to perform comparison before it can dereference the next pointer...

Actual algorithmic complexity of both algorithms is exactly the same. But an array is much more efficient (instructions per item).

3 comments

You are right and this is will know, but not only for the reason you state. Cache locality is so much better on an array/vector compared to linked list and that is really important.
Good point. Cache locality is another important reason that I should have mentioned. I have not run tests, but it may possibly be the most important reason of them all.

In many linked data structures you can mitigate a lot of this problem. For example, in the past I have used contiguous arrays where I would allocate linked elements, then I would try to exploit how the elements are produced to get them placed one after another, if possible.

When data is produced randomly you can try to create "buckets", where each bucket is backed by an array and attempts to keep consecutive elements relatively close to each other. You could then try to rewrite those buckets in a way that gets amortised. Or you can just straight write it as if it was an array of sorted linked list elements where on each delete or insert you have to move some of the elements, but you only need to reorganise one bucket and not others.

All a bit cumbersome and at the end of it you are still wasting at least half of the cache on pointers.

Here is a video by Bjarne himself explaining this: https://youtu.be/YQs6IC-vgmo

Note there is a use case for linked list, inserting/deleting one or few items once in a while. But in general you can just use vector for anything that is not a leetcode problem.

A fun related thing I first learned about years ago due to a Dr Dobbs article: some strided access patterns are FAR worse than random access patterns. This is because certain strides hit the worst case behavior of limited associativity caches, dramatically reducing the effective cache size.
> The cost of binary search goes to negligible pretty fast and cost of moving 8 bytes of memory is always going to be lower than the cost of iterating over one entry (16 bytes) of linked list. And you have statistically equal number of both assuming you are selecting values at random with equal chance for each 64 integer to be next choice.

That makes a lot of sense, thank you for the more detailed explanation.

I think you misunderstood the question.

If you insert an element at the beginning, there's no list iteration cost.

Please, read the question.

It stipulates that elements are selected at random.

You select k integers at random.

Some of them might be inserted faster into linked list, but when you average it over k integers you will still be slower because inserting into array will be faster, on average.

Just think what happens if the integer is inserted at the END of the list (same probability...) You need to slowly iterate over entire linked list. While you quickly land at the result with a binary search on an array and then pay almost no cost inserting it there.

If you think about it, ON AVERAGE, you have to search through half of the list (if you are linked list) or move half of the array (if you are an array).

The parts of simiones's message you quoted are not about the average case. They are about the corner case when an item would be added the beginning, meaning the linked list has the biggest advantage.
You don't get to change the rules of the game while the game is being played.

The original problem, as stated by me, was:

"You are generating k random 64 bit integers. Each time you generate the integer, you have to insert it in a sorted collection.

(...)

The question: in your estimation, both algorithms being implemented optimally, what magnitude k needs to be for the linked list to start being more efficient than array list."

The cost is not evaluated for any individual insertion, only for the total cost of inserting k integers.

For some of the integers the cost of inserting to linked list will be higher, and for some the cost of inserting to array will be higher. It does not matter. We do not care. We just care about what is the cost of inserting k integers thus randomly generated a) into sorted linked list and b) into sorted array list.