Hacker News new | ask | show | jobs
by usefulcat 2644 days ago
The article mentions a couple of options, one of which is tombstones. In a chaining implementation, the load factor is straightforward: num_items / table_size. Adding items increases the load factor and deleting items reduces it.

With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.

The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has be recreated. This is true even if the average number of items remains relatively constant over time, and is unlike a chaining implementation.

In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.

1 comments

You can eliminate tombstones over time by:

  # any time a tombstone immediately preceeds a empty, it can be marked empty
  [ $ _ ] -> [ _ _ ]
  # any time you lookup a key, it can be swapped with a tombstone immediately preceeding it
  [ $ A ] -> [ A $ ]
  # (moving the tombstone closer to a empty that will destroy it)
  # if you don't have iterators, you can also jump over other keys
  [ $ B A ] -> [ A B $ ]
  [ $ B C A ] -> [ A B C $ ]  # etc
  # (this will cause a iterator at A to yield B again, or at B to skip A)
How well this keeps the load factor down depends on how aggressively you look for tombstone disposal opportunities, but it does keep it down.
Mutating the table on lookup seems pretty gross, though.
I would argue that "the table" is not mutated, only the internal state of its implementation. Every time you access any information, a cache at some layer below you is updated. Is that also gross?
Yes. Normally you can have one thread writing to a data structure OR many threads reading the data structure at any given time and not need to worry about them causing problems. (This situation is common enough that we have things called "reader-writer mutexes" or "shared-exclusive" mutexes.)

As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.

But you don't need to write every time, only on occasion, so you can actually use a read write lock and in the nominal case many threads can read just fine.

That said, it's probably still better to avoid this unless it's absolutely necessary to modify the underlying structure sometimes, I recently had to do this for an LRU cache.

Right. And in Rust implementing the hash table that way will suddenly make the table no longer flagged as "Sync" by the compiler, so you will be unable to share it between threads.
That's a great point. It's still a possible optimization with a compare-and-swap or if you can determine that you're in a single threaded context.
Eh, it’s the classic amortized approach. Whoever you ca “touch” the data and you’re right there already due to a lookup, it makes sense to amortize your data structure housekeeping IMO.

TBH, the right answer is always due to the users use case (Amortization and housekeeping really help with purely functional data structures), and benchmark data.

Well, it still works (just slower) if you only do fixups during {table[key] = val} operations. But honestly, if you're using a probabilistic data structure like a hash table, the ship has already sailed on gross implementation details.
No, that's the common approach with chained lists: Move found list item to first.