Hacker News new | ask | show | jobs
by leotaku 1406 days ago
Some CDC implementations I have seen use a desired "average" chunk size value in addition to a minimum and maximum value. If the chunk exceeds the desired average size, the test for recognizing a byte sequence as a stop becomes more forgiving. Other solutions also retry previously processed sequences using the simpler threshold.

However, from what I've seen, these methods generally come at the cost of deduplication and/or speed. The most reliable method to avoid pathological cases seems to just be setting the min/max chunk size to a low/high enough value respectively.

If you're talking in a purely theoretical sense, I would assume that the possibility of changes affecting non-local chunks is inherent to CDC. With well-chosen parameters the likelihood of any but the closest chunks being affected just becomes low enough to be negligible.

1 comments

The problem is that pathological cases are things like a repeating pattern (or repeating byte). Another issue is deliberate attacks: if a Dolt user can craft datasets for which single row changes translate to a duplication of the entire tree (and dataset), this becomes an obvious DOS vector for hosted Dolt platforms.
From what I've seen the likelihood of triggering a pathological case with real-world non-malicious data is actually low enough to be ignored, given that the rolling hash function is well-crafted. I do agree that crafting malicious data to break deduplication in Dolt should be relatively easy, but I do not see how this could lead to DOS on e.g. a hosted Dolt platform. If I understand correctly, your proposed attack would only affect the rate of deduplication and by extension disk space used, and I would expect a hosted Dolt platform to have strict disk-space limits or use storage-based billing.