by hexomancer 2571 days ago
> So you have a log file inside a folder. Because it just keeps growing line by line then the folder's size is never updated. Now you have many Gb's of log file in that folder, and the folder says it is using "4Kbytes".

You misunderstood what I meant. I didn't mean we should only update if a single change is significant. I meant we only update when the cumulative change since the last update is significant. An example:

Let's say we create a 1 MB file. The next time the file changes, we only update the parent if the size has changed by more than 10% since the last update (the new size is greater than 1.1 MB or less than 0.9 MB). For example, if it is a log file and each line is 100 bytes, we update the parent after about 1000 new lines (even though the 1000th line is itself still only 100 bytes). Of course, the number of lines before the update depends on the size of the file, so if we had a 1 GB log file, we would update after about 1,000,000 new lines.
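A minimal sketch of that rule in Python, for illustration only. The names `Folder`, `File`, `reported_size` and `approx_size` are hypothetical, and it assumes each file keeps one extra attribute holding the size it last pushed to its parent. It is not a real filesystem API.

```python
from dataclasses import dataclass


@dataclass
class Folder:
    # Sum of the sizes last reported by the files inside (an estimate).
    approx_size: int = 0


@dataclass
class File:
    parent: Folder
    size: int = 0           # actual size, always exact
    reported_size: int = 0  # size last pushed to the parent (assumed extra attribute)

    def write(self, new_size: int) -> bool:
        """Set the file's size; return True if the parent was updated."""
        self.size = new_size
        drift = abs(self.size - self.reported_size)
        # Integer form of "drift <= 10% of the last reported size".
        if drift * 10 <= self.reported_size:
            return False  # cumulative change still within 10%: parent untouched
        # Push only the accumulated difference, then reset the baseline.
        self.parent.approx_size += self.size - self.reported_size
        self.reported_size = self.size
        return True
```

With a 1,000,000-byte file and 100-byte appends, `write` leaves the parent alone for the first 1000 lines and updates it on the line that pushes the cumulative change past 10%. The check only touches the file's own two attributes, not its siblings.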

It is trivial to prove that the estimate is never off by more than 10% (even for the parent folders). So this is not a "half-assed broken-by-design feature", since it provides a strong bound on the error. And at least in my personal day-to-day usage, I almost never care about the exact size of a folder; I just want a rough idea of how large it is.
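For a single folder the argument can be written out as follows. This assumes the folder's stored size is the sum of the sizes last reported by its files. The symbols are introduced here for illustration: $s_i$ is the actual size of file $i$ and $r_i$ is the size it last pushed to the parent.

```latex
|s_i - r_i| \le 0.1\, r_i \quad \text{for every file } i
\;\Longrightarrow\;
\Bigl|\sum_i s_i - \sum_i r_i\Bigr|
  \le \sum_i |s_i - r_i|
  \le 0.1 \sum_i r_i
```

So the folder's stored total is within 10% of the true total. Whether the same figure carries over unchanged to deeper ancestors depends on whether each level applies its own threshold, which is not spelled out above.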

3 comments

I understand your design (and yes, it would work for the simple cases). Even then, it all still boils down to whether you want the overhead of extra computation on every write just to get a [lower, upper] bound on the size of each folder's contents.

Then there are the complex situations (this is just a small sample I can come up with right on the spot):

What happens when a file is hard-linked under the same ancestor folder? Should its size be counted once or twice?

How do you even know the parents of a file at write time? Current (Unix) filesystems only store folder -> [inodes], where an inode in that list may be referenced by other folders. As far as I know, there is no reverse mapping from an inode to the folder(s) that contain it.

And then there are bind-mounts (similar to "folder hard-links" but not quite), special files/devices, etc.
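The hard-link question above can be made concrete with a small Python sketch. It assumes a filesystem that supports hard links. The helper name `naive_and_deduped_size` and the file names are made up for the example. It also shows that `st_nlink` only gives the number of links, not the folders holding them.

```python
import os
import tempfile


def naive_and_deduped_size(folder: str) -> tuple[int, int]:
    """Return (sum over directory entries, sum over distinct inodes)."""
    naive = 0
    deduped = 0
    seen = set()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            naive += st.st_size  # counts a hard-linked file once per name
            key = (st.st_dev, st.st_ino)
            if key not in seen:  # counts each inode only once
                seen.add(key)
                deduped += st.st_size
    return naive, deduped


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "a"))
    os.mkdir(os.path.join(tmp, "b"))
    original = os.path.join(tmp, "a", "data.bin")
    with open(original, "wb") as f:
        f.write(b"x" * 4096)
    # Second name for the same inode, under the same ancestor folder.
    os.link(original, os.path.join(tmp, "b", "alias.bin"))
    naive, deduped = naive_and_deduped_size(tmp)
    nlink = os.stat(original).st_nlink  # a link count, not a list of parents
```

Here `naive` is twice `deduped`, so the ancestor's "size" depends on which definition is picked. Neither can be maintained at write time without knowing every folder that links to the inode.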

All in all, it is a huge mess for a questionable benefit. What actual use cases are just not possible without this feature?

You still have to read the stored size on each parent to figure this out, at which point the optimization makes sense only if writes are significantly (at least 2x) more expensive than reads, and this is not true for most desktop PCs.

This really boils down to a caching problem, and well, there's a reason it's one of the two hard computer science problems.

Nope. We only have to read the parent of the last changed folder. Plus, any filesystem worth its salt already caches the frequently used parts.

If 10 files change by n bytes each, none of which reach your threshold for updating the parent individually, where are you storing the amount each file was changed since the last parent folder update until you deem it appropriate to update the parent? Your design makes no sense.

> where are you storing the amount each file was changed since the last parent folder update

The same place we store the rest of the file attributes?

I think my design does make a lot of sense but you are actively trying to not understand it.

Then each time one file changes you need to read all other files in the same folder to determine if the net change satisfies the increment condition for the parent folder!