by kilburn 2571 days ago

> most files are nowhere near 10 levels deep, let alone hundreds of levels.

Filesystem code (and low-level code in general) must consider worst cases. There is a lot of software out there doing weird things. For instance, npm created very deep folder hierarchies for a long time (so deep that they ran into path length restrictions, in fact).

One way or another the worst case is going to be hit. And then what? The entire computer grinds to a halt? How is the user supposed to discover why?

> We can amortize the cost by updating the parent folder sizes only when the file size is changed by a significant amount since last update (say, 10%).

So you have a log file inside a folder. Because it just keeps growing line by line, the folder's size is never updated. Now you have many GB of log file in that folder, and the folder says it is using "4 KB".

Moreover, this propagates up the filesystem tree. In the end, your root drive has a "folder size" of X but its actual usage is Y >> X. How is that not going to confuse everyone?

Put another way, who is ever going to trust the X number? Why would you pay all that accounting penalty on every write to every file, only to end up with a half-assed broken-by-design feature?

> Also I don't see how drives and hard linked files are any different than regular files in this context.

They are different in that they exist in multiple folders at the same time (so they would trigger multiple size-updating branches).

hexomancer 2571 days ago

> So you have a log file inside a folder. Because it just keeps growing line by line, the folder's size is never updated. Now you have many GB of log file in that folder, and the folder says it is using "4 KB".

You misunderstood what I meant. I didn't mean we should only update if a single change is significant. I meant we only update when the cumulative change since the last update is significant. An example:

Let's say we create a 1 MB file. After that, we only update the parent once the file's size has changed by more than 10% since the last update (the new size is greater than 1.1 MB or less than 0.9 MB). For example, if it is a log file and each line is 100 bytes, we update the parent after 1,000 new lines (even though the 1,000th line is still only 100 bytes). Of course, the number of lines before the update depends on the size of the file, so with a 1 GB log file we would update after 1,000,000 new lines.

It is trivial to prove that the estimate is never off by more than 10% (even for the parent folders). So this is not a "half-assed broken-by-design feature": it provides strong bounds, and at least in my personal day-to-day usage I almost never care about the exact size of a folder; I just want a rough idea of how large it is.
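
To make the scheme concrete, here is a minimal in-memory sketch in Python. It is not a real filesystem: the `Folder` and `File` classes, the `reported` field, and `THRESHOLD_PERCENT` are names made up for illustration, and the sketch flushes at "10% or more" so that the numbers line up with the 1,000-line example.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_PERCENT = 10  # hypothetical knob: the 10% figure discussed above


@dataclass
class Folder:
    name: str
    parent: Optional["Folder"] = None
    reported_size: int = 0  # approximate size stored on the folder

    def apply_delta(self, delta: int) -> None:
        # Walk up to the root, adding the flushed delta to every ancestor.
        node: Optional["Folder"] = self
        while node is not None:
            node.reported_size += delta
            node = node.parent


@dataclass
class File:
    folder: Folder
    size: int = 0      # actual size in bytes
    reported: int = 0  # size last propagated to the ancestors

    def write(self, nbytes: int) -> bool:
        """Change the size by nbytes; return True if ancestors were updated."""
        self.size += nbytes
        pending = self.size - self.reported  # cumulative unreported change
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if pending != 0 and abs(pending) * 100 >= THRESHOLD_PERCENT * self.reported:
            self.folder.apply_delta(pending)
            self.reported = self.size
            return True
        return False


if __name__ == "__main__":
    root = Folder("root")
    logs = Folder("logs", parent=root)
    log = File(logs)
    log.write(1_000_000)  # creation is flushed immediately
    lines = 1
    while not log.write(100):  # append 100-byte lines until the next flush
        lines += 1
    print(lines, root.reported_size)
```

In this sketch each file's unreported change stays below 10% of its own reported size, so the sum over all files in a folder stays below 10% of that folder's reported size. That is the bound claimed above, assuming every flush reaches every ancestor.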

kilburn 2570 days ago

I understand your design (and yes, it would work for the simple cases). Even then, it all still boils down to whether you want the overhead of extra computations for every write to get a [lower,upper] bound on the size of what every folder contains or not.

Then there are the complex situations (this is just a small sample I can come up with right on the spot):

What happens when a file is hard-linked under the same ancestor folder? Should its size be counted once or twice?

How do you even know the parents of a file at write time? Current (unix) filesystems only store folder -> [inodes], where an inode in that list may be referenced by other folders. There is just no reverse mapping, inode -> folder(s) it is stored in, that I know of.

And then there are bind-mounts (similar to "folder hard-links" but not quite), special files/devices, etc.
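
A small Python sketch of the hard-link questions, assuming a filesystem that supports hard links. The helper names (`naive_and_deduped_sizes`, `demo`) and the file names are made up for illustration. `stat` reports how many names point at an inode (`st_nlink`) but not which folders hold them, and a naive walk counts the same bytes once per name.

```python
import os
import tempfile
from typing import Set, Tuple


def naive_and_deduped_sizes(root: str) -> Tuple[int, int]:
    """Return (naive, deduped) byte totals for the files under root."""
    naive = 0
    deduped = 0
    seen: Set[Tuple[int, int]] = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            naive += st.st_size  # counts every name, hard links included
            key = (st.st_dev, st.st_ino)  # identifies the underlying inode
            if key not in seen:
                seen.add(key)
                deduped += st.st_size  # counts each inode once
    return naive, deduped


def demo() -> Tuple[int, int, int]:
    """Create one 4096-byte file with two names under the same ancestor."""
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a")
        b = os.path.join(root, "b")
        os.mkdir(a)
        os.mkdir(b)
        path = os.path.join(a, "data.bin")
        with open(path, "wb") as f:
            f.write(b"x" * 4096)
        os.link(path, os.path.join(b, "alias.bin"))  # second name, same inode
        nlink = os.stat(path).st_nlink  # a count only, no parent folders
        naive, deduped = naive_and_deduped_sizes(root)
        return nlink, naive, deduped


if __name__ == "__main__":
    print(demo())
```

Either total is defensible for the common ancestor, which is the point: a folder-size feature has to pick one, and at write time the inode alone does not say which folders to update.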

All in all, it is a huge mess for a questionable benefit. What actual use cases are just not possible without this feature?

gregmac 2570 days ago

You still have to read the stored size on each parent to figure this out, at which point the optimization makes sense only if writes are significantly (at least 2x) more expensive than reads, and this is not true for most desktop PCs.

This really boils down to a caching problem, and well, there's a reason it's one of the two hard computer science problems.

hexomancer 2570 days ago

Nope. We only have to read the parent of the last changed folder. Plus, any filesystem worth its salt already caches the frequently used parts.

ComputerGuru 2570 days ago

If 10 files change by n bytes each, none of which individually reaches your threshold for updating the parent, where are you storing the amount each file has changed since the last parent folder update until you deem it appropriate to update the parent? Your design makes no sense.

hexomancer 2570 days ago

> where are you storing the amount each file was changed since the last parent folder update

The same place we store the rest of the file attributes?

I think my design does make a lot of sense but you are actively trying to not understand it.

ComputerGuru 2569 days ago

Then each time one file changes you need to read all other files in the same folder to determine if the net change satisfies the increment condition for the parent folder!
