by pjdesno 1252 days ago

Nowadays ext4 has dir_index enabled by default, so it uses hashed B-trees for its directories.

Ric Wheeler posted this nearly a decade and a half ago: "Strangely enough, I have been testing ext4 and stopped filling it at a bit over 1 billion 20KB files on Monday (with 60TB of storage)." and goes on to describe some performance numbers - which would be a lot better on modern hardware. https://listman.redhat.com/archives/ext3-users/2009-Septembe... There's a talk about it, as well: https://lwn.net/Articles/400629/

Unfortunately, it seems that a lot of applications (including ls) default to rather inefficient ways of enumerating files in a directory.
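For illustration, here is a minimal Python sketch of one cheaper way to enumerate a huge directory: stream the entries as the kernel returns them, without sorting and without a `stat()` call per file. The helper names `iter_names` and `count_entries` are made up for this example, and this is only one possible approach, not a description of what any particular tool does.

```python
import os
from typing import Iterator


def iter_names(path: str) -> Iterator[str]:
    """Yield entry names one at a time, unsorted, without calling stat()."""
    # os.scandir reads the directory lazily, so memory use stays flat
    # even when the directory holds a very large number of entries.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


def count_entries(path: str) -> int:
    """Count entries without building a list of all names in memory."""
    return sum(1 for _ in iter_names(path))


if __name__ == "__main__":
    print(count_entries("."))
```

The point of the sketch is what it leaves out: no full list of names held in memory, no sort, and no per-file metadata lookup.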

5 comments

aftbit 1252 days ago

Thanks for sharing those articles. I found some of the specifics of price and size amusing in retrospect.

> With regard to solid-state storage, Ric noted only that 1Tb still costs a good $1000. So rotating media is likely to be with us for a while.

Here in 2023, 1 TB fast flash costs ~$50-100.

> What if you wanted to put together a 100Tb array on your own? They did it at Red Hat; the system involved four expansion shelves holding 64 2Tb drives. It cost over $30,000, and was, Ric said, a generally bad idea. Anybody wanting a big storage array will be well advised to just go out and buy one.

Nowadays, you can get 16 TB spinning rust disks for ~$15/TB, so a 96 TB array (without redundancy) would take 6 disks and cost $1500. If you wanted to use mirrors for speed and simple redundancy, you could build a full NAS that fits in 2U with a flash cache in front of 12 x 16 TB disks for well under $5000.

cduzz 1252 days ago

Yeah, directories with a gazillion files tend to work okay if you already know the name of the file you're working with. ls has a -f flag which turns off sorting and turns off looking at the inode of the underlying files (is it a directory or a socket? ls needs to know that to set the right color).

Then there's also fun to be had when you start deleting those things. I've lost count of the number of times people are surprised that a delete is as expensive as a create or any other I/O operation.

sliken 1252 days ago

Sure, that helps, for looking up a single file. Doesn't however help with ls or du. Even things like what are the 10 biggest files in this directory are painful.
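To make the "10 biggest files" case concrete, here is a rough Python sketch. It keeps only the current top N in memory, but it still has to look up the size of every single file, which is exactly the cost the hashed directory index does not remove. The function name `largest_files` is hypothetical.

```python
import heapq
import os
from typing import Iterator, List, Tuple


def largest_files(path: str, n: int = 10) -> List[Tuple[int, str]]:
    """Return the n largest regular files in path as (size, name) pairs."""

    def sizes() -> Iterator[Tuple[int, str]]:
        with os.scandir(path) as entries:
            for entry in entries:
                # Each size lookup may cost a stat() call per file; the
                # directory index only speeds up lookups by name.
                if entry.is_file(follow_symlinks=False):
                    yield entry.stat(follow_symlinks=False).st_size, entry.name

    # nlargest keeps at most n candidates at a time instead of
    # sorting the full listing.
    return heapq.nlargest(n, sizes())


if __name__ == "__main__":
    for size, name in largest_files("."):
        print(f"{size:>12}  {name}")
```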

I've seen numerous efforts (Microsoft and BeOS spring to mind) to replace the filesystem with a database. Not aware of any big successes though.

lamontcg 1252 days ago

The bigger problem is backing things up.

The old image level 3 servers at Amazon were just image files laid down in a filesystem (hashed, with a directory hierarchy so that massive numbers of files per directory were not the issue). The problem it ran into was that you couldn't ever take one of them offline and you couldn't stream off of one of the block devices, so you were stuck enumerating through all the files. Those were something like 32kB average filesystem (or possibly slightly smaller). And that was on spinning rust with something like a 4ms seek time between files, and the end result was something like a couple months to go through the whole filesystem.

This is why the GoogleFS paper uses chunk sizes of something like 64MB so that data can be efficiently streamed.
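A back-of-the-envelope sketch of why the object size matters so much. The 4 ms seek and the 32 kB and 64 MB sizes come from the comment above; the 100 MB/s sequential bandwidth is purely an assumed figure for illustration, as is the simple model of one seek per object.

```python
SEEK_SECONDS = 0.004              # per-object seek, from the comment above
SEQ_BYTES_PER_SECOND = 100e6      # ASSUMPTION: sequential disk bandwidth


def effective_throughput(object_bytes: float) -> float:
    """Bytes per second when every object costs one seek plus its transfer."""
    transfer_seconds = object_bytes / SEQ_BYTES_PER_SECOND
    return object_bytes / (SEEK_SECONDS + transfer_seconds)


small = effective_throughput(32 * 1024)          # 32 kB image files
large = effective_throughput(64 * 1024 * 1024)   # 64 MB chunks

if __name__ == "__main__":
    print(f"32 kB objects: {small / 1e6:.1f} MB/s")
    print(f"64 MB chunks:  {large / 1e6:.1f} MB/s")
```

Under these assumptions the small files are dominated by seek time and reach well under a tenth of the disk's sequential bandwidth, while the 64 MB chunks get almost all of it.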

lamontcg 1250 days ago

*32kB average filesize

ak217 1252 days ago

ext4 still has an inode storage quota that is a fixed fraction of the total space available, so for small files, you will run out of inode space before you run out of block storage space. This alone makes xfs (which dynamically allocates inode storage) more reliable in these scenarios.
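If you want to watch for this, a small Python sketch using `os.statvfs` (Unix only) can report inode usage alongside block usage, much like `df -i`. The helper name `inode_usage` is made up here, and some filesystems report zero total inodes, which the sketch treats as "no fixed limit reported".

```python
import os
from typing import Optional, Tuple


def inode_usage(path: str = "/") -> Tuple[int, int, Optional[float]]:
    """Return (total inodes, used inodes, percent used or None)."""
    st = os.statvfs(path)
    total = st.f_files            # total inodes on the filesystem
    used = total - st.f_ffree     # f_ffree is the number of free inodes
    percent = 100.0 * used / total if total else None
    return total, used, percent


if __name__ == "__main__":
    total, used, percent = inode_usage("/")
    if percent is None:
        print("filesystem does not report a fixed inode count")
    else:
        print(f"inodes: {used}/{total} ({percent:.1f}% used)")
```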

sidpatil 1252 days ago

> ext4 still has an inode storage quota that is a fixed fraction of the total space available, so for small files, you will run out of inode space before you run out of block storage space.

This would explain why Web hosting providers often include inode limits in their terms of service.

ilaksh 1252 days ago

Maybe use exa?