by lukesandberg 4431 days ago
To do that, wouldn't you have to look at every byte just to detect the runs of 0s? That would mean pulling the whole file through the memory hierarchy of your system (rather than just passing chunks from syscall to syscall). Wouldn't that alone slow you down significantly?

It depends. If the data is coming from a pipe (like core_pattern) then yes, you have to check for runs of zeroes. If it's coming from a filesystem, then there are system calls that let you skip them (specifically the SEEK_HOLE and SEEK_DATA whence values of lseek(2)).
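For the filesystem case, here is a minimal sketch of walking a file's data extents with lseek(2). It assumes Linux with glibc (hence _GNU_SOURCE); the function names data_bytes and data_bytes_in_file are made up for illustration. On filesystems without real hole support the kernel simply reports the whole file as data.

```c
#define _GNU_SOURCE /* assumption: glibc, needed for SEEK_DATA / SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sums the sizes of all data extents of fd.
 * Returns the number of data bytes, or -1 on error. */
static off_t data_bytes(int fd)
{
    assert(fd >= 0);
    off_t total = 0;
    off_t pos = 0;

    for (;;) {
        /* Jump to the next byte at or after pos that is not in a hole. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t)-1) {
            /* ENXIO means there is no data at or after pos: we are done. */
            return errno == ENXIO ? total : (off_t)-1;
        }
        /* Find where this data extent ends. End of file counts as a hole. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1)
            return (off_t)-1;

        total += hole - data;
        pos = hole;
    }
}

/* Hypothetical convenience wrapper: same as above, but takes a path. */
static off_t data_bytes_in_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t n = data_bytes(fd);
    close(fd);
    return n;
}
```

A copier would read only the [data, hole) ranges found this way and seek over everything else, so the zero runs never pass through userspace.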

Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler[1].
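The "simple loop" can be as plain as the sketch below. The name is_all_zero is just an illustration, not something from a library, and how well it gets optimised depends on the compiler and flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the len bytes at buf are all zero.
 * An empty range counts as all zero. */
static bool is_all_zero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return false; /* stop at the first non-zero byte */
    }
    return true;
}
```

Since the buffer was just filled by read(2), it is likely still in cache, which is part of why the extra pass tends to be cheap.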

If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.

In either case it should save disk space for core files which are highly sparse.
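For the pipe case, a rough sketch of what a core_pattern-style handler could do: read a block at a time, seek over all-zero blocks instead of writing them, and fix up the file size at the end. The names sparse_save, write_all and all_zero, and the 4096-byte block size, are assumptions for illustration. Short reads from a pipe mean the holes will not always line up with filesystem blocks, so this saves less than an ideal implementation would.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

enum { BLOCK = 4096 }; /* assumed block size; st_blksize would be a better guess */

static bool all_zero(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return false;
    return true;
}

/* Writes exactly n bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const unsigned char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Hypothetical helper: copies everything from in_fd into a new file at path,
 * leaving holes where a whole read chunk was zero.
 * Returns the logical file size, or -1 on error. */
static off_t sparse_save(int in_fd, const char *path)
{
    assert(in_fd >= 0 && path != NULL);
    int out_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out_fd < 0)
        return -1;

    unsigned char buf[BLOCK];
    off_t size = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            goto fail;
        if (n == 0)
            break; /* end of input */
        if (all_zero(buf, (size_t)n)) {
            /* Skip ahead instead of writing; the gap becomes a hole. */
            if (lseek(out_fd, n, SEEK_CUR) == (off_t)-1)
                goto fail;
        } else if (write_all(out_fd, buf, (size_t)n) != 0) {
            goto fail;
        }
        size += n;
    }
    /* Seeking alone does not extend the file, so set the final size
     * explicitly in case the input ended with zeroes. */
    if (ftruncate(out_fd, size) != 0)
        goto fail;
    if (close(out_fd) != 0)
        return -1;
    return size;

fail:
    close(out_fd);
    return -1;
}
```

Comparing st_size with st_blocks from stat(2) on the result shows how much space was actually saved on a given filesystem.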

[1] https://stackoverflow.com/a/1494021

The easiest way to handle things like sparse files correctly is to invoke a program like GNU dd that already has this feature built in (conv=sparse). GNU cp handles it, too, but it doesn't accept input from stdin.

Right. Sparse files are normally written by applications or kernel threads that know exactly which byte ranges are defined and allocate only those parts of the file. Further, file allocations are probably block-sized, so a region can only become a hole if every byte of each block it covers is zero.

This could be done quickly in the kernel. The RAID subsystem, which does pass the data through multiple transformations, prints throughput metrics at boot that demonstrate that.