| ripgrep doesn't require memory maps, and if they fail to open, it will fall back to a more traditional buffering strategy: https://github.com/BurntSushi/ripgrep/blob/50d2047ae2c0ce2ed... ripgrep has always had a fast traditional buffering strategy using `read` calls for searching, because I knew that mmap couldn't be used in every case. Anyway, this has been fixed for a couple years at this point, so if you're still experiencing a problem, then please file a new bug report. > As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust. This is false. The bug you're thinking about is probably https://github.com/BurntSushi/ripgrep/issues/922, which was not caused by an underlying bug in memmap. memmap did have an underlying bug with respect to file offsets, but ripgrep did not use the file offset API. The bug was caused in ripgrep itself, since I made the classic mistake of trying to predict whether an mmap call would fail instead of just trying mmap itself. That bug was fixed on master before the Windows bug was even reported: https://github.com/BurntSushi/ripgrep/commit/93943793c314e05... > You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. This isn't exclusive to Rust programmers. C tools make the same mistake all the time. Because memory maps aren't just problematic with large files on 32-bit systems, but they also don't work with virtual files on Linux. Try, for example, `ag MHz /proc/cpuinfo` and see what you get. Crazy how, you know, sometimes humans make mistakes even if they are a C programmer! And the implication that I (or the author of memmap) never read the docs for `mmap` is just absurd. If you're going to be snooty about stuff like this, then at least get the story correct. Or better yet, don't be snooty at all. |
The specific issue making the "mmap" crate incorrect was that it used a "usize" instead of "u64" for some of the functions, limiting it to 4GB files on 32-bit platforms. I believe it's this line of code: https://github.com/rbranson/rust-mmap/blob/f973ae1969b4b7e80...
Now, I'm not a mindreader, but to me this feels an awful lot like its author made a tacit assumption that mmap() is a "memory operation" that is tied to the architecture's pointer size. In similar conversations, heck, in this very discussion people were incredulous that a file can be bigger than memory and be processed.
I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.
I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes into a slice that the kernel magically reads in for me on demand".
This is just not true, not even on 64-bit platforms. On smaller devices with only 2-4GB of memory, it's entirely possible to simply run out of page table entries (PTEs). It's possible the memory space simply gets too fragmented. It's possible the kernel has other limits for processes. It's possible the that file is some virtual device with an enormous reported size. Etc, etc, etc...
The correct usage of mmap() is to use moderately-sized sliding windows of, say, 128MB at a time or whatever.
But, having said that: Your code is now correct in the sense that it won't crash, it won't have unsafety, it'll run on 32-bit just fine, and will probably work for all practical scenarios that people want to use a grep tool for. I also know that you have specific optimisations for "the whole file fits in a byte slice", so there's benefits to using the simple approach instead of a sliding window.
However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...