by Thaxll 1374 days ago
How can it be faster than a static page that is already in memory, the bytes are there you just send them over a socket? Transforming some template to rust code back to string buffer is somehow faster?
6 comments

I don't think the author is claiming it is faster than a static site stored in memory, they're saying it is faster than a traditional static site that loads files from the disk. At least that's how I read it.

That “traditional” site doesn’t actually load the data from disk, in practice. It does once, after a reboot, but that’s true for this solution’s executable file as well.

Does Apache/Nginx/IIS load static files in memory ahead of time? I would assume no, unless someone went through and did some optimizations. Even so, there is always a point where memory runs out, and in that case a templating engine is essentially compression. I would assume if the author outputted his whole website as static files and stored them in memory it would be even faster, but that would require quite a bit more memory.

> Does Apache/Nginx/IIS load static files in memory ahead of time?

Linux loads them on the first usage. If you have enough memory, they'll just stay there. It doesn't take that much memory, most sites are pretty small.

But the article's way does use less memory and fewer system calls, and is completely optimized for that one site only. So yeah, it will surely be faster. Besides, their site appears to not be static.

> Linux loads them on the first usage.

Yes, but.

The problem with OS file caches has ever been that people look at a box, see that the programs aren't consuming all of the available memory, and argue that they should be able to cram more shit on the box because it's 'underutilized'.

There are very reasonable and sane system architectures that let the OS handle caching, but you need a way to defend against these sorts of situations.

The performance falloff for this failure mode is exponential, so people try it a few times, and not getting any negative feedback, they add it to their toolbox only to get lectured months later once the bad behavior has not only become standard for them but also spread to other people.

It almost begs for a different system call that can earmark the memory usage by the app in a way that's easier for people to see.

With Apache/Nginx etc., a file is cached by the VM/FS on the first request and will stay in RAM for a long time unless there is memory pressure. For most sites this is good enough. For cases where it isn't, one can pre-load files after a reboot using find /path -type f -exec cat {} + > /dev/null.
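
For anyone who would rather do the pre-load from a script than from find, here is a minimal Python sketch of the same idea: read every regular file under a directory once and throw the bytes away, so the kernel has a chance to keep them in the page cache. The function name preload_tree and the chunk size are made up for illustration, and nothing here guarantees the pages stay resident under memory pressure.

```python
import os


def preload_tree(root, chunk_size=1 << 20):
    """Read every regular file under `root` once and discard the data.

    Hypothetical helper: roughly what `find root -type f -exec cat {} +`
    does, with the output thrown away. Returns the number of bytes read.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror `-type f`: skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total
```

Calling preload_tree("/path") after a reboot would touch the same files as the find command above; the returned byte count is only there to make it easy to see how much data was read.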
Author here. I don't identify as male. It would be nice if you could update your comment to not make a factual error when referring to me. Please use https://pronoun.is/they.

Thanks!

They don't, but the OS can pre-fetch files ahead of time. ZFS will load "hot" files as soon as it can.

You can also easily preload things into memory at boot yourself, so static websites usually don't serve files from disk.

My thoughts as a CDN engineer:

It can be a tiny amount more efficient since an async disk IO implementation might dispatch the file read() call to a thread pool, wait for the result, and then send the data back to the client. Makes 2 extra context switches compared to sending data from memory. Now if the user is super confident that the data is hot and in page cache then a synchronous disk read will fix the problem. Or trying a read with RWF_NOWAIT and only falling back to a thread pool if necessary.
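
A rough Python sketch of the "try RWF_NOWAIT first, fall back to a thread pool" idea, assuming Linux and a Python build that exposes os.preadv and os.RWF_NOWAIT. The names read_hot_or_fallback and _blocking_read are hypothetical, and a real server would await the pool future instead of blocking on it as this sketch does.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pool for reads that might block on the disk.
_pool = ThreadPoolExecutor(max_workers=4)


def _blocking_read(fd, size, offset):
    # Ordinary positional read; may sleep if the data is not in the page cache.
    return os.pread(fd, size, offset)


def read_hot_or_fallback(fd, size, offset=0):
    """Return (data, used_fast_path).

    Fast path: preadv() with RWF_NOWAIT, which fails instead of waiting
    when the data is not immediately available. Any OSError (including
    "would block" or "not supported") or a short read sends the request
    to the thread pool instead.
    """
    nowait = getattr(os, "RWF_NOWAIT", None)
    if nowait is not None and hasattr(os, "preadv"):
        buf = bytearray(size)
        try:
            n = os.preadv(fd, [buf], offset, nowait)
            if n == size:
                return bytes(buf), True
        except OSError:
            pass  # cold data or unsupported filesystem: use the slow path
    # Slow path: this sketch simply waits for the worker thread's result.
    return _pool.submit(_blocking_read, fd, size, offset).result(), False
```

When the file is hot, the fast path avoids the hand-off to another thread entirely; when it is not, the cost is one failed syscall on top of the usual thread pool round trip.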

On the other hand rendering a template on each request also requires CPU, which might be either more or less expensive than doing a syscall.

All in all the efficiency differences are likely negligible unless you run a CDN which does thousands of requests per second.

In terms of throughput to the end user it will make zero measurable difference unless the box ran out of CPU.

The file is most likely cached in memory (by the OS). Even if there is a read, I assume it's going to be faster than running some code in Rust.
On the one hand, sure, you can probably squeeze some cycle or two out of buffering everything in memory. Even though your disk read is a memory read in all likelihood given how filesystem caching works, it's still an IO call, which isn't free.

Keeping everything in user space buffers might just be faster.

On the other hand, you're sending that sucker over network, and what you save doing this is most likely best counted in microseconds/request. It's piss in the ocean compared to the delay introduced even over a local network.

> Even though your disk read is a memory read in all likelihood given how filesystem caching works, it's still an IO call, which isn't free.

I wonder if io_uring could be used to issue a single syscall that would read data from disk (actually using page cache) and send it on the network.

Of course, you could use DPDK or similar technologies to do the opposite - read the data from disk once and keep it in user-space buffers, then write it directly to NIC memory without another syscall. That should still theoretically be faster, since there would be 0 syscalls per request, where the other approach would require 1 per request.

> I wonder if io_uring could be used to issue a single syscall that would read data from disk (actually using page cache) and send it on the network.

Only if you don't care about HTTP/2 and TLS. And if you don't care about those, you can as well do sendfile() from a thread.
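
For reference, the plain sendfile() path (no TLS, no HTTP/2) looks roughly like this in Python, assuming Linux, where os.sendfile can copy from a regular file to a socket inside the kernel. The helper name send_file_over_socket is made up for this sketch, and it assumes a blocking socket.

```python
import os
import socket


def send_file_over_socket(sock, path):
    """Send the whole file at `path` over `sock` using sendfile().

    Hypothetical helper: the bytes go from the page cache to the socket
    inside the kernel, without being copied into a user-space buffer.
    Returns the number of bytes sent.
    """
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # sendfile() may send fewer bytes than requested, so loop.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break  # file shrank underneath us; stop instead of spinning
            sent += n
    return sent
```

Each call is still one syscall per chunk, which is the cost the io_uring and DPDK ideas above are trying to reduce or remove.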

You can do kernel TLS for sendfile at least, maybe for io_uring too? Probably not for HTTP/2, but I'm not convinced multiplexed tcp in tcp is a good protocol for the public internet anyway.
That's indeed possible, if one has a TLS stack which supports KTLS. However, I don't think there are too many of those yet, and probably even fewer in Rust, where both the library and a potential Rust wrapper would need to support it.
My site does serve itself out of a UNIX socket, so sendfile() may actually work. But most of the data is served with handler functions though.
With a static page generator, the file is still on disk. You've got a read cache for the disk, but that's not entirely reliable.
Just like the Rust executable is still on disk. Sure, if it is running, it is memory mapped, but it can still be paged out. This is not theoretical. In practice, upon request, the probability of finding the static page in cache should be similar to the probability of the executable not being paged out. (That's true as long as the actual data is the same, and the differing factors, like the size of the web server executable, are small compared to the amount of free memory.)
Author here. That server doesn't have swap enabled. It can't be paged out.
If the executable is on disk, it can be paged out unless you've used mlock() to tell the kernel to keep it resident.
Interesting. I've never heard of that happening before. Can you link a reference to where I can find out more about that aspect of the linux memory subsystem?
The mlock manual[0] has a "notes" section that provides a good brief summary. The GNU libc manual has more than anyone would ever want to read about memory management, including a section on memory locking[1].

On an intuitive level, think of swap as being a place the kernel can put memory the program has written. When you malloc(4096) and write some bytes into it, the kernel can't evict that page to disk unless there's some swap space to stick it in. However, executables are different because they're already on disk -- the in-memory version is just a cache (everything is cache (computers have too many caches)). The kernel is allowed to drop the copy of the program it has in memory, because it can always read it back from the original executable.

[0] https://man7.org/linux/man-pages/man2/mlock.2.html

[1] https://ftp.gnu.org/old-gnu/Manuals/glibc-2.2.3/html_chapter...

It's one of the reasons why running without swap can have even worse pathological behaviour than running with swap. With swap the kernel can prioritise keeping code in RAM over little-used data, whereas without it, when RAM fills up with data, eventually the currently running hot code gets swapped out and performance completely tanks, meaning the system doesn't actually hit the nice OOM error you hope it would (hence userspace utilities like earlyoom to kick in before the kernel's absolute last resort strategy).
I believe that when a file is mmap'd a page table is created for it in the . As you perform reads/writes on the file, a fault loads the actual entries into that page table. Just as pages can be mapped, they can also be unmapped under pressure, without falling back to swap (since it is already a file-backed map, you wouldn't swap a file-backed map to a different file after all).

There are a few relevant bits to this. You can use MAP_POPULATE to prepopulate the entries for the file, and MAP_LOCKED to do what MAP_POPULATE does and also lock the pages in (unreliably). As mentioned in the man page for mmap, MAP_LOCKED has some failure modes that you don't get with mlock.
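
A small Python sketch of the MAP_POPULATE part, assuming Linux and a Python version whose mmap module exposes MAP_POPULATE (it falls back to a normal lazy mapping where the flag is missing). The function name map_file_prefaulted is hypothetical, and this only pre-faults the pages; it does not lock them, so they can still be dropped later.

```python
import mmap
import os


def map_file_prefaulted(path):
    """Map `path` read-only and ask the kernel to pre-fault its pages.

    Hypothetical helper. MAP_POPULATE is a Linux flag; getattr() yields 0
    where Python does not expose it, which gives a plain lazy mapping.
    The file must not be empty (a zero-length mapping is an error).
    """
    populate = getattr(mmap, "MAP_POPULATE", 0)
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(
            fd,
            0,  # length 0 means "map the whole file"
            flags=mmap.MAP_SHARED | populate,
            prot=mmap.PROT_READ,
        )
    finally:
        # The mapping keeps its own reference to the file, so the fd can go.
        os.close(fd)
```

Reading from the returned object (for example m[:64]) then hits pages that are already mapped, as long as the kernel has not reclaimed them in the meantime.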

https://www.man7.org/linux/man-pages/man2/mmap.2.html

I also found this page: https://eklitzke.org/mlock-and-mlockall

Oh, and this: https://access.redhat.com/documentation/en-us/red_hat_enterp...

If swap is off there is nowhere to page it out
I was thinking the same. They said a Go precompiled version was faster, but was 200MB. Which I don't understand.

200MB of pages and assets, sure. Code? No. If you compile it into the binary then the storage is no worse than having a small binary and all the resources separate.

Taking a statically generated site and returning the raw bytes is 100% faster. The author said so themselves.

Technically with go:embed you could trivially make a blog server that is just a 200MB binary with all of it embedded.

If you did it that way, now all your content is basically mmapped into memory, which means (probably) fewer syscalls.

Soo it might've shaved half a microsecond maybe ?

Well for one it's not static lol. I don't think they're claiming it's faster than a static website, are they?