by exDM69
871 days ago
I think those are the benchmarks I alluded to in the GP post. Table 1 on page 3 is absolute gold: it quantifies the indirect costs by listing the number of cache lines and TLB entries evicted. The numbers are much larger than I remembered. According to the table, the simplest syscall tested (stat) evicts 32 L1 icache lines, a few hundred L1 dcache lines, hundreds of L2 lines, thousands of L3 lines, and about twenty TLB entries. After returning from the syscall, you pay a cache miss for each evicted line you still need. Also worth noting that inside the syscall, the instructions per clock (IPC) is less than 0.5. When the CPU is happy, you generally see IPC figures around 2 to 3.
Anyway, a 20-line example of a program written against said interpreter is https://github.com/c-blake/batch/blob/1201eefc92da9121405b79... but that only needs the wdcpy fake syscall, not the conditional jump forward (although that could/should be added if the open can succeed but the mmap can fail and you want the close clean-up included in the batch as well, etc., etc.).
I believe Cassyopia (also mentioned in Soares) hoped to be able to analyze code in user-space with compiler techniques to automagically generate such programs, but I don't know that the work ever got beyond a HotOS paper (i.e. the kinda hopes & dreams stage), and it was never clear how fancy the anticipated batches would be. The Xen/VMware multi-calls Soares2010 also mentions do not seem to have inline copy/jumps, though I'd be pretty surprised if that little kernel module is the only example of it.
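
To make the copy and jump-forward idea concrete, here is a toy user-space model in Python. It is only a sketch: the entry layout, the op names (`CALL`, `COPY`, `JUMP_IF_ERR`) and the helpers `run_batch` and `stat_batch` are made up for illustration and are not the ABI of the kernel module linked above. A real implementation would run this loop inside the kernel so that the whole batch costs one user/kernel crossing; here every `os.*` call is still its own syscall.

```python
import os

# Hypothetical op codes for this toy model (not the real module's ABI).
CALL, COPY, JUMP_IF_ERR = "call", "copy", "jump_if_err"


def run_batch(batch):
    """Interpret a list of entries in order and return per-entry results."""
    results = [None] * len(batch)
    pc = 0
    while pc < len(batch):
        entry = batch[pc]
        kind = entry[0]
        if kind == CALL:
            _, func, args = entry
            try:
                results[pc] = func(*args)
            except OSError as exc:
                # Mimic the kernel convention of returning -errno on failure.
                results[pc] = -exc.errno
        elif kind == COPY:
            # Copy the result of entry `src` into argument slot `arg_idx`
            # of the later entry `dst` (the role a copy pseudo-call plays).
            _, src, dst, arg_idx = entry
            batch[dst][2][arg_idx] = results[src]
        elif kind == JUMP_IF_ERR:
            # Skip `skip` entries forward if entry `src` returned -errno.
            _, src, skip = entry
            if isinstance(results[src], int) and results[src] < 0:
                pc += skip
        pc += 1
    return results


def stat_batch(path):
    """Build an open -> fstat -> close batch; None marks a slot filled by COPY."""
    return [
        (CALL, os.open, [path, os.O_RDONLY]),  # 0: returns fd or -errno
        (JUMP_IF_ERR, 0, 4),                   # 1: open failed, skip the rest
        (COPY, 0, 3, 0),                       # 2: fd -> fstat's first arg
        (CALL, os.fstat, [None]),              # 3
        (COPY, 0, 5, 0),                       # 4: fd -> close's first arg
        (CALL, os.close, [None]),              # 5
    ]
```

Calling `run_batch(stat_batch(some_path))` returns a list whose entry 0 is the fd (or a negative errno) and whose entry 3 is the `os.stat_result`, or `None` if the jump skipped it. The forward-only jump keeps the interpreter trivially terminating, which is presumably part of why it is attractive for in-kernel use.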