| HN Mirror



	by ataylor284_ 3660 days ago
	> The real NtQueryDirectoryFile API takes 11 parameters

	Curiosity got the best of me here: I had to look this up in the docs to see how a Linux syscall that takes 3 parameters could possibly take 11 parameters. Spoiler alert: they are used for async callbacks, filtering by name, allowing only partial results, and the ability to progressively scan with repeated calls.

3 comments

bitwize 3660 days ago

This is a recurring pattern in Windows development. Unix devs look at the Windows API and go "This syscall takes 11 parameters? GROAN." But the NT kernel is much more sophisticated and powerful than Linux, so its system calls are going to be necessarily more complicated.

trentnelson 3660 days ago

Curiosity got the better of me recently when I re-read Russinovich's [NT and VMS - The Rest Of The Story](http://windowsitpro.com/windows-client/windows-nt-and-vms-re...), and I bought a copy of [VMS Internals and Data Structures](http://www.amazon.com/VAX-VMS-Internals-Data-Structures/dp/1...).

Comparing VMS to UNIX side by side, VMS's approach to a few key areas like I/O, ASTs and tiered interrupt levels is simply more sophisticated. NT inherited all of that. It was fundamentally superior, as a kernel, to UNIX, from day 1.

I haven't met a single person that has understood NT and Linux/UNIX, and still thinks UNIX is superior as far as the kernels go. I have definitely alienated myself the more I've discovered that though, as it's such a wildly unpopular sentiment in open source land.

Cutler got a call from Gates in 89, and from 89-93, NT was built. He was 47 at the time, and was one of the lead developers of VMS, which was a rock-solid operating system.

In 93, Linus was 22, and started "implementing enough syscalls until bash ran" as a fun project to work on.

Cutler despised the UNIX I/O model. "Getta byte getta byte getta byte byte byte." The I/O request packet approach to I/O (and tiered interrupts) is one of the key reasons behind NT's superiority. And once you've grok'd things like APCs and structured exception handling, signals just seem absolutely ghastly in comparison.

filereaper 3660 days ago

Since we're going into the history of Windows NT, VMS and Dave Cutler, I'd like to highlight this classic book on the history of all three of the above [1].

It follows the same line of narrative as The Soul of a New Machine

[1] https://www.amazon.com/Showstopper-Breakneck-Windows-Generat...

trentnelson 3660 days ago

I freaking love Showstopper! Such a great book. It's probably the most information available on David Cutler anywhere.

The author also e-mailed me saying thanks when I tweeted him how much I liked the book, which I thought was super nice.

jen20 3660 days ago

I've never met a single person who understood what they were talking about and referred to "UNIX kernels". It may be true that Linux was once less advanced than NT - this is no longer the case, despite egregious design flaws in things like epoll. It has simply never been true (for example) for the Illumos (nee Solaris) kernel.

ambrop7 3660 days ago

Which design faults do you think epoll specifically has?

I know there are lots of file descriptors not usable with epoll or rather async i/o in general and that sucks (e.g. regular files).

For networking, I find epoll/sockets nicer to work with than Windows' IOCP, because with IOCP you need to keep your buffers around until the kernel deems your operation complete. I think you have 3 options:

1) Design the whole application to manage buffers like IOCP likes (this propagates to client code because now e.g. they need to have their ring buffer reference-counted).

2) You handle it transparently in the socket wrapper code by using an intermediate buffer and expose a simple read()/write() interface which doesn't require the user to keep a buffer around when they don't want the socket anymore.

3) You handle it by synchronously waiting for I/O to be cancelled after using CancelIo. This sounds risky with potential to lock up the application for an unknown amount of time. It's also non-trivial because in that time IOCP will give you completion results for unrelated I/Os which you will need to buffer and process later.

On the other hand, with Linux such issues don't exist by design, because data is only ever copied in read/write calls which return immediately (in non-blocking mode).
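A minimal sketch of the readiness-style pattern described above, written in Python rather than C for brevity. The `socket.socketpair()` connection is only a stand-in for a real network socket, and the variable names are made up for the example. The point it tries to show: a non-blocking read either copies data into the caller's buffer right now or fails immediately, so the buffer is never held by the kernel between calls.

```python
import selectors
import socket

# A connected pair stands in for a real network connection (illustrative only).
left, right = socket.socketpair()
right.setblocking(False)

sel = selectors.DefaultSelector()
sel.register(right, selectors.EVENT_READ)

# Nothing has been sent yet, so a non-blocking read fails immediately.
try:
    right.recv(16)
    would_block = False
except BlockingIOError:
    would_block = True

left.sendall(b"ping")

# Wait for readiness, then copy the data out in a single recv_into call.
buf = bytearray(16)
ready = sel.select(timeout=1.0)
nread = right.recv_into(buf) if ready else 0
received = bytes(buf[:nread])

sel.unregister(right)
sel.close()
left.close()
right.close()
```

Because `buf` is only touched inside `recv_into`, it can be reused or freed as soon as the call returns, which is the property being contrasted with IOCP here.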

geocar 3660 days ago

The cool thing is that by gifting buffers to the kernel as policy, the kernel gets to make that choice.

In Linux, you don't, and userspace and kernelspace both end up doing unnecessary copying, and it means with shared buffers, something might poll and not-block, but actually block by the time you get around to using the buffer.

This is annoying, and it generally means you need more than two system calls on average for every IO operation in performance servers.

As a rule, you can generally detect "design faults" by the number of competing and overlapping designs (select, poll, epoll, kevent, /dev/poll, aio, sigio, etc, etc, etc). I personally would have preferred a more fleshed out SIGIO model, but we got what we got...

One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small. In Kqueue you can store both the file descriptor with activity (ident) as well as a few bytes of user data. As a result, people use a heap pointer which causes an extra stall to memory. Memory is so slow...

ambrop7 3660 days ago

> something might poll and not-block, but actually block by the time you get around to using the buffer

On Linux if a socket is set to non-blocking it will not block. I don't really understand your point with shared buffers. You wouldn't typically share a TCP socket since that would result in unpredictable splitting/joining of data.

> In Linux, you don't, and userspace and kernelspace both end up doing unnecessary copying

I'm not so sure the Linux design where copies are done in syscalls must be inherently less efficient. I'm pretty sure with either design, you generally need at least one memcpy - for RX, from the in-kernel RX buffers to the user memory, and for TX, from user memory to the in-kernel TX buffers. I think getting rid of either copy is extremely hard and would need extremely smart hardware, especially the RX copy (because the Ethernet hardware would need to analyze the packet and figure out where the final destination of the data is!). Getting rid of TX copy might be easier but still hard because it'd need complex DMA support on the Ethernet card that could access potentially unaligned memory addresses. On the other hand, I also don't think you need more than one copy, if you design the network stack with that in mind.

> you need more than two system calls on average for every IO operation in performance servers.

True but it's not obvious that this is a performance bottleneck. Consider that a single epoll wait can return many ready sockets. I think theoretically it would hurt latency rather than throughput.

> As a rule, you can generally detect "design faults" by the number of competing and overlapping designs.

On Linux, I think for sockets, there are only: blocking, select, poll, epoll. And the latter three are just different ways to do the same thing. On Windows, it's much more complicated - see this list of different methods to use sockets (my own answer): http://stackoverflow.com/questions/11830839/when-using-iocp-...

> One specific fault of epoll (compared to near-relatives) is that the epoll_data_t is very small.

Pretty much universally when you're dealing with a socket, you have some nontrivial amount of data associated with it that you will need to access when it's ready for I/O, typically a struct which at least holds the fd number. Naturally you put a pointer to such a struct into the epoll_data_t. I don't see how one could do it more efficiently outside of very specialized cases.
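The pattern of hanging a per-connection struct off the event registration can be sketched roughly as follows. This uses Python's `selectors` module, where the `data` argument plays a role loosely analogous to the pointer stored in `epoll_data_t`; the `Connection` class and its fields are hypothetical, not taken from any real server.

```python
import selectors
import socket
from dataclasses import dataclass, field

@dataclass
class Connection:
    """Hypothetical per-socket state, like the struct a C server would point to."""
    sock: socket.socket
    name: str
    inbox: bytearray = field(default_factory=bytearray)

sel = selectors.DefaultSelector()
a_peer, a_sock = socket.socketpair()
b_peer, b_sock = socket.socketpair()
for sock, name in ((a_sock, "a"), (b_sock, "b")):
    sock.setblocking(False)
    # The registration carries the state object, so no lookup by fd is needed later.
    sel.register(sock, selectors.EVENT_READ, data=Connection(sock, name))

b_peer.sendall(b"hello")

ready_names = []
for key, _events in sel.select(timeout=1.0):
    conn = key.data                      # the state comes back with the event
    conn.inbox += conn.sock.recv(64)
    ready_names.append((conn.name, bytes(conn.inbox)))

sel.close()
for s in (a_peer, a_sock, b_peer, b_sock):
    s.close()
```

Only the socket that actually received data shows up in `ready_names`, and its state object arrives together with the readiness event.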

trentnelson 3660 days ago

I qualified it as "Linux/UNIX kernel" because I wanted to emphasize the kernel and not userspace.

Solaris event ports are good, but they're still ultimately backed by a readiness-oriented I/O model, and can't be used for asynchronous file I/O.

binarycrusader 3660 days ago

Solaris event ports most certainly can be and are used for async I/O. I'm not sure how you can claim otherwise:

https://blogs.oracle.com/dap/entry/libevent_and_solaris_even...

https://blogs.oracle.com/praks/entry/file_events_notificatio...

And Solaris (unlike Linux, historically at least) supports async I/O on both files and sockets. Linux (historically) only supported it for sockets. I have no idea if Linux generally supports async I/O for files at this point.

trentnelson 3660 days ago

Let me rephrase it: there is nothing on any version of UNIX that supports an asynchronous file I/O API that integrates cleanly with the file system cache -- you can do signal based asynchronous I/O, but that isn't anywhere near as elegant as having a single system call that will return immediately if the data is available, and if not, sets up an overlapped operation and still returns immediately to the caller.

This isn't a terrible recap of async file I/O issues on contemporary operating systems: http://blog.libtorrent.org/2012/10/asynchronous-disk-io/

niels_olson 3660 days ago

Here's a nice graphical comparison of syscalls between Linux and Windows

http://www.visualcomplexity.com/vc/project.cfm?id=392

Are you saying the Windows flow looks like spaghetti only because the tested software (Apache) wasn't designed for Windows?

trentnelson 3660 days ago

Heh, 10 years old, original link doesn't work, image is tiny. And it sounds like they were comparing Linux and Apache to IIS and Windows.

It's hard to evaluate this in any way more than "yeah that's a cute spaghetti diagram". If I wanted to drag Linux through the mud visually I'd depict how much time every socket I/O op spends in vfs/fsync stuff. (i.e. you can depict anything to make your point)

jsmeaton 3660 days ago

The second image is of IIS running on Windows, not Apache. Different software on different OSes. Regardless, it doesn't seem like parent is making the argument that the NT kernel isn't complicated - just that it is superior.

1024core 3660 days ago

The joke used to be: VMS++ --> WNT

CamperBob2 3660 days ago

Yep, that'd be the spiritual predecessor to Bing Is Not Google.

rmu09 3659 days ago

First there was HAL, superior to IBM.

adamnemecek 3660 days ago

Can I ask what else is on your reading list? I ended up buying the VMS Internals book.

Also do you have an opinion on BeOS?

trentnelson 3660 days ago

I was fascinated by BeOS in the late 90s when I had a lot of enthusiasm (and little clue). All their threading claims just sounded so cool. I was also really into FreeBSD from around 2.2.5 so I got to see how all the SMPng stuff (and kqueue!) evolved, as well as all the different threading models people in UNIX land were trying (1:1, 1:m, m:n).

NT solves it properly. Efficient multithreading support and I/O (especially asynchronous I/O) are just so intrinsically related. Trying to bend UNIX processes and IPC and signals and synchronous I/O into an efficient threading implementation is just trying to fit a square peg in a round hole in my opinion.

As for reading list... I've bought so many old books lately. Here's my "makes the short list" bookshelf: http://imgur.com/DfTUVQx

And the more ridiculous one that I use as a cover page on my resume: http://imgur.com/0u9OZcN

What things in particular are you interested in?

adamnemecek 3660 days ago

Lol, I've been saying something like this for some time.

Thanks for the answer. I guess what I'm interested in is somewhat obscure/historical operating systems and also HW that are in some way superior to currently popular solutions. The more comparative the better.

Also your reading list has quite a few Oracle SQL entries so I'm guessing it's your DB of choice. What features are you using that aren't available in MySQL or Postgres?

trentnelson 3660 days ago

As far as commercial vendors go I really preferred SQL Server from about version 2000 onwards. I got into Oracle again recently for a consulting project with a finance client (who basically have unlimited Oracle licenses) and really quite enjoyed it since the Oracle 7/8 days.

You can do some phenomenally sophisticated things... I extensively leveraged things like partitioning, parallel execution (dbms_parallel_execute!), lots of PL/SQL using the pipelined table cursor stuff, data mining stuff (dbms_frequent_itemset!), index-organized tables, and my god, bitmap indexes were a godsend, direct insert tricks for bulk data loading, external tables were fantastic (you can wrap a .csv in an external table and interact with it in parallel just like any other table -- great for ingesting large amounts of janky .csv data from other parts of the business).

The parallel execution and robust partitioning options were probably the most critical pieces that have no particularly good counterpart in open source land.

akavel 3660 days ago

Curious if you've ventured into (good/modern? maybe QNX?) microkernels at some point and have some thoughts on them by chance?

trentnelson 3659 days ago

I have not I'm afraid. Hard enough keeping how all the parts of NT work in my head at the same time ;-)

tremon 3660 days ago

> But the NT kernel is much more sophisticated and powerful than Linux

That does not follow from the example. All it shows is that Microsoft prefers to put a lot of functionality in one interface, while Linux probably prefers low-level functions to be as small as possible, and probably offers things like filtering on a higher level (in glibc, for example).

Neither explanation has anything to do with sophistication. I personally believe that small interfaces are a better design.
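As a rough illustration of filtering at a higher level than the syscall, here is a sketch in Python. The helper name `list_matching` and the temporary files are made up for the example, and this is not a claim about how glibc itself does it.

```python
import fnmatch
import os
import tempfile

def list_matching(path, pattern):
    """Yield names in `path` matching `pattern`; the filtering happens in user space."""
    with os.scandir(path) as entries:      # entries are produced incrementally
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name

with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.log"):
        open(os.path.join(tmp, name), "w").close()
    matches = sorted(list_matching(tmp, "*.txt"))
```

The directory-reading primitive stays small, and the name filter is an ordinary library function layered on top of it.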

bitwize 3660 days ago

Actually it does, as it mentioned that the extra parameters are for things like async callbacks and partial results.

The I/O model that Windows supports is a strict superset of the Unix I/O model. Windows supports true async I/O, allowing a process to start I/O operations and wait on an object like an I/O completion port for them to complete. Multiple threads can share a completion port, allowing for useful allocation of thread pools instead of thread-per-request.

In Unix all I/O is synchronous; asynchronicity must be faked by setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing. It adds complexity to code to simulate what Windows gives you for real, for free. And sometimes it breaks down; if I/O is hung on a device the kernel considers "fast" like a disk, that process is hosed until the operation completes or errors out.
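The arrangement described above, where several threads wait on one shared completion object, can be approximated in portable Python. This is only an analogy: `queue.Queue` stands in for the completion port, the "completions" are fabricated tuples, and no real Windows API is called.

```python
import queue
import threading

completions = queue.Queue()    # stand-in for an I/O completion port
results = []
results_lock = threading.Lock()

def worker():
    # Each worker sleeps in get() until a completion packet arrives; no polling loop.
    while True:
        packet = completions.get()
        if packet is None:          # sentinel: shut this worker down
            break
        request_id, payload = packet
        with results_lock:
            results.append((request_id, payload.upper()))

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

# Pretend three I/O operations finished; any idle worker may pick up each one.
for request_id, payload in enumerate([b"alpha", b"beta", b"gamma"]):
    completions.put((request_id, payload))

for _ in workers:
    completions.put(None)
for t in workers:
    t.join()

handled = sorted(results)
```

Which worker handles which packet is not fixed in advance, which is the "work is not bound to a worker" property the thread keeps returning to.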

piscisaureus 3660 days ago

I wrote the windows bits for libuv (node.js' async i/o library), so I have extensive experience with asynchronous I/O on Windows, and my experience doesn't back up parent's statement.

Yes, it's true that many APIs would theoretically allow kernel-level asynchronous I/O, but in practice the story is not so rosy.

* Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asynchronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.

* For sockets, the downside of the 'completion' model that Windows uses is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

* Many APIs may support asynchronous operation, but there are blatant omissions too. Try opening a file without blocking, or reading keyboard input.

* Windows has many different notification mechanisms, but none of them are both scalable and work for all types of events. You can use completion ports for files and sockets (the only scalable mechanism), but you need to use events for other stuff (waiting for a process to exit), and a completely different API to retrieve GUI events. That said, unix uses signals in some cases which are also near impossible to get right.

* Windows is overly modal. You can't use asynchronous operations on files that are open in synchronous mode or vice versa. That mode is fixed when the file/pipe/socket is created and can't be changed after the fact. So good luck if a parent process passes you a synchronous pipe for stdout - you must special case for all possible combinations.

* Not to mention that there aren't simple 'read' and 'write' operations that work on different types of I/O streams. Be ready to ReadFileEx(), Recv(), ReadConsoleInput() and whatnot.

IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

trentnelson 3660 days ago

You're completely missing how the NT I/O subsystem works, and how to use it optimally.

> * Asynchronous disk I/O is in practice often not actually asynchronous. Some of these cases are documented (https://support.microsoft.com/en-us/kb/156932), but asynchronous I/O also actually blocks in cases that are not listed in that article (unless the disk cache is disabled). This is the reason that node.js always uses threads for file i/o.

The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

Things like extending a file, opening a file, that's not typically hot-path stuff. If you're doing a network oriented socket server, you would submit such a blocking operation to a separate thread pool (I set up separate thread pools for wait events, separate to the normal I/O completion thread pools), and then that I/O thread moves on to the next completion packet in its queue.

> * For sockets, the downside of the 'completion' model that Windows uses is that the user must pre-allocate a buffer for every socket that it wants to receive data on. Open 10k sockets and allocate a 64k receive buffer for all of them - that adds up quickly. The unix epoll/kqueue/select model is much more memory-efficient.

Well that's just flat out wrong. You can set your socket buffer size as large or as small as you want. For PyParallel I don't even use an outgoing send buffer.

Also, the new registered I/O model in 8+ is a much better way to handle socket buffers without the constant memcpy'ing between kernel and user space.

> IMO the Windows designers got the general idea to support asynchronous I/O right, but they completely messed up all the details.

I disagree. Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

haberman 3660 days ago

> The key to NT asynchronous I/O is understanding that the cache manager, memory manager and file system drivers all work in harmony to allow a ReadFile() request to either immediately return the data if it is available in the cache, and if not, indicate to the caller that an overlapped operation has been started.

The Microsoft article cited above (https://support.microsoft.com/en-us/kb/156932) directly contradicts you:

> Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

Microsoft is directly saying that it reserves the right to violate the guarantee you are counting on at any time, and it documents several known cases of this. You can try to guess when this will happen and put those I/O operations on a different thread pool, but you're just playing whack-a-mole. And you're violating Microsoft's own recommendations.

4ad 3660 days ago

> Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

I wrote Windows drivers and file systems for about 10 years, and Unix drivers and file systems also for about 10 years.

I'd rather practice subsistence agriculture for the rest of my life than deal with Windows drivers again.

thwarted 3660 days ago

> I disagree. Write a kernel driver on Linux and NT and you'll see how much more superior the NT I/O subsystem is.

Can programming against the userspace interface to the I/O subsystem really be compared to programming against the kernel driver interface to the I/O subsystem? In Linux, kernel drivers have access to structures, services, and layers that userspace doesn't. And can these be compared between a monolithic and a micro-kernel approach, other than what has been debated ad nauseam for micro/monolithic kernels in general (not just used for I/O)?

SwellJoe 3660 days ago

I have never worked with systems-level Windows programming, so I don't know the answer to this...but, how is what you're describing better than epoll or aio in Linux or kqueues on the BSDs?

I'm guessing you're coming from the opposite position of ignorance I am (i.e. you've worked on Windows, but not Linux or other modern UNIX), though, since "setting O_NONBLOCK and buzzing in a select loop, interleaving bits of I/O with other processing" doesn't describe anything developed in many, many years. 15 years ago select was already considered ancient.

MarkSweep 3660 days ago

I think IO Completion Ports [1] in Windows are pretty similar to kqueue [2] in FreeBSD and Event Ports [3] in Illumos & Solaris. All of them are unified methods for getting change notifications on IO events and file system changes. Event Ports and kqueue also handle unix signals and timers.

Windows will also take care of managing a thread pool to handle the event completion callbacks by means of BindIoCompletionCallback [4]. I don't think kqueue or Event Ports have a similar facility.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

[2]: https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2

[3]: https://illumos.org/man/3C/port_create

[4]: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

trentnelson 3660 days ago

BindIoCompletionCallback is very old, the new threadpool APIs should be used, e.g.: https://github.com/pyparallel/pyparallel/blob/branches/3.3-p...

Regarding the differences between IOCP and epoll/kqueue, it all comes down to completion-oriented versus readiness-oriented.

https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

trentnelson 3660 days ago

To quote myself:

> The “Why Windows?” (or “Why not Linux?”) question is one I get asked the most, but it’s also the one I find hardest to answer succinctly without eventually delving into really low-level kernel implementation details.

> You could port PyParallel to Linux or OS X -- there are two parts to the work I’ve done: a) the changes to the CPython interpreter to facilitate simultaneous multithreading (platform agnostic), and b) the pairing of those changes with Windows kernel primitives that provide completion-oriented thread-agnostic high performance I/O. That part is obviously very tied to Windows currently.

> So if you were to port it to POSIX, you’d need to implement all the scaffolding Windows gives you at the kernel level in user space. (OS X Grand Central Dispatch was definitely a step in the right direction.) So you’d have to manage your threadpools yourself, and each thread would have to have its own epoll/kqueue event loop. The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O). You need to be able to disassociate the work from the worker. The work is the invocation of the data_received() callback, the worker is whatever thread is available at the time the data is received. As soon as you’ve bound a file descriptor to a per-thread set, you prevent thread migration.

> Then there’s the whole blocking file I/O issue on UNIX. As soon as you issue a blocking file I/O call on one of those threads, you have one thread less doing useful work, which means you’re increasing the time before any other file descriptors associated with that thread’s multiplex set can be served, which adversely affects latency. And if you’re using the threads == ncpu pattern, you’re going to have idle CPU cycles because, say, only 6 out of your 8 threads are in a runnable state. So, what’s the answer? Create 16 threads? 32? The problem with that is you’re going to end up over-scheduling threads to available cores, which results in context switching, which is less optimal than having one (and only one) runnable thread per core. I spend some time discussing that in detail here: https://speakerdeck.com/trent/parallelism-and-concurrency-wi.... (The best example of how that manifests as an issue in real life is `make -jN world` -- where N is some magic number derived from experimentation, usually around ncpu x 2. Too low, you’ll have idle CPUs at some point, too high and the CPU is spending time doing work that isn’t directly useful. There’s no way to say `make -j[just-do-whatever-you-need-to-do-to-either-saturate-my-I/O-channels-or-CPU-cores-or-both]`.)

> Alternatively, you’d have to rely on AIO on POSIX for all of your file I/O. I mean, that’s basically how Oracle does it on UNIX – shared memory, lots of forked processes, and “AIO” direct-write threads (bypassing the filesystem cache – the complexities of which have thwarted previous attempts on Linux to implement non-blocking file I/O). But we’re talking about a highly concurrent network server here… so you’d have to implement userspace glue to synchronize the dispatching of asynchronous file I/O and the per-thread non-blocking socket epoll/kqueue event loops… just… ugh. Sure, it’s all possible, but imagine the complexity and portability issues, and how much testing infrastructure you’d need to have. It makes sense for Oracle, but it’s not feasible for a single open source project. The biggest issue in my mind is that the whole thing just feels like forcing a square peg through a round hole… the UNIX readiness file descriptor I/O model just isn’t well suited to this sort of problem if you want to optimally exploit your underlying hardware.

> Now, with Windows, it’s a completely different situation. The whole kernel is architected around the notion of I/O completion and waitable events, not “file descriptor readiness”. This seems subtle but it pervades every single aspect of the system. The cache manager is tightly linked to the memory management and I/O manager – once you factor in asynchronous I/O this becomes incredibly important because of the way you need to handle memory locking for the duration of the I/O request and the conditions for synchronously serving data from the cache manager versus reading it from disk. The waitable events aspect is important too – there’s not really an analog on UNIX. Then there’s the notion of APCs instead of signals which again, are fundamentally different paradigms. The deeper you dig the more you appreciate the complexity of what Windows is doing under the hood.

> What was fantastic about Vista+ is that they tied all of these excellent primitives together via the new threadpool APIs, such that you don’t need to worry about creating your own threads at any point. You just submit things to the threadpool – waitable events, I/O or timers – and provide a C callback that you want to be called when the thing has completed, and Windows takes care of everything else. I don’t need to continually check epoll/kqueue sets for file descriptor readiness, I don’t need to have signal handlers to intercept AIO or timers, I don’t need to offload I/O to specific I/O threads… it’s all taken care of, and done in such a way that will efficiently use your underlying hardware (cores and I/O bandwidth), thanks to the thread-agnosticism of Windows I/O model (which separates the work from the worker).

> Is there something simple that could be added to Linux to get a quick win? Or would it require architecting the entire kernel? Is there an element of convergent evolution, where the right solution to this problem is the NT/VMS architecture, or is there some other way of solving it? I’m too far down the Windows path now to answer that without bias. The next 10 years are going to be interesting, though.

https://groups.google.com/forum/#!topic/framework-benchmarks...

wahern 3659 days ago

"The problem with adding a file descriptor to a per-thread event loop’s epoll/kqueue set is that it’s just not optimal if you want to continually ensure you’re saturating your hardware (either CPU cores or I/O)"

Both epoll and kqueue permit multiple threads to poll the same event set. Normally you do this in tandem with edge-triggered readiness (EPOLLET on Linux, EV_CLEAR on BSD) so that only one thread will dequeue an event.
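A small sketch of several threads waiting on one shared epoll set, assuming Linux (Python's `select.epoll` does not exist elsewhere). `EPOLLET` is the flag named above; `EPOLLONESHOT` is an addition of this sketch, used so the registration is disarmed once one thread has taken the event. The thread names and the socketpair are made up for the example.

```python
import select
import socket
import threading

peer, sock = socket.socketpair()
sock.setblocking(False)

ep = select.epoll()  # Linux-only
# EPOLLET is the flag from the comment; EPOLLONESHOT is this sketch's own
# assumption, so that a single thread claims the event.
ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLET | select.EPOLLONESHOT)

claimed = []
claimed_lock = threading.Lock()

def waiter(name):
    # Every thread blocks on the same epoll instance.
    for _fd, _events in ep.poll(1.0):
        data = sock.recv(64)
        with claimed_lock:
            claimed.append((name, data))

threads = [threading.Thread(target=waiter, args=(f"t{i}",)) for i in range(3)]
for t in threads:
    t.start()

peer.sendall(b"x")      # one readiness edge

for t in threads:
    t.join()

ep.close()
peer.close()
sock.close()
```

Which of the three threads ends up in `claimed` is up to the kernel; the other two simply time out after a second.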

How do you think IOCP is implemented in Windows? There's a thread pool in the kernel which _literally_ polls a shared event queue. It's just hidden so you can pretend it's magical. But conceptually it works almost identically to how you would do it in Unix.

The benefit of IOCP is that it's a native API. Its warts and shortcomings notwithstanding, developers never even need to think about how it's actually implemented. Whereas with epoll and kqueue you either have to roll your own framework, or select from various third-party options. Seeing how the sausage is made can turn some people off. But just because you don't see the gory details doesn't mean it's implemented using magical fairy dust.

There's much to recommend Windows, and many things the NT kernel conceptually gets right. But IOCP vs polling? The only real difference architecturally is how much of the stack sits in user-space vs kernel-space, and how much of the stack is delivered by Microsoft (all of it in the case of IOCP) vs other sources (in Linux, glibc does AIO, while all the event loop and callback code is provided by various libraries or written yourself).

Putting more of the stack in kernel-space doesn't magically make it easier to perform optimizations. That's marketing speak and kernel fetishism. You have to first show why those optimizations can't be achieved elsewhere, like in the I/O or process scheduler. Various Linux components traditionally are more performant (e.g. process scheduling) than in Windows, so many of the optimizations wrt IOCP are arguably clawing back performance lost elsewhere in the system.

zxcvcxz 3660 days ago

http://pyparallel.org/wrk-rps-comparison2.svg

According to your website pretty much every other technology runs better on Linux than it does on Windows, and of course pyparallel runs better than everything you tested.

How can I run these tests myself? I specifically want to test it against golang.

abaines 3660 days ago

I believe Linux since 2.5 has 'proper' asynchronous system calls, io_getevents(2) and co.

Further information: https://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

inopinatus 3660 days ago

Based on what you describe here, I'd say your comparative understanding is about two decades out of date. Async I/O has been a capability in various Unix and Unix-alike kernels for that long.

trentnelson 3660 days ago

As in signal based AIO? Have you ever tried to use it in a high performance network server, where you want to have reads also satisfied from the cache if possible?

Because that is like pulling teeth on UNIX. See: https://groups.google.com/forum/#!topic/framework-benchmarks...

ckaygusu 3660 days ago

I think the problem here is not a syscall taking 11 parameters, it's a syscall that merely lists what is inside a directory taking 11 parameters. ataylor284_ explained the reasons (how convincingly, I'd argue) but at first sight that surely smells of bloat.

I'd also object to the NT kernel being more "powerful". Sure, unixy kernels and NT have their differences but I don't think either one is superior.

bigger_cheese 3660 days ago

11 parameters may seem bloated but in some cases Unix syscalls weren't designed with enough parameters, which caused a bunch of pain necessitating things like

dup->dup2->dup3, pipe->pipe2, rename->renameat->renameat2

Best practice nowadays in linux is to allow overloading syscalls via a flags parameter.

see https://lwn.net/Articles/585415/

So modern linux syscalls may be bloated too.
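As a rough illustration of the flags pattern mentioned above, here is a minimal Python sketch (Linux-only, since `os.pipe2` is not available on every platform). It contrasts the older two-step approach, `pipe` followed by `fcntl`, with `pipe2`, where the flags travel in the same call. The helper names `pipe_then_fcntl` and `pipe_with_flags` are made up for this example.

```python
import fcntl
import os


def pipe_then_fcntl():
    """Old style: create the pipe, then adjust each end with extra calls."""
    r, w = os.pipe()
    for fd in (r, w):
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    return r, w


def pipe_with_flags():
    """New style: pipe2 takes the flags up front, in a single call."""
    return os.pipe2(os.O_NONBLOCK | os.O_CLOEXEC)


if __name__ == "__main__":
    for make in (pipe_then_fcntl, pipe_with_flags):
        r, w = make()
        print(make.__name__, "non-blocking:", not os.get_blocking(r))
        os.close(r)
        os.close(w)
```

With the flags variant there is no window between creating the descriptors and configuring them, which is the usual argument for adding a flags parameter instead of yet another numbered syscall.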

rbanffy 3660 days ago

I remember the struct I had to populate to start a new process in 1997 or 1998...

darkengine 3660 days ago

It may be more "sophisticated" (sounds like a more positive synonym of "complex" to me), but I certainly don't think it's more powerful.

deprave 3660 days ago

Since when is kernel complexity a measure of quality...? :)

jasonm23 3660 days ago

hmmm...

Usage of the adjective "sophisticated" always precedes an outpouring of either ignorance or straight bs.

pjmlp 3660 days ago

Also UNIX devs seem to forget how cumbersome the X11, Xlib and Motif APIs are.

pbarnes_1 3660 days ago

This was maybe the case at Linux 2.0, but is not the case now.

Also, Windows development is infinitely more painful than Unix/Linux.

uudecode 3660 days ago

"... so its system calls are going to be necessarily more complicated."

Are you implying that an increase in "power" can never be achieved through increasing simplicity?

bitwize 3660 days ago

That's the thing. Just by glancing at the API docs, Windows looks more complicated but where the rubber meets the road in terms of real high-performance application development, Windows is way simpler. In Windows you can do in one syscall what would take several in Linux. You can schedule I/O calls across multiple threads in a completely thread-safe manner without having to manage the synchronization yourself -- and since threads go to sleep entirely while waiting for I/O operations to complete, there is no chewing up CPU cycles in a select/epoll loop. So yes, writing "hello world" or simple filters is simpler in Unix -- but writing multithreaded server applications that maximize throughput is simpler in Windows.

Unix is bristling with features designed to "allow you to save me some time". It was designed to make it easy to write quick, "one-off" programs in C. VMS -- the predecessor to Windows NT -- was designed to run long-lasting, high-performance, high-reliability business applications for real users with money on the line (i.e., not just hackers) and Windows NT inherits this legacy.

uudecode 3660 days ago

"It was designed to make it easy to write quick, "one-off" programs in C."

OK, so it just so happens this is what I love to do. I like writing small programs and continually trying to improve them.

So I guess I should be a UNIX user?

Is NT not good for this too?

BTW, I do like VMS. But despite the NT kernel, using NT feels nothing like using VMS.

trentnelson 3660 days ago

I love writing NT-style C-level (no CRT, just pure C and whatever the CNF/Cutler Normal Form style is).

zxcvcxz 3660 days ago

>the NT kernel is much more sophisticated and powerful than Linux

Source?

It's not sophisticated enough or powerful enough to be the most used kernel on super computers (and in the world). Windows pretty much only dominates the desktop market. Servers, super computers, mainframes, etc, mostly use Linux.

A few years ago there was even a bug in Windows that caused degradation in network performance during multimedia playback. It was directly connected with mechanisms employed by the Multimedia Class Scheduler Service (MMCSS), which is used on a lot of audio setups. If they can't even get audio setups right how can people consider anything Windows releases "sophisticated"?

It's made to do anything you throw at it I guess, it's definitely complicated, but powerful and sophisticated aren't words I would use to describe NT.

recursive 3660 days ago

If you're arguing in favor of linux, you probably shouldn't use any arguments that deal with getting audio setups right.

bitwize 3660 days ago

Indeed.

I would go so far as to say that a large part of why audio is such a CF under Linux is -- wait for it -- lack of real asynchronous I/O.

Audio is asynchronous by nature, and to do that right under Linux you need a "sound server" with all the additional overhead, jank, and shuffling of data between kernel and at least two different user spaces that implies. Audio under Linux was best with OSS, which was synchronous in nature and not suitable for professional applications. JACK mitigated that somewhat, but for an OS to do audio right you need a kernel-level async audio API like Core Audio or whatever Windows is doing these days.

makomk 3660 days ago

Windows has a sound server too, you know. I believe Core Audio on Mac does too. A large part of why audio is such a CF under Linux is that PulseAudio is incredibly badly written and poorly maintained. My favourite was the half-finished micro-optimization that broke some of the resamplers because the author got bored and never modified them to handle the change, which somehow made it all the way into a release. I dread to think what they'd do with powerful tools like asynchronous I/O.

cbd1984 3660 days ago

Audio on Linux works fine in my experience.

tacos 3660 days ago

Hey everyone, we found him!

madez 3660 days ago

I don't want to take sides in this discussion but share an anecdote. Hey, maybe even someone knows a solution for this.

I have a PC connected via on-board HDMI to a Denon AVR solely for the purpose of getting the audio to the amplifier. Windows doesn't let me use that audio interface without extending or mirroring my desktop to that HDMI port. Since there is no display connected to the AVR I don't want to extend the desktop, and mirroring heavily decreases performance of the system.

On Debian Sid the computer by default allows me to use the HDMI audio without doing anything to my desktop display. It seems the system realizes that there is no display connected to the AVR but it's still a valid sink for audio.

orbifold 3660 days ago

Well for various definitions of fine I guess.

cbd1984 3660 days ago

It works fine as in "I can listen to audio on my laptop from multiple programs at once, with a centralized way to control audio volume on a per-application or per-sound-device basis." I literally cannot imagine any audio system doing better than that given the hardware I have to work with.

zxcvcxz 3660 days ago

Works very well for me too. Don't know why you got downvoted. It's like it's 1994 in here...

xodjoshd 3660 days ago

But this is exactly what made me switch, windows was preventing me from accessing my sound card directly in order to record a remote interview.

I use Linux regularly to record and edit audio, it's free, it works, and I don't have to worry that my OS is actively reducing the functionality of my equipment.

Sanddancer 3660 days ago

They got audio setups right. The reason the network degradation happened is that video and audio playback were given realtime priority so background processes couldn't cause pops, stutters, etc. At the time Vista was released, most home users didn't have a gigabit network, so the performance degradation would only affect a small number of users, and most would prefer good audio and video performance to a slowdown in network performance for a small percentage of users. With today's massively multicore systems, it's even less of an issue, while linux still has a problem with latency in applications like pro audio.

makomk 3659 days ago

The reason the network degradation happened is that Microsoft couldn't figure out how to stop heavy network activity causing audio glitches on some systems even after giving audio realtime priority, so they hacked around it by adding a fixed 10,000 packets-per-second cap on network activity regardless of system speed or CPU usage (less if you had multiple network adapters). See https://blogs.technet.microsoft.com/markrussinovich/2007/08/... This was just as much of an issue on multicore systems because the cap was unaffected by the system speed and chosen based on a slow single-core machine.

pbarnes_1 3660 days ago

I don't know why you're getting downvoted since the parent is basically stating random opinions about "power" and "sophistication" without anything to actually back it up.

11 param functions don't say "power" to me. They say "poorly thought out API design". Much can be said for most Windows APIs in general.

tptacek 3660 days ago

Overloaded system call entrypoints are a fact of life on all mainstream platforms. Consider for instance "ioctl".

marvy 3660 days ago

I've heard that Plan 9 doesn't have ioctl. But I guess that doesn't count as mainstream.

wahern 3659 days ago

Plan 9 replaces ioctl with special files that require writing magic incantations.

It's similar to the various knobs in Linux /proc which require reading and writing specially formatted data. ioctl is simpler in that you don't need to worry as much about formatting the data (the struct declarations take care of that for you), but a file-oriented interface is nicer in that it's a higher-level abstraction--for example, it maps better to different languages, similar to how ioctl requires C or C-like shims whereas /proc can be used from any language that understands open/read/write/close, including the shell.
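To make the trade-off above concrete, here is a small Linux-only Python sketch. The first helper goes through `ioctl`, so the caller has to know a request code (`FIONREAD`, used here only as a convenient example) and pack a C-layout integer. The second reads a `/proc` file with plain open/read and parses text. The function names are hypothetical, and `/proc/self/status` is just one readable file chosen for illustration.

```python
import fcntl
import os
import struct
import termios


def pending_bytes_ioctl(fd):
    """ioctl style: a request code plus a C-layout buffer the caller must pack."""
    raw = fcntl.ioctl(fd, termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def pid_from_proc():
    """File style: ordinary open/read, then parse the text that comes back."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Pid:"):
                return int(line.split()[1])
    raise RuntimeError("no Pid line found")


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"hello")
    print("bytes waiting in pipe:", pending_bytes_ioctl(r))
    print("pid matches:", pid_from_proc() == os.getpid())
```

The `ioctl` path needs `struct` to mirror the C declaration, while the `/proc` path would work equally well from a shell with `grep`, at the cost of parsing text.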

deprave 3660 days ago

A lot of Microsoft APIs and subsystems are similarly bloated. There are probably tons of factors at play, but I believe being closed-source and having to support many individual use cases is one fundamental reason. (See for example CreateProcess vs. fork...)

bitwize 3660 days ago

When it comes to system call interfaces, it's because Dave Cutler has forgotten more than many modern "kernel hackers" will ever know about how to design an OS.

deprave 3660 days ago

I appreciate the name-dropping.

Dave Cutler's skills aside, Unix predates Windows by decades, and to anyone remotely familiar with kernel development it is clear that the sheer quantity and complexity of subsystems stem from the fact that nobody but Microsoft can actually see, modify, and redistribute Windows' source code.

Unless you can actually say "here's why Windows is qualitatively better" and point out specific tasks Windows does better, I'll just point you to the fact that the internet infrastructure and most of the servers on it, along with every Apple desktop and pretty much every mobile device, run Unix.

pjmlp 3660 days ago

> I'll just point you to the fact that the internet infrastructure and most of the servers on it, along with every Apple desktop and pretty much every mobile device, run Unix.

I wonder how much of the internet infrastructure would run Unix if free (as in beer) clones like *BSD and GNU/Linux did not exist in the first place.

How much internet infrastructure would actually run Unix if ISPs had to choose between Aix, HP-UX, Solaris, Digital UX, Tru64 and Windows licenses?

Free is always more valued than quality.

kelnos 3660 days ago

I'd guess that, even more important than the cost, up until recently, interacting with Windows in a headless mode was next to useless. Most sysadmins in my experience avoid GUIs like the plague when managing servers.

jamespo 3660 days ago

Are you too young to remember "the network is the computer"?

And of course this worked in reverse, when Netscape released their commercial webserver Microsoft rushed to give away IIS.

pjmlp 3660 days ago

I started coding in the 80's and I remember how the UNIX market was afraid of Windows NT workstations, before they actually started losing market share to the free (beer) UNIX versions in the form of BSD and GNU/Linux.

umanwizard 3660 days ago

Why is it valid to conflate every kernel that runs a *nix OS together under "Unix"? Is there any meaningful overarching "Unixy" way in which, say, Xnu and Linux are similar?

Locke1689 3660 days ago

NT's approach to async IO is at least somewhat empirically better as it does not require an extra context switch between receiving a "ready" event and actually performing the IO operation.

gpderetta 3660 days ago

Completion notification requires committing a (potentially cache-hot) buffer for an operation that might not complete until some time in the future. With readiness notification you only need to have the buffer ready when you know you will use it.

Also, on a high-performance poll/epoll/kevent based system you only need to poll cold fds, while you can do speculative direct read/writes to hot fds, so no need for extra syscalls in the fast case.

That doesn't mean that completion notification doesn't have its advantages, especially when coupled with NT's built-in auto-sizing thread pool, but it is not strictly better.
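The "speculative read on hot fds, poll only cold fds" idea can be sketched roughly as follows in Python. This is an illustration under assumptions, not anyone's production loop: the socket is assumed to be in non-blocking mode already, and `read_speculative` and its return labels are invented for the example.

```python
import selectors
import socket


def read_speculative(sock, sel, bufsize=4096, timeout=None):
    """Return (data, path) where path is 'fast', 'polled' or 'timeout'."""
    try:
        # Hot path: just try the read, with no readiness syscall first.
        return sock.recv(bufsize), "fast"
    except BlockingIOError:
        pass  # nothing buffered: this fd is "cold" right now

    # Cold path: ask the selector to wake us when the fd becomes readable.
    sel.register(sock, selectors.EVENT_READ)
    try:
        if not sel.select(timeout):
            return None, "timeout"
        return sock.recv(bufsize), "polled"
    finally:
        sel.unregister(sock)


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)
    sel = selectors.DefaultSelector()
    b.send(b"hot")
    print(read_speculative(a, sel))                # data already buffered
    print(read_speculative(a, sel, timeout=0.01))  # nothing pending
```

Note that the receive buffer only comes into play at the moment `recv` is called, which is the readiness-model property described above; a completion-style API would need the buffer handed over when the operation is submitted.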

trentnelson 3660 days ago

You could do that if you really wanted on Windows; just set a 0 byte user space send and receive socket buffer.

You can do dual synchronous/asynchronous socket I/O in Windows. I use this very approach with PyParallel (and 0 byte send buffers): https://github.com/pyparallel/pyparallel/blob/branches/3.3-p...

Depending on current load, that will either immediately do an asynchronous operation, or attempt synchronous non-blocking ones up to a certain number, then fall back to asynchronous.

Described here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

deprave 3660 days ago

You must mean "theoretically" because "empirically" implies that you're basing your statement on observations. Where are the numbers? :)

ksk 3660 days ago

To level the playing field, and if we are to take your opinion seriously, it would be beneficial to know what books or articles or whitepapers you have read to inform yourself about the design of the NT kernel.

deprave 3660 days ago

I'm sorry, who's "we"? You replied to a comment that presented fact. The fact is that Windows use is mostly limited to desktops. As far as you're concerned, I could be entirely illiterate and my argument would still hold because it's based on fact. If you want to claim Windows is superior, please don't point us to design documentation. Show us actual numbers and use cases where Windows outperforms Unix, or is used in critical infrastructure, etc.

(FYI, I've read the "internals" book for a relatively old version of Windows, along with plenty of books about attacking the Windows kernel through its huge attack surface that exists to accommodate various needs of various software vendors...)

ksk 3659 days ago

>You replied to a comment that presented fact.

False. You ASSUMED facts based on correlation - "It's used everywhere" doesn't mean anything. "But people must have a reason to use them" STILL doesn't mean anything. "Well, so why don't they use windows" STILL doesn't get you anywhere.

> I could be entirely illiterate and my argument would still hold because it's based on fact.

No. You can't enter into an argument when you know nothing about the subject. That is not how it works.

Why don't YOU present actual facts about the design? Show us you actually understand the internals of the kernel or have at least some rudimentary knowledge. Otherwise you'd just be wasting everyone's time.

bitwize 3660 days ago

There are more server deployments of Windows than there are of Linux.

This is because Linux's server workload is mainly the Web. But every departmental office needs an Exchange server...

xorblurb 3660 days ago

Well, that's an (unbacked) opinion, and I don't share it. NT's design is not too bad (obviously especially in contrast with consumer Windows), especially given what was achieved in the first few releases (made easier by Cutler's serious experience in the area), but now it is far from brilliant, and it has its (huge) share of problems that every serious user of both Windows and Unix-based OSes knows.

Now at one point, way in the past, NT was far above Linux, and there were Linux fanboys who did not even know what they were talking about, yet had strong opinions about the superiority of the kernel they used. Now we are ironically in the opposite situation: Linux has basically caught up on all the things that matter (preemptive kernel, stability, versatility, scalability) and then quickly overtook NT, yet some people like to talk endlessly about the supposed architectural superiority of NT, which did not deliver anything concrete and widely used in the real world over the long term, and which MS had to work around and/or redo with another approach (while keeping vestiges of all the old ones) to do all its modern stuff.

What kernel hackers know how to do is detect problems in architectures that look neat on paper. Brilliant ones are able to anticipate them. I don't even have to: history has shown where NT has been held back by its original design.

trentnelson 3660 days ago

You, I like you. You get it.