Hacker News new | ask | show | jobs
by tptacek 5002 days ago
This argument reminds me of when I was 15 and trying to explain Unix security to my democoder friend, who kept insisting that he could bypass file permissions by just programming directly to the hardware and reading the bits off the disk.

If you want to keep arguing that every other language merely asymptotically approaches the performance of x86_64 assembly, that is a fine (if dumb) position. But the argument that C has a monopoly on assembly code generation is silly, as is the argument that C compilers are inherently better at code generation than all other compilers. Look at the LuaJIT2 direct-threading post upthread for an infamous example of a performant dispatching structure that isn't easily expressible in ANSI C.

2 comments

> Look at the LuaJIT2 direct-threading post upthread for an infamous example of a performant dispatching structure that isn't easily expressible in ANSI C.

Sure, but the point of that post is that hand-written assembly can beat C-compiler-generated assembly. It's not arguing that any other language's runtime (even LuaJIT's own code generator) is going to generate assembly that's any better than gcc's in this case.

I do know that one of Mike Pall's stated goals with LuaJIT is to minimize the performance difference between Lua and C as much as possible. Though one of his methods of doing so (the foreign function interface: http://luajit.org/ext_ffi_tutorial.html) gives up the memory-safety of Lua to accomplish this. Even without FFI, LuaJIT is incredibly fast for functional/numerical calculations.

> If you want to keep arguing that every other language merely asymptotically approaches the performance of x86_64 assembly, that is a fine (if dumb) position.

I think what rubs some people (including me) the wrong way about these "faster than C" claims is the implied argument that C (and other non-GC'd languages) are obsolete. These claims often come from people who, as this article's author explicit states, want higher-level languages to win because they don't want to ever have to write in C or deal with anything written in C.

If people can get stuff in higher-level languages to run faster, then great, but those of us working in system-space in non-memory-safe languages are still going to beat their pants off if we combine all the tricks of their VM with domain-specific knowledge and optimizations (specifically parsing, in this case). "Faster than C" claims can sometimes feel like a big "F U" to those of us trying to innovate in system-space, like "our VM is the only system software we'll ever need." It's the Java attitude where people only want "100% pure Java" and shun JNI and anything that won't run on their VM.

Why would it be hard to generate the code for a direct-threaded interpreter?
That's exactly the question that Mike's whole post is answering. It turns out that the common optimizations like hoisting, CSE, tail-merging, etc. perform badly on direct-threaded interpreter code. Which is one of the primary reasons Mike wrote the interpreter for LuaJIT2 in hand-written assembly instead of C.
I think it is easier for the compiler for a high-level language to predict usage patterns --- which branches will be taken, which variables should be in regs or L1, which functions are best served by slower code with a small memory footprint vs. faster code with a larger footprint, what's best handled with a jump table versus with conditional jumps, &c &c --- that can a C compiler, and that over the long run languages with better, more consistent abstractions than C are going to have better runtime characteristics than the outputs of a C compiler.

Put differently: C compilers are hamstrung by all the dumb C programs a developer could express with C.

Javascript compilers have to deal with bad Javascript, don't get me wrong, but they don't have to deal so much with bad list and table implementations.

I get what the LuaJIT2 post is saying (and let's be clear: I'm not putting myself on a level with the Lua JIT people).

For what it's worth: C is my first language, and my favorite language. My background is systems programming. I would not enjoy figuring out how to reliably write a memory-mapped CSR while masking the right set of interrupts in Javascript. :)

> I think it is easier for the compiler for a high-level language to predict usage patterns

Profile-guided optimizations for C can give a C compiler a lot of the same information. It's true that these patterns could vary over the run of the program, though at that point you have to weigh the speed gained from being adaptive vs. the runtime overhead of monitoring the program's execution and doing multiple passes of code generation.

> For what it's worth: C is my first language, and my favorite language.

Yeah I know. :) A lot of what I wrote was directed more generally at the "faster than C" sentiment and not at you specifically.

> I'm not putting myself on a level with the Lua JIT people

I think you mean the LuaJIT person. I got to finally meet him this summer and verify that he is human, though I guess I can't guarantee that he doesn't have cybernetic implants or work together with others under a single pseudonym. :)

The Lua people are not like the Java people. The whole point of the luajit ffi is to make interfacing with C code really easy, without any more function call overhead than calling C to C. So it is the ideal language to interface with performing well written C code without losing the benefits.
I don't know why I am wasting my breath either since it is not my law that abstractions (if both implementations are sound) always come with a performance tradeoff. It is just a fact (therefore considered a law), not made up by me.
The opposite is often true. The right abstractions can provide a performance benefit, when they force programs to conform to constructs that are easily expressed in performant native code.

The standard C library is full of functions that are slower than equivalent functionality in higher level languages. Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings?

> Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings?

I'm sure you must know this, but this is not even remotely how strcmp() is implemented in modern libc's.

It is quite close, actually. "not even remotely" is a very strong statement.

There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86), but that doesn't change the fact that in order to compare two 128KB strings which are equal, you actually have to read 2*128KB from memory and compare each single byte (in groups of 8, if you are lucky enough with your alignments and instruction set).

Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

(a) if both strings are interned, it is enough to do a pointer comparison.

(b) if the length is not equal, the strings are not equal - a one word comparison.

(c) if both strings have been hashed before (quite likely), you can tell they are different if their hash is different - a one word comparison.

(d) if length is equal, and hash is (equal or uncomputed), you will have to do the comparison.

Whether this trade-off is worthy depends on your application. If most of your strings are 7-characters or less (as is often the case for software dealing with e.g. stock tickers), then the C approach on 64-bit archs wins hands down: you should actually have all the strings in-place because a pointer takes more memory and causes contention. However, if your strings tend to be 100 bytes or above, and many of them have equal prefixes, the Python approach wins hands down.

> There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

> Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

Sure, different abstractions have different trade-offs. All of these abstraction possibilities are available to C. strlen() isn't "the C approach," it's just the most common one. Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does. On the other hand, the reverse is not true: Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.

I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definition of "even remotely", is false.

> All of these abstraction possibilities are available to C

That's a tautology at best, and meaningless at worst. The way strcmp() is implemented, which we discussed above, is not actually available in C.

> Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does.

And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

> Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.

Pure python is more limited than C, true. But specific Python implementations (RPython, PyPy, IronPython) might have better access to some optimizations than specific C implementations.

And there's always the aspect of "what's theoretically possible" and "what happens in practice". The fact that PyPy will dynamically switch from 32-bit to 64-bit to unbounded-long-integer might make a real difference on a 32-bit machine where the code might occasionally require 2048 bits, but overwhelmingly requires just 32 bits.

It is possible to construct pathological cases where there are e.g. pow(2,128) possible type combinations within a function, the exact combination is only known from the data (but is consistent for an entire run) - in which case, PyPy will essentially compile the right program to machine code, whereas you cannot do AOT because of the number of combinations; which means a C program will essentially be an interpreter based on those types.

But I don't care about theoretically constructed pathologies. In practice, especially with time-to-implement constraints, it is not true that a C programmer has all the tools at their disposal that higher level languages have.

> > There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

> It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

Is there a sequence of fewer than 16 instructions to spot a NUL byte inside the 16 byte block?

What's a modern libc? Not being snide or anything: I have no idea. I didn't know where to look so I looked at glibc [1] and that, in fact, does seem to be how it works.

[1] http://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcm...

You might want to look at the non-generic implementations, like http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
This might be the most delicious piece of code ever written ( I cannot judge that, really), but we're talking 2307 lines for a string comparison.

I'm impressed, but also scared and amused. I always look with envy at system level guys, lacking the knowledge to play on that level. This, though, comforts me quite a bit. That's just not my definition of 'fun'.

I've never written anything non-trival in C and even that was ages ago so I didn't hope to be able judge the niceties of different implementations but I can't even tell what the algorithm is in that one.
I have heard of musl[1] and Bionic[2]. Interested in hearing about any others.

[1] http://www.musl-libc.org/

[2] http://en.wikipedia.org/wiki/Bionic_(software)

Those seem to both use basically the same algo as the glibc one, all though a little more compactly written.
Uclibc and dietlibc. Must recommend Musl for code readability and implementation quality.
Care to provide and example?

Your statement would only be true if the abstraction implemented something more efficiently than the C developer. This in of itself says nothing about C or X language for that matter being faster.

It could also be true if the abstraction lays bare optimizations not available (readily enough) to the C compiler. While this wouldn't make X faster than C as languages (what does that mean, anyway?), it could make an X compiler produce faster code than the best C compiler.
You are not seeing my point. If the JIT compiler can be made better - than so can the C compiler right? Until some point where neither can be made better. Then C would win because it does not carry the extra overhead of the other.
OK, here is an example. Let's consider 'memcpy'. There are various problems with implementing memcpy in C:

1) You can't assume that pointers are aligned. 2) (related) If the user asks for you to copy n bytes, you better not read byte n+1, even if you don't need it, because maybe it is on a different page which is not available for reading.

So, an optimised memcpy usually first has to do a bit of work to find the 'aligned segment', do that, topped and tailed with some special cases.

A JIT compiler for Javascript (for example) takes issues of alignment out of the hands of the user, so can make any assumptions it wants by keeping such choices away from users.

Of course, we could write a memcpy_aligned (and some systems have such a method), but we can't ever optimise simple memcpy as much as the equivalent in javascript.

No, not neccessarily. There could be optimization opportunities in the more abstract language that perhaps cannot ever arise in the less abstract.
Sure: interned strings.