Hacker News new | ask | show | jobs
by NigelTufnel 4515 days ago
I did a stupid microbenchmark and it turned out that simple class with __slots__ was the fastest method.
2 comments

While we are at stupid microbenchmarks. The following loop in C (line being a struct, length is int):

  for (int i = 1; i <= 400000000; ++i) {
    ++line.length;
  }
Compiles to the following ASM:

  @8:
  inc eax
  dec edx
  jne @8
..as expected. The compiler realized that there is no need to actually copy the intermediate values to line.length and does everything in registers.

Now here is the same loop in Lua (everything dynamically typed, line.length must be hashed (in theory) just like in Python):

  for i = 1, 400000000 do
    line.length = line.length + 1
  end
LuaJIT 2.1 generates the following ASM:

  ->LOOP:
  addsd xmm7, xmm0
  movsd [rax], xmm7
  add edi, +0x01
  cmp edi, 0x17d78400
  jle 0x7fee13effe0  ->LOOP
The C program executes in ~0.3s, the Lua one in ~0.5s .. and those 0.5s include LuaJIT's startup, profiling, and JIT compilation time. So for a longer running program the difference would be even smaller.

Tl;dr: modern JIT compilers are amazing and can optimize away the things mentioned in the article.

I think you forgot to turn on your optimizer. GCC will optimize that loop into something along the lines of

  line.length+=400000000;
That is because it can predict the loop result during compile time.

For example, this is the entire program with your loop and a line to print the result (gcc 4.3.4/linux -O3),

  mov    $0x17d78401,%esi
  mov    $0x400654,%edi
  xor    %eax,%eax
  jmpq   0x400430 <printf@plt>
In fact without the printf all you get is "reqz ret" (not counting .init overhead in the binary). That is because the compiler detects that line.length is not used and fails to even set it.

  #include <stdio.h>

  int main(int argc,char *argv[])
  {
	struct lines
	{
		char *somedata;
		int length;
	} line;
	int i;

        line.length=0;

        for (i=0;i<=400000000;++i)
	{
		++line.length;
	}

        printf("Line len %d\n",line.length);
  }
>I think you forgot to turn on your optimizer.

I did not. I used Pelles C to compile it (with the optimizer turned on). I am not surprised that GCC managed to eliminate the pointless loop entirely in this situation. In fact I was happy that both Pelles C and LuaJIT did not realize that the whole loop was pointless and thus I did not have to come up with a more complex example.

The primary point of this was to show that JIT compilers can optimize away hash table access and dynamic typing in hot loops, not a code generation competition between LuaJIT and GCC.

I am a pretty big LuaJIT fanboy and not even I would claim that Lua compiled with LuaJIT can compete against C compiled by current versions of GCC with all optimizations turned on, at least not in most real-world cases. However, it does get amazingly close, most of the time within an order of magnitude, which means that I personally do not need to use C anymore.

Curious if you included ctypes.Structure in your benchmark.
I just did. And either a) I'm doing something terribly wrong or b) ctypes.Structure is slower than Python class with __slots__.

I would bet on a) - I've never worked with ctypes before.

EDIT: I bet on a) and I was wrong: ctypes offers no speed advantage.