by valarauca1 4315 days ago
How complex? In one second a modern processor can do ~1 billion operations (ish: some are faster, some are slower, and sometimes several complete in the same clock tick), even on a slow one like the Core 2 architecture.

This means they have time for about 200 million instructions per request (ignoring internal disk I/O and network I/O).

That amount of work is insane!

:.:.:

I want to say they're doing something fundamentally wrong, and it has nothing to do with their language.

The i7 can dispatch 4 instructions per cycle. In practice, I find it can realistically execute about 2 per cycle. So at 2GHz, that's closer to 4 billion instructions per second, or ~800M instructions per request as per your calculation.
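To make the arithmetic in this exchange explicit, here is a small sketch. The request rate of 5 per second is not stated anywhere above; it is inferred from the parent's own figures (1 billion operations per second divided by 200 million per request), and the function name is made up for illustration.

```python
def instructions_per_request(clock_hz, instructions_per_cycle, requests_per_second):
    """Rough per-request instruction budget, ignoring all I/O waits."""
    return clock_hz * instructions_per_cycle / requests_per_second

# Assumption: 5 requests/second, inferred from 1e9 / 200e6 in the parent comment.
REQUESTS_PER_SECOND = 5

# Parent's estimate: about 1 billion operations per second.
parent_budget = instructions_per_request(1e9, 1, REQUESTS_PER_SECOND)

# Reply's estimate: 2 GHz at roughly 2 instructions per cycle.
reply_budget = instructions_per_request(2e9, 2, REQUESTS_PER_SECOND)

print(f"{parent_budget:.0f}")  # 200000000
print(f"{reply_budget:.0f}")   # 800000000
```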

The slowness of their system can probably be blamed on slow database access, or some kind of initialization cost they're paying for every request (e.g. calling into a binary like in the old CGI days, or initializing the Python VM every time).

Like I said, something is fundamentally wrong with their approach.

Unindexed databases, databases far away from the front-end servers (in terms of network topology), or weird VM things with Python.

Something is bad, and if changing languages solved their problems, they are just sweeping an issue under the rug. It'll hit them in the face later, and harder. Be it developer knowledge, or architectural choices. It'll surface again, they'll (hopefully) be large, and the problem will sting harder too.

Unless you're a mathematician or theoretical physicist the gross of your CPU time will be spent waiting for IO. Reading from disk, writing to the network, synchronizing, etc.

They're probably just aggregating 2 or 3 APIs, maybe hitting a database and then adding it all together.

That description can apply to almost any and all web applications and is inherently IO-bound.

"the gross of your CPU time will be spent waiting for IO. Reading from disk, writing to the network, synchronizing, etc."

People just sort of chant this, yet... if you upgrade from Python to something on the faster end of the spectrum, like D, you are very likely to still experience significant speed up, in my experience, even if you don't touch IO access patterns. You're even more likely to see a real latency decrease. And that's before we start actually multithreading or anything.

For all the work done on them in the past few years, the dynamic languages remain slow, slow, slow.

I think people often don't look at the math very carefully... if you do, say, half a dozen DB queries each less than 1ms, but your entire web page is clocking in at 50 or 100ms of rendering, all numbers that are very easy to see in real life (such as my own personal Django blog, where I've carefully counted each DB access and carefully indexed all of them), you are not actually spending all your time in IO wait.
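The arithmetic above, written out. The numbers are the ones from the comment (half a dozen queries, each under 1 ms, against a 50 to 100 ms page); the function name is just for illustration.

```python
def max_io_fraction(query_count, max_query_ms, page_ms):
    """Upper bound on the share of page time spent waiting on DB queries."""
    return (query_count * max_query_ms) / page_ms

# Six queries at under 1 ms each, page rendered in 50 ms or 100 ms.
for page_ms in (50, 100):
    share = max_io_fraction(6, 1.0, page_ms)
    print(f"{page_ms} ms page: at most {share:.0%} in DB wait")
```

Even in the worst case here, close to 90% of the page time goes somewhere other than waiting on the database.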

One of the worst cases for a dynamic language is crawling a large object hierarchy, obtaining lots of tiny objects from them, and then merging them together in the end. You pay and pay and pay for the constant new object creation, reference count management, endless resolutions of methods, and all the other things dynamic languages are doing over and over and over (even when JITed).

Now, guess what "rendering a template" looks like internally.

Oh, and don't forget, if the DB returns in 1ms but your language reports the query took 5ms, you can't count the time it took your dynamic language to handle what came back from the DB as IO wait!
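One rough way to check this on your own code is to compare wall-clock time with CPU time for the same block. Time the process spends computing shows up in both; time spent blocked shows up only in wall-clock time. This is a sketch using Python's standard `time.perf_counter` and `time.process_time`. The helper name `measure` and the two toy workloads are made up, and `process_time` only covers the current process, so it will not see CPU burned inside a separate DB server.

```python
import time

def measure(func):
    """Return (wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def simulated_io_wait():
    # Stand-in for a blocking query: sleeping uses almost no CPU.
    time.sleep(0.05)

def simulated_result_handling():
    # Stand-in for turning rows into objects: pure interpreter work.
    rows = [{"id": i, "name": str(i)} for i in range(200_000)]
    return sum(len(r["name"]) for r in rows)

for work in (simulated_io_wait, simulated_result_handling):
    wall, cpu = measure(work)
    print(f"{work.__name__}: wall={wall * 1000:.1f} ms, cpu={cpu * 1000:.1f} ms")
```

If wall and CPU time are close, the block is not waiting on IO, whatever the ORM's query timer says.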

(I have to admit, I'm really done with the dynamic languages. It was fun when the megahertz went up every year, but now it's like wearing 20lb concrete shoes and trying to pretend that's not a problem, it doesn't affect my performance at all... and the 20lb is already after we cut it down from 40lb with all the JITs and stuff, which rhetoric notwithstanding simply do not produce anything like C-like performance in practice.)

I agree with you. There is a structural problem with the view put forward by the parent. If the validity of a view relies on I/O being the bottleneck, it encourages coding practices that keep the implementation I/O bottlenecked. Once you already believe such a thing to be true you have little motivation to challenge or overcome it.

It is not the case that I/O bottlenecks can always be overcome, but they often can be, depending on circumstances, provided one tries of course.

"Unless you're a mathematician or theoretical physicist the gross of your CPU time will be spent waiting for IO"

As someone working in VM research, I'm not entirely convinced this is true. Perhaps we should do some work to find out where most web applications really do spend their time.

Here's one real-world example: Rap Genius. They certainly aren't mathematicians or theoretical physicists, all they do is process text I think, but it appears they spend over half their time in the Ruby interpreter - not waiting on network or database (if I'm reading the graph correctly).

http://images.rapgenius.com/75aa2143d3e9bf6b769fc9066f6c40c8...

I can't find the blog post, and the above poster likely knows more than me.

I remember seeing a post several years ago saying that Xen increases the likelihood of cache misses by 25-50%. Thus while the CPU looks to be at 100% processing power, it's really waiting on RAM/cache.

I'm not convinced. There are benchmarks[0] that strongly suggest a choice of language is as much a factor in general performance (not merely I/O tasks) as platform (though these frequently go together) and hardware.

[0] http://www.techempower.com/benchmarks/#section=data-r9 (in which Python frameworks generally perform poorly compared to... anything else other than Ruby)

That entirely depends on your data size.

If you have just a few thousand pages, with a few thousand bytes each, on a normal (real) server, then no: your computer will keep everything in memory and will only touch the disk to save data. Also, if you have enough independent requests, throughput will not be network bound.

But CPython may still spend most of its time waiting for RAM. D is much better at this, as is PyPy.

This can be solved by minimizing blocking code, either by using actors (Erlang, Akka, etc) or just chaining callbacks on futures/promises and using monadic composition to avoid callback hell.
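A minimal sketch of the non-blocking approach, using Python's `asyncio` rather than Erlang or Akka, since Python is what the thread is about. The three API calls are simulated with `asyncio.sleep`; `fetch`, the service names, and the delays are all made up.

```python
import asyncio
import time

async def fetch(name, delay_s):
    """Hypothetical stand-in for a network call to some backend API."""
    await asyncio.sleep(delay_s)  # yields to the event loop while "waiting"
    return {"source": name, "items": 1}

async def handle_request():
    # Start all three calls at once instead of waiting for each in turn.
    results = await asyncio.gather(
        fetch("users", 0.05),
        fetch("posts", 0.05),
        fetch("ads", 0.05),
    )
    return sum(r["items"] for r in results)

start = time.perf_counter()
total = asyncio.run(handle_request())
elapsed = time.perf_counter() - start
print(f"total={total}, elapsed={elapsed * 1000:.0f} ms")
```

The waits overlap, so the request takes about as long as the slowest call rather than the sum of all three. It does nothing for time spent in the interpreter itself.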

For a concrete example, check out my Redis-based Twitter clone http://typesafe.com/activator/template/redis-twitter-clone, running at https://clockwork-semaphore.herokuapp.com/#.

>Unless you're a mathematician or theoretical physicist [...]

..or Dwarf Fortress player ;)

Oooh, thank you for reminding me. I recently upgraded to an i7-4790K, and I've been meaning to jump back into DF now that I should have much better single-threaded performance. I honestly haven't played since 2008 on my parents' home Pentium 3.

> That description can apply to almost any and all web applications and is inherently IO-bound.

If their system was truly just IO bound, then moving to D wouldn't help them.

That's not true. Well it's true in a very narrow technical sense, but it's not really true.

For example, the amount of housekeeping Python does in order to execute a function call is staggering. It leads to all sorts of nice functionality, but it is costly nevertheless (whereas C++/D does it almost entirely without housekeeping: either none at all, or 1 level of indirection).

Python has so many indirections for a function call that it hardly even makes sense to talk about it in numbers of indirections.

Assembly hello world on my machine: 86,607 CPU cycles (of which < 20 are actually in the program)
Syscalls used by the assembly version: 2 (write and exit)

Python hello world on my machine (.pyc was available): 59,099,731 instructions (including half a million branch misses)
Syscalls used by Python to execute 'print "hello, world"': 1139 (each of which causes a program reschedule)

These programs do the same thing. Programmers often forget that things they take for granted are not in fact free, they may not even be O(1). Memory allocation. Subprocess execution. Function calls in scripting languages. Syscalls. Writing to files. Allocation of bytes on disks. All of these things come at a really, really high cost, and most not even O(1) costs (e.g. memory allocation is O(N^2) on a busy server as long as things actually fit in main memory, and O(N^4) or even worse when using virtual memory).

Sadly using memory does not even have bounded complexity. At some point, just attempting to use virtual memory might cause virtual memory to be allocated just for the lookup. This is generally referred to as "thrashing" and you're very likely to have rebooted your machine before this completes because it'll be frozen for minutes, sometimes hours, if this happens.

Likewise the memory model is useful, but huge. Strings in C++ take one byte + the actual contents of the string. Strings in python take up 60 bytes + twice the length of the string. And that's assuming you just set a variable to the string. If you construct the string, the difference is going to be much bigger.
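If you want to check the string overhead on your own interpreter, `sys.getsizeof` reports it. The exact figures depend on the Python version, the build, and the characters in the string (CPython 3.3 and later pick a per-string character width), so no expected numbers are shown here. The helper name is made up.

```python
import sys

def str_size(n):
    """Size in bytes of an n-character ASCII string object, as CPython reports it."""
    return sys.getsizeof("a" * n)

fixed_overhead = str_size(0)
per_char = (str_size(1000) - str_size(0)) / 1000

print(f"empty string: {fixed_overhead} bytes")
print(f"per extra ASCII character: {per_char:.2f} bytes")
```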

The point here is that things that are io-bound (esp. memory bound) in python may be cpu-bound in C++ or D, simply because you avoid doing all the indirections that higher level languages do.

> The point here is that things that are io-bound (esp. memory bound) in python may be cpu-bound in C++ or D, simply because you avoid doing all the indirections that higher level languages do.

I think you mean the reverse: "things that are cpu-bound (esp. memory bound) in python may be io-bound in C++ or D".

How many instructions were executed after the interpreter was loaded into memory (a much more realistic analog to the twisted server model)?

Once loaded into memory, any program which is bound by IO to memory (i.e. moving the stack from memory to caches/registers) will show up in tools not as being IO bound, but CPU bound.

And yes, CPU bound programs will benefit greatly from moving the hotspots into a linked module written in C or Cython.

I have no problem with moving away from Python (I'm in the process of doing this myself), but the costs associated with rewriting an entire program (especially one complicated enough to only handle 50 requests per second) are nontrivial, and if there was simply a small CPU hotspot, it could have been smoothed away in a number of ways that don't involve learning a new language.

In short, everything points towards OP moving to D because of a personal desire instead of a real business case.

I used to write translators for computer languages. The biggest was PL/M to C. That was easy because PL/M had fewer constructs than C. I managed to recognize constant declarations and map them to #defines or consts, which actually made the code more readable.

But these days, languages have features that may be completely orthogonal to other languages. Automatic translation may not be possible. Still it would be by far the cheapest solution.

Slightly unrelated, but this confused me.

>1139 (each of which causes a program reschedule)

I thought system calls were packaged into the binary itself and didn't necessarily cause a job to be rescheduled, but just caused a context switch to take place, after which execution continues.

I thought rescheduling only happened on an interrupt, or when a thread reaches a blocked state.

Could you clarify this for me? I'm interested.

About 10 years ago I remember prototyping some code on Linux with a Perl script running a Java program as a "coroutine" (er, service) via request/response pairs over a socket (not HTTP). Then we moved it to AIX, where it was essentially unusable due to the lost time slice each time an IO sys call was made. On Linux, the remaining time slices were recovered and immediately used. On AIX, the time slice was simply lost until the next process scheduler tick. Ouch.

Technically they cause a context switch and a scheduler run upon return (I believe, not 100% sure), but you're right that this does not necessarily result in getting put on the back of the work queue.

I think what might be the case is that so many system calls result in blocking events (I/O, IPC, etc.) that it's safe to assume a system call will block.
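One way to poke at this empirically on Linux or another Unix is to count the process's own context switches around a burst of system calls. This sketch uses `resource.getrusage`, whose `ru_nvcsw` and `ru_nivcsw` fields report voluntary and involuntary context switches (meaning the process being switched off the CPU, not the user/kernel mode transition). The helper names and the choice of writing to `/dev/null`, a system call that should not block, are mine. No expected numbers are given because they depend on the kernel and on machine load.

```python
import os
import resource

def context_switches_during(func):
    """Return (voluntary, involuntary) context switches incurred while func runs."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    func()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)

def many_nonblocking_syscalls(count=10_000):
    # Each os.write is one write(2) system call; /dev/null never blocks.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(count):
            os.write(fd, b"x")
    finally:
        os.close(fd)

voluntary, involuntary = context_switches_during(many_nonblocking_syscalls)
print(f"10000 writes: {voluntary} voluntary, {involuntary} involuntary context switches")
```

If the voluntary count stays far below the number of calls, that suggests a system call by itself does not force the process off the CPU.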