by Matthias247 3423 days ago
I have worked in the network programming domain for the last few years and I also found that especially outsiders and newbies get too obsessed on pure performance figures. For networked stuff in particular there's another very important key metric, reliability, which is only seldom taken into consideration. However, reliability can have a huge impact on performance.

E.g. not implementing read/write timeouts lets you omit a lot of extra code (timer management, synchronization, cancellations), which improves performance. But it might bring a whole system to a stop if there are a few unresponsive clients. Or not implementing flow control through the whole chain and simply buffering at each stage can give a huge boost on the throughput metric. But sooner or later the system might run out of memory.
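To make the flow-control half of that concrete, here is a minimal sketch using a bounded `asyncio.Queue`. The names `producer`, `consumer`, `run_pipeline` and the buffer size of 8 are made up for illustration, not taken from any real protocol stack. The point is only that a full queue makes the producer wait, so memory stays bounded at the cost of some throughput.

```python
import asyncio


async def producer(queue, n_items):
    # put() suspends while the queue is full, so a slow consumer
    # throttles the producer instead of letting the buffer grow.
    for i in range(n_items):
        await queue.put(i)
    await queue.put(None)  # sentinel: no more items


async def consumer(queue, observed_sizes):
    received = []
    while True:
        observed_sizes.append(queue.qsize())  # record buffer occupancy
        item = await queue.get()
        if item is None:
            return received
        await asyncio.sleep(0)  # stand-in for slow downstream work
        received.append(item)


async def run_pipeline(n_items=1000, max_buffered=8):
    # Hypothetical limit: at most max_buffered items are ever queued.
    queue = asyncio.Queue(maxsize=max_buffered)
    observed_sizes = []
    producer_task = asyncio.create_task(producer(queue, n_items))
    received = await consumer(queue, observed_sizes)
    await producer_task
    return received, max(observed_sizes)


if __name__ == "__main__":
    items, peak = asyncio.run(run_pipeline())
    print(len(items), "items delivered, peak buffer occupancy:", peak)
```

With `maxsize=0` the queue would be unbounded, which is the "simply buffering at each stage" variant: faster on a benchmark, but nothing stops it from growing until memory runs out.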

I personally now see reliability as the number 1 thing you should achieve in a protocol implementation. Performance is of course also important, but should only be compared if all other parts are also comparable.


In response to the response that now got flagged twice:

I think you read a lot into the parent posts that wasn't there.

Let me restate what I believe to be the parent's meaning:

Many junior developers care too much about the "How quick is my normal execution path" form of performance. This is a bad measure for actual performance because the rare, error-related executions can have cascading effects, effectively blocking the entire network.

Allowing applications to wait indefinitely for a response, even if asynchronous, is something like a 'thread' leak where you start accumulating dead threads, eventually leading to slowdown. This would be one example.

Another would be weird broadcast storms that happen when a component fails.

Basically, consider cascading effects of errors when optimizing performance.
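As a rough illustration of the "wait indefinitely" case, here is a sketch that puts a deadline on a pending request with `asyncio.wait_for`. The helpers `call_with_deadline`, `silent_peer` and `prompt_peer` are hypothetical stand-ins, not part of any real client library; the assumption is just that the request is an awaitable that may never complete.

```python
import asyncio


async def call_with_deadline(make_request, timeout_s):
    """Await make_request(), giving up after timeout_s seconds.

    Returns the result, or None if the peer never answered. On timeout,
    wait_for cancels the pending request so it does not linger.
    """
    try:
        return await asyncio.wait_for(make_request(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def silent_peer():
    # Hypothetical unresponsive client: waits on an event nobody sets.
    await asyncio.Event().wait()


async def prompt_peer():
    # Hypothetical healthy client: answers immediately.
    return b"pong"


if __name__ == "__main__":
    print(asyncio.run(call_with_deadline(prompt_peer, 0.05)))
    print(asyncio.run(call_with_deadline(silent_peer, 0.05)))
```

Without the deadline, every call to something like `silent_peer` would stay pending forever, which is the accumulation of dead waiters described above.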

Projects where 'performance' is taken to be "how quick is my usual case execution path".

Thank you, but I read their comment carefully, and I'd like to let this person (Matthias247) speak for themselves. (I've asked mods to unflag my comment.) I hope they will respond.

To reply to the take on their comment that you've just written: I'm not talking about the decisions junior engineers make. I'm talking about the decisions senior architects, who whiteboard and diagram solutions as complicated as necessary (which is the correct approach), make. They are making the wrong decisions, using the wrong trade-offs. They are not doing their job well.

The specific issues you have paraphrased could be solved in a different way (I'll just quote what you just said: "something like a 'thread' leak". This has specific possible solutions). The point is, that way is not the way that has been chosen, due to bad, incorrect, wrong decisions.

It's not that there are leaks or bugs (I'm not talking about the work of junior engineers). It's that the chosen, correctly implemented algorithm implements the wrong choices.

Let me give you an analogy: there is a very, very good sort algorithm called quicksort. It has very good behavior and is commonly used. It has excellent theoretical properties.

In its first naive implementation the worst case happens when an array is already sorted or nearly sorted. (http://www.geeksforgeeks.org/when-does-the-worst-case-of-qui...) [1] As a practical matter, sorting is often done in cases where the input might already be sorted or nearly so.

So it's not that the other cases don't need to be taken into consideration - after all even bubble sort works optimally when lists are already sorted....

It's that it's wrong to code quicksort by making the choices that ignore the most common case. Anyone coding the naive quicksort implementation I mentioned on data that is frequently already sorted or nearly sorted is not doing their job well.

In the case of network logic, the Wikipedia article I linked shows that it does not even have technical properties that mean it is theoretically correct under all network conditions. So it's even worse than a naive quicksort: it's broken for the most common case, and not theoretically correct (because that's not possible) for every case.

They simply need to wake up and change their trade-offs and priorities. For example, randomizing the sort order for quicksort of course adds steps, but at the same time it improves the most common condition (sorting an already-sorted or nearly-sorted array). Use this analogy and, yes, by God, code (and more importantly, architect) for the common case!
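A small sketch of the quicksort analogy, under the assumption that "naive" means always picking the first element as the pivot and "randomizing" means picking a random pivot (one common way to do it; shuffling the input first is another). The function names and the `work` counter are made up for illustration. The counter tallies how many elements get scanned during partitioning.

```python
import random


def quicksort(items, choose_pivot):
    """Return (sorted_list, work); work counts elements scanned while partitioning."""
    if len(items) <= 1:
        return list(items), 0
    pivot = items[choose_pivot(len(items))]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    left, left_work = quicksort(smaller, choose_pivot)
    right, right_work = quicksort(larger, choose_pivot)
    return left + equal + right, len(items) + left_work + right_work


def first_index(n):
    # Naive choice: always use the first element as the pivot.
    return 0


def random_index(n):
    # Randomized choice: pick the pivot position uniformly at random.
    return random.randrange(n)


if __name__ == "__main__":
    already_sorted = list(range(300))
    _, naive_work = quicksort(already_sorted, first_index)
    _, random_work = quicksort(already_sorted, random_index)
    print("first-element pivot:", naive_work)
    print("random pivot:", random_work)
```

On already-sorted input the first-element pivot peels off one element per level, so the work grows quadratically, while the random pivot stays near n log n on average. The random call is the "added step" that buys this.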

[1] http://www.geeksforgeeks.org/when-does-the-worst-case-of-qui...

EDIT: An earlier version of this comment has been flagged, but I stand by it and am addressing the parent poster. Feel free to disagree with me (feel free to comment), but I have communicated really clearly, and it is an important thing to communicate. See note at bottom.

The following is tough love:

>I have worked in the network programming domain for the last few years and I also found that especially outsiders and newbies get too obsessed on pure performance figures.

No need for the introduction, your attitude shows it all. It's why we all wait for 35 seconds while we watch a timer animation instead of getting a response instantly (200 milliseconds) and one time out of ten thousand having to resubmit a page and you having to deal with it. But by all means, 10000 * 35 seconds is only about 97 hours. I'm happy to wait 97 hours if it means I won't have a 1/10,000 chance of having to click Submit a second time - wouldn't you? Or even a one in fifty chance? I mean wouldn't you rather wait for 35 seconds, versus either getting an instant response (98% chance) or a 98% chance of an instant response the second time you try and a 98% chance of a response the third time you try? No brainer.

Who wouldn't love to wait, wait, wait, wait. It's my favorite part of using a computer! Waiting! I can anticipate how great it will be when stuff works. It reminds me of downloading over a 14.4 kbps modem (which due to the lack of web apps at the time was actually much faster in many cases, but thankfully you've fixed that.)

On your end you won't have to code up what happens when I do resubmit or don't get your response, which takes logic and math or a hand-coded edge case that civilization probably will never discover and could not possibly code. I mean how can a database possibly be set right if it ever gets a transaction twice or fails to get a transaction the user really did request? It doesn't make any sense! Would you ever tell a friend the same thing twice? Or would you just tell them once, and even if it takes them 3 weeks to get your invitation for Friday, at least you won't accidentally send it twice, embarrassing yourself and your friend, or, worse, having them show up twice. The real world shows that the trade-offs you network engineers make every day to give me 35-second web page experiences are the correct trade-offs. After all, it's my time, not yours.

/s
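For what "code up what happens when I do resubmit" could look like, here is a minimal sketch of deduplicating on a client-supplied request ID. `Ledger`, `submit` and `request_id` are hypothetical names, and a real system would persist the seen IDs rather than keep them in a dict; this only shows that receiving the same transaction twice can be made harmless.

```python
class Ledger:
    """Toy account that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result already returned

    def submit(self, request_id, amount):
        # A resubmission with a known ID replays the stored result
        # instead of applying the transaction a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        result = {"request_id": request_id, "balance": self.balance}
        self._results[request_id] = result
        return result


if __name__ == "__main__":
    ledger = Ledger()
    first = ledger.submit("req-1", 10)
    retry = ledger.submit("req-1", 10)  # user clicked Submit again
    print(first == retry, ledger.balance)
```

The client generates the ID once per logical action and reuses it on every retry, so a fast failure plus a resubmit does not risk a double charge.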

You people make the worst trade-offs ever. Your decisions suck. Your work sucks. The web sucks, because of you.

Change everything radically. Figure it out. Don't boast about newbies/outsiders not understanding - you don't understand the correct trade-offs.

Plus the Two Generals' Problem [1] shows that you can never write correct code on the theoretical level, so besides every single thing you do being practically broken, it's theoretically broken too. Everything you guys do is broken and sucks, theoretically as well as practically. Wake up already.

[1] https://en.wikipedia.org/wiki/Two_Generals'_Problem

----

Note: I took a very aggressive tone to counteract the complacency I quoted. My goal is to have the parent poster rethink their whole life (in the network programming domain). Please don't flag/downvote it if you want a better web tomorrow than we have today, because the parent and others like them are the ones responsible for this. Only they can wake up and start making the correct trade-offs. It gets so bad that I manually open a new tab, slowly type in Google, slowly re-authenticate, and go through the same action a second time, then close the (still loading) first tab, just because people like this person have made trade-offs that are so bad I have to work around them myself. Their decisions are wrong.

Reliability, the way network engineers have been moving toward coding for it for the past decade, is a false God. The approach is not correct. It must change if you want a better web tomorrow. Don't flag this (or at least reply to it), or you are complacent in the same thinking that the parent comment very explicitly shows. I have edited this comment considerably to be really clear, and gave multiple examples. As you can see I have 2546 karma and have been using HN for 1386 days. I stand by my criticism.