| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ziedaniel1 806 days ago

Very cool idea - but unless I'm missing something, this seems very slow.

I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.

    #include <iostream>

    int main() {
      int sum = 0;
      for (int i = 0; i < 1024*1024*1024; i++) {
        sum += i; 
      }
      std::cout << sum << "\n";
      return 0;
    }

4 comments

LightMachine 806 days ago

Bend has no tail-call optimization yet. It is allocating a 1-billion long stack, while C is just looping. If you compare against a C program that does actual allocations, Bend will most likely be faster with a few threads.

Bend's codegen is still abysmal, but these are all low-hanging fruits. Most of the work went into making the parallel evaluator correct (which is extremely hard!). I know that sounds "trust me", but the single-thread performance will get much better once we start compiling procedures, generating loops, etc. It just hasn't been done.

(I wonder if I should have waited a little bit more before actually posting it)

jay-barronville 805 days ago

> (I wonder if I should have waited a little bit more before actually posting it)

No. You built something that’s pretty cool. It’s not done yet, but you’ve accomplished a lot! I’m glad you posted it. Thank you. Ignore the noise and keep cooking!

phkahler 805 days ago

>> Bend has no tail-call optimization yet.

I've never understood the fascination with tail calls and recursion among computer science folks. Just write a loop, it's what it optimises to anyway.

fulafel 805 days ago

Computer science traditionally focuses on algorithms. Divide-and-conquer algorithms are frequently much more clearly expressed as recursion. Quicksort is the classic example, also various tree operations.

runeks 804 days ago

I've never understood the fascination with programming languages among computer science folks. Just write machine code directly, it's what it compiles to anyway.

nneonneo 806 days ago

If they’re low-hanging fruit, why not do that before posting about it publicly? All that happens is that you push yourself into a nasty situation: people get a poor first impression of the system and are less likely to trust you the second time around, and in the (possibly unlikely) event that the problems turn out to be harder than you expect, you wind up in the really nasty situation of having to deal with failed expectations and pressure to fix them quickly.

LightMachine 806 days ago

I agree with you. But then there's the entire "release fast, don't wait before it is perfect". And, then, there's the case that people using it will guide us to iteratively building what is needed. I'm still trying to find that balance, it isn't so easy. This release comes right after we finally managed to compile it to GPUs, which is a huge milestone people could care about - but there are almost no micro-optimizations.

naasking 806 days ago

That's how development under open source works. You can't please everyone.

nneonneo 806 days ago

There’s a big difference between developing something and announcing loudly that you have something cool; the developers have done the latter here.

vrmiguel 806 days ago

I think it's clearly pretty cool even if not as fast as people expect it to be

Ar-Curunir 806 days ago

Thats completely unfair. They have developed something cool, just with not all the holes plugged.

LightMachine 806 days ago

Dude we're running unrestricted recursion and closures on GPUs! If that's not cool to you, I apologize, but that mind-blowingly cool to me, and I wanted to share it, even though the codegen is still initial. Hell I was actually going to publish it with the interpreters only, but I still coded an initial compiler because I thought people would like to see where it could go :(

mst 804 days ago

The closure part was when I had to stop for a moment and go "wait, really?! ... COOL!" and I'm definitely going to try and remember to check back every so often (emphasis on 'try' given I have a brain like a sieve but still ;).

trenchgun 806 days ago

It is pretty cool milestone achieved, just not production ready.

adw 805 days ago

This is very cool and it's being treated unfairly, though it's also obviously not ready for prime time; it's an existence proof.

To illustrate that, many people on here have been losing their mind over Kolomogorov-Arnold Networks, which are almost identically positioned; interesting idea, kind of cool, does what the paper claims, potentially useful in the future, definitely not any use at all for any real use-case right now.

(In part that's probably because the average understanding of ML here is _not_ strong, so there's more deference and credulousness around those claims.)

nneonneo 806 days ago

You might want to double check with objdump if the loop is actually vectorized, or if the compiler just optimizes it out. Your loop actually performs signed integer overflow, which is UB in C++; the compiler could legally output anything. If you want to avoid the UB, declare sum as unsigned (unsigned integer overflow is well-defined); the optimization will still happen but at least you’ll be guaranteed that it’ll be correct.

ziedaniel1 806 days ago

I did make sure to check before posting.

Good point about the signed integer overflow, though!

molenzwiebel 806 days ago

If compiled with -O3 on clang, the loop is entirely optimized out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest comparison.

LightMachine 806 days ago

Exactly, this kind of thing always happens with these loops, which is why I think programs that allocate are fairer. But then people point out that the C allocator is terrible, so we can't make that point :')

mst 804 days ago

Might be worth seeing if e.g. jemalloc is enough less terrible for the sort of examples you're looking at to help with that.

(though note that I mention jemalloc because I've had "huh, this C code now magically runs faster" experiences with it, make no claim it's the right one to look at, and am very sure that I don't know what I'm talking about sufficiently wrt allocators to be able to recognise the right one if it bit me on the leg)

ziedaniel1 806 days ago

I used GCC and checked that it wasn't optimized out (which actually surprised me!)

rroriz 806 days ago

I think the point is that Bend in a much higher level than C++. But to be fair: I also may be missing the point!

gslepak 806 days ago

The point is that Bend parallelizes everything that can be parallelized without developers having to do that themselves.

5- 806 days ago

here is the same loop finishing in one second on my laptop, single-threaded, in a very high-level language, q:

  q)\t sum til floor 2 xexp 30
  1031