by dragandj 2256 days ago
What is flawed there? The point of the article is to show that CuPy very often does not accelerate NumPy, especially on consumer-grade hardware. This is something that most users of NumPy/CuPy do not know, and they are led by the docs to think it does.

The reason for that is that CuPy is poorly implemented. And CuPy is poorly implemented because it is constrained by what NumPy does, which, in turn, does stuff that is OK on the CPU, and often translates poorly to the GPU.


> What is /flawed/ there?

You're comparing float32 vs float64 computation. I don't need to tell you how much slower DGEMM is vs SGEMM esp. on the GPU (you mention this in the post yourself!).

NumPy does this for precision reasons, and CuPy simply follows its behavior. This is precisely why I noted that the float32 version runs 3x faster on the CPU.
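The promotion described here is easy to observe from the user's side. A minimal sketch, assuming only NumPy on the CPU and default arguments to `np.corrcoef` (the array shape and seed are arbitrary illustration values, not taken from the article):

```python
import numpy as np

# Build a small float32 matrix; rows are variables, columns are observations.
rng = np.random.default_rng(0)
x = rng.random((4, 100), dtype=np.float32)

# Call corrcoef with default arguments, as a typical user would.
r = np.corrcoef(x)

# The input stays float32, but the result comes back as float64.
print("input dtype: ", x.dtype)
print("output dtype:", r.dtype)
```

Checking `x.dtype` alone therefore says nothing about the precision used inside the function; the dtype of the result is the more telling check.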

> The reason for that is that CuPy is poorly implemented.

It's a cheap shot to call something 'poorly implemented' when you don't understand what you're benchmarking.

I still think I'd have to disagree with you. What was benchmarked was NumPy/CuPy, and the numbers in the article are not flawed. It isn't that the article uses NumPy/CuPy wrongly; that is how you would normally call it, and even if you try really hard to specify everything as float32, you will still get the same timings as in the article.

I agree that it would be interesting to compare it against a float64 version in Neanderthal as well.

That said, a flawed benchmark would mean to me that it isn't indicative of the performance one can expect when actually using the library on a real-world use case, but for now this benchmark for NumPy/CuPy does seem to be indicative of what you'd expect.

Now, the next question is, for model accuracy vs scale, is going with float64 coercion always the ideal trade-off? What if you still needed to squeeze out more performance? Is it really a bad idea to do so by going down to float32? Especially considering how much more a GPU can accelerate that?
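To make the accuracy side of that trade-off concrete, here is a small sketch of the kind of rounding that motivates promoting to float64. It assumes only NumPy; the value 2**24 is chosen because it is the point past which float32 can no longer represent every integer:

```python
import numpy as np

big = 2 ** 24  # 16777216: beyond this, float32 cannot represent every integer

# Adding 1 in float32 is lost to rounding; in float64 it is preserved.
f32 = np.float32(big) + np.float32(1)
f64 = np.float64(big) + np.float64(1)

lost_in_f32 = bool(f32 == np.float32(big))
kept_in_f64 = bool(f64 == big + 1)

print("float32 drops the increment:", lost_in_f32)
print("float64 keeps the increment:", kept_in_f64)
```

Whether this matters depends on the workload: long accumulations over large values are sensitive to it, while many models tolerate float32 without trouble.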

It's a lazy benchmark that compares two functions without clarifying that they do different things. If this article was pointing out that the numpy function had poor UX, that would be valid, but presenting a table that suggests Clojure does the same work 200 times faster is blatantly misleading. If you're going to write a custom implementation for one side of the benchmark you either need to (a) make sure it does the same thing as the other side of the benchmark or (b) write a custom implementation for both sides.

This is a reply to bearzoo, but we reached the depth level of the comment thread.

>> Nope, we work with float32.

>It seems that is not true.

Well, it is true. We work with float32. I explicitly checked that and NumPy/CuPy answered that the array is indeed float32. The fact, which I didn't know then, is that NumPy/CuPy internally decides on its own to use float64 without warning or possibility for us to order it not to do that. But, it is not I (or "us") that work with float64.

I would say that your comment would stand if I mistakenly ordered NumPy/CuPy to use float64 while claiming that I work with float32 (which could happen hypothetically if there was a non-obvious, but documented option to use float32 that I missed).

Yeah, you passed in float32. The function always promotes to float64 instead. You can argue that NumPy/CuPy should provide a purely float32 option, but instead you went with something like “look at how my expf beats their exp in benchmarks! What are you talking about ‘not the same function’, I made sure to pass float not double!” (a libc example to illustrate the point). And it’s not clear to me you realized the difference before akssri pointed it out, which renders the benchmarks pretty meaningless.

Where did you get that? Even the title refers to the main point: CuPy does not accelerate NumPy even in a case where it should absolutely be expected to. Then I used my implementation to demonstrate that a GPU implementation for such a huge matrix should indeed be many times faster.

I never claimed that my library aims for being a replacement for CuPy, or to have any compatibility with NumPy.

It would be more valuable to CuPy developers if I debugged CuPy to discover why that problem exists, but why should I be obligated to? I was writing this from the perspective of a user of these libraries.

Would it be an idea to benchmark Neanderthal with float64 as well, just to gather some data on it?

I agree with both of you; you’re each looking at this from a different perspective. It’s perhaps just better to gather timing measurements on a few variants with the trade-offs that each library has made, and how that affects implementation / speed.
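Gathering timings for a few variants could start from something like the sketch below. It is CPU-only and uses a plain matrix product rather than the article's workload; the helper name `time_matmul`, the matrix size, and the repeat count are all hypothetical choices for illustration. A GPU version would additionally need to synchronize the device before stopping the clock.

```python
import timeit

import numpy as np


def time_matmul(dtype, n=256, repeats=3):
    """Return the best-of-`repeats` wall time in seconds for one n x n matrix product."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n)).astype(dtype)
    b = rng.random((n, n)).astype(dtype)
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timeit.repeat(lambda: a @ b, number=1, repeat=repeats))


# Same operation, same shapes, only the precision differs.
timings = {
    "float32": time_matmul(np.float32),
    "float64": time_matmul(np.float64),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

The numbers depend entirely on the machine and BLAS build, so the point is the like-for-like structure of the comparison, not any particular ratio.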

The flaw, in my opinion, is not necessarily in the benchmark but in the claims and observations around the benchmark. For instance the blog flat out says:

> Nope, we work with float32.

It seems that is not true. The blog should make clear that it isn't straightforward or possible to get cupy to do this.

Can you please provide a benchmark for corrcoef where CuPy is noticeably faster than NumPy on your (and mine) GPU, Nvidia GTX 1080Ti?

I didn't get that from the article at all. There was no "that's because..." or "you need to do this to make it fast..." in the article. Instead, it's "Clojure is faster without additional work". While that's super neat and thank you for showing this, bashing Python because the library doesn't automatically do that and then hiding behind this silly argument isn't that enlightening.

You can check in the article that I made many checks to make sure that NumPy/CuPy get the data in float32. What do you suggest I do to instruct NumPy/CuPy to "automatically do that"? Is there a way to tell Python "I want you to use float32 precision for this computation" other than, well, providing everything as float32?
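One way to keep the whole computation in float32 is to write the correlation by hand instead of calling the library function. The following is a minimal sketch, assuming NumPy on the CPU; `corrcoef_f32` is a hypothetical helper written for this illustration, not part of NumPy or CuPy, and not the implementation used in the article:

```python
import numpy as np


def corrcoef_f32(x):
    """Row-wise Pearson correlation with every intermediate kept in float32."""
    x = np.asarray(x, dtype=np.float32)
    # Center each row (variable) around its mean.
    centered = x - x.mean(axis=1, keepdims=True)
    # Unnormalized covariance; the 1/(n-1) factor cancels in the ratio below.
    cov = centered @ centered.T
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


rng = np.random.default_rng(0)
data = rng.random((5, 1000), dtype=np.float32)

r32 = corrcoef_f32(data)   # stays float32 throughout
r64 = np.corrcoef(data)    # promoted to float64 by the library

max_diff = float(np.max(np.abs(r32 - r64)))
print("dtypes:", r32.dtype, r64.dtype)
print("largest absolute difference:", max_diff)
```

This does the same mathematical work at lower precision, which is the like-for-like comparison the thread is asking for; on small, well-scaled data the two results agree closely, but that is not guaranteed for every input.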

And even if your argument stands, that does not change the fact that CuPy does not accelerate NumPy (in this particular case, but I'd say often).

I didn't read it as bashing Python, only as bashing NumPy/CuPy.

If you look at the article, it is not even using Python anywhere, it uses NumPy/CuPy directly from Clojure. So it really is comparing Neanderthal vs NumPy/CuPy, and uses Clojure in both cases.

> I would love to improve any part of this article, if possible!

You should state that the reason the neanderthal version runs faster is that it is doing the computation in lower precision than numpy / cupy.

The article walks through an investigation of why cupy's result is underwhelming (is the input data accidentally fp64? is the input data on the cpu? is the computation happening on the cpu?), so you should finish it by explaining that numpy and cupy do the computation in fp64.