by akssri 2253 days ago
> What is /flawed/ there?

You're comparing float32 vs float64 computation. I don't need to tell you how much slower DGEMM is vs SGEMM esp. on the GPU (you mention this in the post yourself!).

NumPy does this for precision reasons, and CuPy simply follows its behavior. This is precisely why I noted that the float32 version runs 3x faster on the CPU.
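
For anyone who wants to check the promotion themselves, here is a minimal sketch. The array shape and seed are made up for illustration and are far smaller than the matrices in the article; it only shows that float32 input comes back as float64 from `np.corrcoef` by default.

```python
import numpy as np

# Hypothetical small input; the article's matrices are far larger.
rng = np.random.default_rng(0)
x32 = rng.random((4, 1000), dtype=np.float32)

# corrcoef is called with defaults only, as a typical user would.
result = np.corrcoef(x32)

print(x32.dtype)     # float32
print(result.dtype)  # float64
```

As far as I know, NumPy releases from 1.20 onward accept a `dtype` argument to `corrcoef`; treat that as an assumption to verify against the version you have installed.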

> The reason for that is that CuPy is poorly implemented.

It's a cheap shot to call something 'poorly implemented' when you don't understand what you're benchmarking.


I still think I'd have to disagree with you. What was benchmarked was NumPy/CuPy, and the numbers in the article are not flawed. It isn't that the article uses NumPy/CuPy wrongly; that is how you'd normally call it, and even if you try really hard to specify everything as float32, you still get the same timings as in the article.

I agree that it would be interesting to compare it against a float64 version in Neanderthal as well.

That said, a flawed benchmark would mean to me one that isn't indicative of the performance you can expect when actually using the library on a real-world use case, but for now this benchmark for NumPy/CuPy does seem to be indicative of what you'd expect.

Now, the next question is: for model accuracy vs scale, is going with float64 coercion always the ideal trade-off? What if you still needed to squeeze out more performance? Is it really a bad idea to do so by going down to float32, especially considering how much more a GPU can accelerate that?

It's a lazy benchmark that compares two functions without clarifying that they do different things. If this article were pointing out that the NumPy function had poor UX, that would be valid, but presenting a table that suggests Clojure does the same work 200 times faster is blatantly misleading. If you're going to write a custom implementation for one side of the benchmark you either need to (a) make sure it does the same thing as the other side of the benchmark or (b) write a custom implementation for both sides.

This is a reply to bearzoo, but we reached the depth level of the comment thread.

>> Nope, we work with float32.

>It seems that is not true.

Well, it is true. We work with float32. I explicitly checked that, and NumPy/CuPy answered that the array is indeed float32. What I didn't know then is that NumPy/CuPy internally decides on its own to use float64, without warning and without any way for us to tell it not to. But it is not I (or "us") that work with float64.

I would say that your comment would stand if I mistakenly ordered NumPy/CuPy to use float64 while claiming that I work with float32 (which could happen hypothetically if there was a non-obvious, but documented option to use float32 that I missed).

Yeah, you passed in float32. The function always promotes to float64 instead. You can argue that NumPy/CuPy should provide a purely float32 option, but instead you went with something like “look at how my expf beats their exp in benchmarks! What are you talking about ‘not the same function’, I made sure to pass float not double!” (a libc example to illustrate the point). And it’s not clear to me you realized the difference before akssri pointed it out, which renders the benchmarks pretty meaningless.

Where did you get that? Even the main title refers to the main point: CuPy does not accelerate NumPy even in the case where it should absolutely be expected to. Then I used my implementation to demonstrate that a GPU implementation for such a huge matrix should indeed be many times faster.

I never claimed that my library aims for being a replacement for CuPy, or to have any compatibility with NumPy.

It would be more valuable to CuPy developers if I debugged CuPy to discover why that problem exists, but why should I be obligated to? I was writing this from the perspective of a user of these libraries.

Would it be an idea to benchmark Neanderthal with float64 as well, just to gather some data on it?

I agree with both of you; you’re each looking at this from a different perspective. It’s perhaps just better to gather timing measurements on a few variants, with the trade-offs that each library has made, and see how that affects implementation / speed.
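
A rough CPU-only sketch of what "a few variants" could look like. `corrcoef_f32` is a hypothetical helper I wrote for illustration, not anything from NumPy, CuPy, or Neanderthal, and the matrix size is an arbitrary small one. It does no timing on the GPU; it only puts a float64-promoting call and a float32-only version side by side so the numbers describe the same computation.

```python
import timeit
import numpy as np


def corrcoef_f32(x):
    """Row-wise Pearson correlation kept in float32 (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float32)
    centered = x - x.mean(axis=1, keepdims=True)
    # The 1/(n-1) factor cancels in the normalization below, so it is skipped.
    cov = centered @ centered.T
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


rng = np.random.default_rng(0)
x32 = rng.random((200, 2000), dtype=np.float32)  # assumed size, for illustration

variants = {
    "np.corrcoef (promotes to float64)": lambda: np.corrcoef(x32),
    "corrcoef_f32 (stays in float32)": lambda: corrcoef_f32(x32),
}

for name, fn in variants.items():
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms, dtype={fn().dtype}")
```

The timings will depend on the machine and the BLAS build, so I'm not quoting any. The point is only that each row of the table states which precision it actually ran in.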

The flaw, in my opinion, is not necessarily in the benchmark but in the claims and observations around the benchmark. For instance the blog flat out says:

> Nope, we work with float32.

It seems that is not true. The blog should make clear that it isn't straightforward or possible to get cupy to do this.

Can you please provide a benchmark for corrcoef where CuPy is noticeably faster than NumPy on your (and my) GPU, an Nvidia GTX 1080Ti?