by e-kayrakli 889 days ago
Re Intel support: That's definitely in our plans. However, there are also many other areas that we are actively working on to add more features, fix bugs and improve performance. When prioritizing we typically make decisions based on what our current and potential users might need in the language. Frankly, we are not seeing a big push for Intel GPU support so far. So, currently it is not near the top of our priorities. If you (or other readers) have any input on that matter where lack of Intel support might be a blocker for testing Chapel and/or its GPU support out, definitely let us know.

Re implicit serialization: To clarify, the serialization based on order-dependence is not implicit. Users should use a `for` loop if their loop is order-dependent, and `foreach` (or `forall`) if it is order-independent. In other words, the Chapel compiler doesn't make decisions about order-dependence. In particular, for GPU execution: a `for` loop will never turn into a GPU kernel.
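To make the distinction concrete, here is a minimal sketch (the array names and the size `n` are made up for illustration). The first loop reads the result of the previous iteration, so it has to be a `for`. The other two only touch their own element, so `foreach` or `forall` is appropriate:

```chapel
config const n = 8;   // illustrative problem size

var prefix: [1..n] int;
var squares: [1..n] int;

// Order-dependent: iteration i reads what iteration i-1 wrote, so this
// must be a `for` loop. It runs serially and never becomes a GPU kernel.
prefix[1] = 1;
for i in 2..n do
  prefix[i] = prefix[i-1] + i;

// Order-independent: each iteration writes only its own element, so the
// programmer can state that with `foreach`...
foreach i in 1..n do
  squares[i] = i * i;

// ...or with `forall`, which also asks for parallel execution.
forall i in 1..n do
  squares[i] += 1;

writeln(prefix);
writeln(squares);
```

The choice of loop keyword is the programmer's statement about order-dependence; the compiler does not try to infer it.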

There are, however, some cases where a `foreach` does not turn into a kernel. You may be referring to those cases, but they are not related to order-dependence. Some Chapel features cannot execute on a GPU. If your `foreach` loop's body uses any of those features, then it will not be launched as a kernel even though `foreach` signals order-independence. Some of the features that make an order-independent loop GPU-ineligible are there only because we haven't gotten a chance to properly address them yet. Others will remain thwarters for a longer time, maybe forever. For example, your `foreach` loop could be calling an extern host function.


Thanks for joining us! I enjoyed studying Chapel a while back. I have a few questions about it.

Most AI in the press is done on expensive NVIDIA GPUs. Many papers describe techniques with lower cost or higher effectiveness, but their algorithms are given in high-level form with little or no implementation. Many people in OSS and outside deep learning are using smaller models that can run on diverse hardware, if one has the expertise to program it. It would be helpful to have a language that maps the high-level techniques in papers to diverse hardware for use in training or review.

Can Chapel currently implement the concepts in papers on NNs and LLMs to run on multicore, clusters, and GPUs? If so, can it implement hybrids where specific functions are GPU-optimized but the overall design is split across machines? If not, what is missing for using Chapel for rapid prototyping of AI concepts?

These are great questions, and ones we’re very curious about as well. I don’t believe that our current Chapel team has much experience programming NNs and LLMs, having focused on other areas. That said, I’m also not aware of any intrinsic barriers to implementing such algorithms in a portable way within Chapel, potentially calling out to vendor-optimized implementations when available and appropriate.

If you, or others, would be interested in exploring this topic, we’d be very interested in either partnering with you or supporting your efforts.

(Also see Engin's response about programming tensor cores for some thematically related thoughts: https://news.ycombinator.com/item?id=39020703 )
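On the hybrid question specifically (overall design split across machines, with specific pieces running on GPUs), the usual shape in Chapel is nested `on` statements. The following is only a rough sketch, not an NN or LLM implementation: it assumes a GPU-enabled Chapel build (`CHPL_LOCALE_MODEL=gpu`), and the array `A`, the size `n`, and the `gpusUsed` counter are made up for illustration.

```chapel
config const n = 1_000_000;   // illustrative per-GPU problem size

var gpusUsed: atomic int;     // illustrative counter, lives on locale 0

// One task per machine (locale) in the job...
coforall loc in Locales do on loc {
  // ...and, on each machine, one task per GPU attached to it.
  coforall gpu in here.gpus do on gpu {
    var A: [1..n] real;       // allocated in this GPU's memory

    // Order-independent loop: eligible to be launched as a GPU kernel.
    foreach i in 1..n do
      A[i] = i * 2.0;

    gpusUsed.add(1);
  }
}

writeln("GPUs used across ", numLocales, " locale(s): ", gpusUsed.read());
```

Anything outside the inner `on gpu` block runs on the node's CPU cores, so CPU-side and GPU-side parts of a design can live in the same program.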

> If your `foreach` loop's body uses any of those features then it will not be launched as a kernel even though `foreach` signals order-independence.

Is this signalled/warned about so that you don’t accidentally use one of these features and kill your performance? Or is there a way to indicate that you specifically intend a loop to run on the GPU?

I'll also add to @danilafe's reply that we have a GpuDiagnostics module which would count kernel launches or report them as they occur in a section of code. Something like that can be used to debug parts of your code where you do or don't expect kernel launches to occur. See https://chapel-lang.org/docs/main/modules/standard/GpuDiagno...
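A minimal sketch of how that might look, assuming a GPU-enabled build with at least one GPU on the locale (the array `A` and size `n` are illustrative, and the exact counts reported depend on the Chapel version and configuration):

```chapel
use GpuDiagnostics;

config const n = 1000;   // illustrative problem size

// Count GPU-related events for the code between start and stop.
startGpuDiagnostics();
on here.gpus[0] {
  var A: [1..n] int;
  foreach i in 1..n do   // expected to be launched as a kernel
    A[i] = i;
}
stopGpuDiagnostics();

// One record per locale, including the number of kernel launches seen.
writeln(getGpuDiagnostics());
```

Swapping the start/stop pair for `startVerboseGpu()` and `stopVerboseGpu()` reports events as they happen instead of counting them.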
It's not signaled / warned about by default. However, if you want to make sure your `foreach` loop runs on the GPU, you can use the `@assertOnGpu` attribute. For instance, the following program would not compile:

  @assertOnGpu
  foreach i in 1..10 { /* Do something that can't be done on a GPU. */ }

The compiler will print a message explaining why the loop was not eligible for GPU execution.

What about Apple Silicon / Metal support?
Thanks for bringing this up. I posted an answer to the same question here: https://news.ycombinator.com/item?id=39009566

But seeing that we already have two questions about it makes me wonder whether we should give this more weight when we are prioritizing work.

This is going to sound extreme, but it is true: I personally and professionally won’t use it for anything until it has Metal support.

The simple fact is that my team and I do our development largely on MacBook Pros and Mac Studios. We have some GPU rigs for running production code on Linux or Windows with NVIDIA GPUs, but all of our developer tooling is on macOS. Anything that can’t run natively on recent Apple hardware gets second-class support.

That doesn't seem extreme to me, as I generally feel similarly. If you (or other readers) are genuinely interested in using Chapel with Metal, please open an issue on our GitHub repository capturing your request, as that would be valuable to us.

Just to make sure it didn’t get lost, note that it is possible to develop GPU code in Chapel on a MacBook using the cpu-as-device mode Engin mentions above, and then deploy it on NVIDIA GPUs on production systems by recompiling. This is how I develop/debug GPU computations in Chapel.
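As a rough sketch of that workflow: the same source is built twice, with only the environment settings changing. The settings in the comments are the ones I'd expect to use (treat them as assumptions and check the GPU docs for your version), and the arrays `Host` and `Dev` are made up for illustration.

```chapel
// Assumed build settings, set in the environment before building Chapel:
//   CHPL_LOCALE_MODEL=gpu
//   CHPL_GPU=cpu      on the MacBook (cpu-as-device mode)
//   CHPL_GPU=nvidia   on the production system
config const n = 16;   // illustrative problem size

var Host: [1..n] int;  // lives in CPU memory

on here.gpus[0] {
  var Dev: [1..n] int; // lives on the device (or the CPU standing in for it)

  // Order-independent loop; a real kernel launch when built for NVIDIA.
  foreach i in 1..n do
    Dev[i] = i * i;

  Host = Dev;          // copy the result back to CPU memory
}

writeln(Host);
```

Nothing in the source mentions the vendor, which is what makes the recompile-and-deploy step possible.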