by atairov
1016 days ago
Hi. Thanks for commenting on this.
You're correct: llama2.c was built with the runfast target, which doesn't spread work across cores via OpenMP. That kept the comparison fair, since the Mojo version didn't use the parallelize helper either.
I think one of the reasons llama2.c isn't performing better is that it doesn't have explicit SIMD support so far. It seems that a SIMD implementation would make run.c considerably more complex, while the essential purpose of llama2.c is educational. On the other hand, llama2.mojo, like the Mojo ecosystem as a whole, is also in its early stages. I'm researching how to implement the full set of improvements offered by Mojo.
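
To make the OpenMP point concrete, here is a minimal sketch of a matrix-vector multiply of the kind being discussed. It is an illustration, not code copied from run.c: the name `matmul_sketch` and the row-major layout of `w` are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: computes xout = W @ x.
 * Assumes W is stored row-major as d rows of n floats each. */
void matmul_sketch(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    /* This pragma only has an effect when the compiler is run with OpenMP
     * enabled (e.g. -fopenmp); otherwise it is ignored and the loop runs
     * on a single thread. */
    #pragma omp parallel for
    for (i = 0; i < d; i++) {
        float acc = 0.0f;
        /* Plain scalar dot product: no explicit SIMD intrinsics here, so any
         * vectorization is left to the compiler's optimizer. */
        for (int j = 0; j < n; j++) {
            acc += w[(size_t)i * n + j] * x[j];
        }
        xout[i] = acc;
    }
}
```

Built without OpenMP, every row is computed on one core, which matches the single-threaded setup described above. Each output row is independent, so splitting the outer loop across threads needs no locking.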
|
I noticed that it says mojo is using six threads. Is that across cores or is it something else? Do you know what it's running in different threads?
I also saw some discussion in the llama2.c issues about using BLAS for the matmul. I'd be curious to know what speedup this gives.