by atairov 1016 days ago
Hi, thanks for commenting on this. You're correct: llama2.c was built with the runfast target, which doesn't spread work across cores via OpenMP. That keeps the comparison fair, since the Mojo version didn't use the parallelize helper either. I think one reason llama2.c isn't performing better is that so far it has no SIMD support. An explicit SIMD implementation would likely make run.c considerably more complex, while llama2.c is meant primarily as an educational project. On the other hand, llama2.mojo, like the Mojo ecosystem itself, is still in its early stages. I'm researching how to implement the full set of improvements Mojo offers.
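For context, here is a rough sketch of the kind of matrix-vector loop being discussed. It is not copied from run.c: the signature and names are illustrative assumptions. The point is that the OpenMP pragma only takes effect when the compiler is given -fopenmp; without that flag (which is what a runfast-style build is assumed to do here) the pragma is ignored and the loop runs on a single thread. Any SIMD then comes only from compiler auto-vectorization.

```c
#include <stddef.h>

/* Illustrative sketch, not the actual llama2.c code.
 * Computes xout = W * x, where W is d rows by n columns (row-major). */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    /* Takes effect only when compiled with -fopenmp; otherwise the
     * pragma is ignored and the rows are processed sequentially. */
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        /* Plain scalar dot product: no explicit SIMD intrinsics, so any
         * vectorization is left to the compiler's optimizer. */
        for (int j = 0; j < n; j++) {
            val += w[(size_t)i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Building the same file with and without -fopenmp should give the same results, with only the threading differing, which is why the build target matters for a fair comparison.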
1 comments

Thanks for clarifying. I'm interested in what C is leaving on the table in terms of performance. I saw your GitHub implementation, and I'd suggest submitting it as a Show HN if you haven't already. (Looks like you did submit it; try again with a "Show HN:" prefix and maybe more people will notice.)

I noticed that it says Mojo is using six threads. Is that across cores, or is it something else? Do you know what it's running in the different threads?

I also saw some discussion in the llama2.c issues about using BLAS for the matmul. I'd be curious to know what speedup this gives.

I don't have much context on BLAS. People are trying to optimize the code as much as possible, but some optimizations aren't approved for merging because they would make the code too complex to understand.