| HN Mirror



	by atairov 1016 days ago
	Hi. Thanks for commenting on this. You're correct: llama2.c was built with runfast, which doesn't run across cores via OMP. That kept the comparison fair, since the parallelize helper wasn't used in Mojo either. I think one of the reasons llama2.c isn't performing better is that so far it has no support for SIMD instructions. It also seems that a SIMD implementation would add a lot of complexity to run.c, while the essential purpose of llama2.c is educational. On the other hand, llama2.mojo, like the Mojo ecosystem itself, is still in its early stages. I'm researching how to implement the full set of improvements offered by Mojo.
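
For readers following the OMP point, here is a minimal sketch of a matrix-vector product in the style of run.c. It is an illustration written for this thread, not a copy of the project's source: the name `matmul`, its argument order, and the row-major layout are assumptions. The pragma only has an effect when the compiler is invoked with OpenMP enabled (for example `-fopenmp` with gcc or clang); otherwise it is ignored and the outer loop runs on a single core, which is the situation described for the benchmark.

```c
#include <stddef.h>

/* Sketch only: xout = W * x, where W is (d x n) in row-major order,
 * x has n elements and xout has d elements.
 * Names and layout are assumptions made for this illustration. */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    /* With OpenMP enabled at compile time, rows are split across threads.
     * Without it, the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[(size_t)i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

The inner loop is plain scalar code. Whether it ends up using SIMD instructions depends on the compiler's auto-vectorizer and flags, not on anything explicit in the source.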

1 comments

version_five 1016 days ago

Thanks for clarifying. I'm interested in what C is leaving on the table in terms of performance. I saw your GitHub implementation; I'd suggest submitting it as a Show HN if you haven't already. (Looks like you did submit it. Try again with "Show HN:" in the title and maybe more people will notice.)

I noticed that it says Mojo is using six threads. Is that across cores or is it something else? Do you know what it's running in the different threads?

I also saw some discussion in the llama2.c issues about using BLAS for the matmul. I'd be curious to know what speedup this gives.
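
As a rough idea of what a BLAS-backed matmul could look like, the sketch below swaps the hand-written loop for a single `cblas_sgemv` call. This is not taken from the llama2.c issues. The function name `matmul_blas` and the `USE_CBLAS` macro are hypothetical, and the code assumes a CBLAS implementation (such as OpenBLAS) is installed and linked when that macro is defined. Without the macro it falls back to a plain loop, so it still compiles and runs on its own. No speedup is claimed here; that would have to be measured.

```c
#include <stddef.h>

#ifdef USE_CBLAS          /* hypothetical opt-in macro for this sketch */
#include <cblas.h>
#endif

/* Sketch only: xout = W * x, W is (d x n) row-major. */
void matmul_blas(float *xout, const float *x, const float *w, int n, int d) {
#ifdef USE_CBLAS
    /* y = alpha * A * x + beta * y, with A of size d x n and lda = n */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n,
                1.0f, w, n, x, 1, 0.0f, xout, 1);
#else
    /* Fallback: the same product written as a scalar loop */
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[(size_t)i * n + j] * x[j];
        }
        xout[i] = val;
    }
#endif
}
```

Keeping the BLAS path behind a compile-time switch is one way to leave the default build dependency-free, which fits the project's stated educational focus.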

atairov 1015 days ago

I don't have much context on BLAS. People are trying to optimize the code as much as possible, but some optimizations aren't approved for merging because they would make the code too complex to understand.