How Mojo gets a speedup over Python – Part 2 (modular.com)
94 points by CoreyFieldens 1032 days ago
17 comments

I'm really interested in Mojo not for its AI applications, but as an alternative to Julia for high performance computing. Like Julia, Mojo is also attempting to solve the two-language problem, but I like that Mojo is coming at it from a Python perspective rather than trying to create new syntax. For better or for worse, Python is absolutely dominating in the field of scientific computing, and I don't see that changing anytime soon. Being able to write optimizations at a lower level in a Python-like syntax is really appealing to me.

Furthermore, while I love Julia the language, I'm disappointed in how it really hasn't taken off in adoption by either academia or industry. The community is small and that becomes a real pain point when it comes to tooling. Using the debugger is an awful experience and the VSCode extension that is the recommended way to write Julia is very hit-or-miss. I think it would really benefit from a lot more funding that doesn't actually seem to be coming. It's not a 1-to-1 comparison, but Modular has received 3 times the amount of funding as JuliaHub despite being much younger.

They already failed once with Swift for Tensorflow, so I am currently curious if there will be some lessons learned from that effort.

For the time being, my chips are still on the Julia horse.

I was responsible for the S4TF effort at Google. In my opinion, it validated that some of the ideas are good (e.g. Graph Program Extraction is the algorithm that torch dynamo uses internally), that an efficient compiled language has benefits etc. However, I also learned that it should not be based on Swift and should not be based on TensorFlow. Other than those two things, everything is great ;-)

More on GPE if you're curious: https://llvm.org/devmtg/2018-10/slides/Hong-Lattner-SwiftFor...

Thanks for replying, and the clarification.
I’m a huge Julia fan, you can take a look at my posting history. I love Julia’s syntax, and some of its language ideas.

…BUT…

For my personal tastes, Mojo’s lack of garbage collection, Rust-like memory safety, and attention to ahead-of-time compilation put it way ahead. The vast pool of Python developers who can easily pick it up if interested is a big plus.

Julia is aimed at a somewhat different space, but there’s also a huge overlap.

Let’s hope for good interoperability between the two, it seems fairly straightforward…

Let's see how it plays out, given that they are focused only on AI workloads, and somehow those VCs want their money back, which doesn't appeal to everyone.

I acknowledge that there is finally pressure in the Python community to tackle performance, but I don't see Mojo being the solution unless there is something that will make it go wild.

Right now, I see that more likely with Facebook, NVidia, Intel and Microsoft efforts.

At least they included numpy in this one. On their last post, after all their optimizations, numpy.matmul() produced almost the exact same throughput as their most optimized example. Would still need to dig in to see if this one has issues. Benchmarks are always such a minefield.
matmul is a wrapper for BLAS. If you're faster than BLAS you're beating handwritten assembler code specialized per CPU architecture.
But people use numpy for matrix multiplies in Python. Unless they are claiming to be 35k times faster on general-purpose code, the 35k number is absurd.
A lot of ugly, unreadable code has come into existence because of the need to twist it into NumPy calls. If you can replace these with good old for loops and achieve similar performance, then you've already won. Besides that, there is a lot of code that involves looping that isn't matrix multiplication or covered by NumPy.
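To make the point about loops that NumPy does not cover concrete, here is a minimal sketch (not from the article) of a recurrence with a loop-carried dependency. The function name `clamped_running_balance` and its parameters are hypothetical, chosen only for illustration. Each step depends on the clamped result of the previous one, so a plain cumulative sum gives a different answer and the logic is awkward to express as whole-array operations.

```python
def clamped_running_balance(deltas, floor=0.0, cap=100.0):
    """Running total that is clamped to [floor, cap] after every step.

    The clamp feeds back into the next iteration, so this is not the same
    as computing a cumulative sum first and clamping afterwards.
    """
    balance = floor
    history = []
    for delta in deltas:
        balance += delta
        if balance < floor:
            balance = floor
        elif balance > cap:
            balance = cap
        history.append(balance)
    return history


# A plain cumulative sum of these deltas would be [50, 130, -70, -40].
print(clamped_running_balance([50, 80, -200, 30]))  # [50.0, 100.0, 0.0, 30.0]
```

In plain Python this runs at interpreter speed, which is the situation the comment describes: the readable version is a for loop, and the fast version usually means moving the loop into a compiled language or tool.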
Right, but the point is that the optimizations didn't require an entirely new language; you just take the core logic and write it in an existing language that has decades of optimizations. If you're doing math, there's likely a natural, well-defined interface that can be used, so you just call that interface from Python, which has historically always been the point of 'glue' languages :)
I'm pretty excited about Mojo and have been keeping an eye on its development. I feel like the team has learned a lot from their experience, and are taking the best from languages like Python, Rust, Swift, Hylo (formerly known as Val), and are taking a really nice pragmatic approach in implementing them so that the language is approachable, but also very safe and fast. Once it's out, I hope someone sits down and makes a SwiftUI-like cross platform UI library with it ;).
Yeah, I've been following and am interested too.

Actually more interested in things like UIs, quick API servers, stuff like that than the AI/ML use cases. The idea of most of the ease and approachability of Python, a proper type system, and access to the entire ecosystem of Python libs in a compiled language is pretty compelling.

I agree, I'm excited to use it as a General Purpose language, and see how far the Autotuning feature can go for just normal old apps and servers.
Still waiting to see if all of this will be another Swift for Tensorflow, or actually make a difference.
35Kx speedup is not scaled speedup. Throw this naively parallelizable task at a bigger computer and get 70kx speedup, etc.
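The arithmetic behind that remark can be sketched as follows. This is only an illustration under two stated assumptions: the 88-core figure is the one cited elsewhere in this thread for the Mandelbrot demo, and scaling is treated as perfectly linear in the core count, which real machines rarely achieve. The function names are hypothetical.

```python
def implied_per_core_speedup(total_speedup, cores):
    """Per-core factor implied by a headline number, ASSUMING linear scaling."""
    return total_speedup / cores


def projected_speedup(per_core, cores):
    """Headline number on a machine with a different core count (same assumption)."""
    return per_core * cores


per_core = implied_per_core_speedup(35_000, 88)
print(round(per_core, 1))                        # 397.7
print(round(projected_speedup(per_core, 176)))   # 70000
```

Under those assumptions, doubling the cores doubles the headline figure while the per-core factor (roughly 400x) stays the same, which is why the comment calls the 35Kx number unscaled.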

While i think there are tons of optimizations to be done for python (looking at you GIL) giving access to low level cpu primitives is not one I think that will be broadly adopted by the python community. That's one of the joys of python: system agnostic, looks pretty close to pseudocode, coding. If you want speed, glue together a bunch of compiled code calls, and hope the call overhead isn't too large. Or write cpu intensive operations in numba, or pyrex. At the end of the day, mojo's pay to play programming language harkens back to the early 90's Borland days.

> 35Kx speed up is not scaled speed up.

Right. However, this is a comparison versus Python and the GIL, which can’t do that at all.

> While i think there are tons of optimizations to be done for python (looking at you GIL) giving access to low level cpu primitives is not one I think that will be broadly adopted by the python community.

It doesn’t need to be, any more than writing Numba or Pyrex is done on a large scale.

> That's one of the joys of python: system agnostic, looks pretty close to pseudocode, coding. If you want speed, glue together a bunch of compiled code calls, and hope the call overhead isn't too large. Or write cpu intensive operations in numba, or pyrex. At the end of the day, mojo's pay to play programming language harkens back to the early 90's Borland days.

The appeal is having a high level language that compiles to efficient machine (and GPU!) code. One can “drop down” to Python for non performance intensive parts.

I think this will be much more of a draw for people coming from C++, Fortran and other older, jankier languages. It looks to hit a sweet spot for real time embedded development VERY well, especially given Rust-like memory safety!

Mojo will also be a worthy competitor to Julia in the HPC scientific arena I think…we’ll see!

Have you played with Mojo? It really doesn’t feel high level.

I feel like JAX has been eating Julia’s lunch lately, making me think that there’s a real market for a small functional differentiable programming language with good Python interop - like a more polished Dex or Futhark.

> Have you played with Mojo?

Yes.

> It really doesn’t feel high level.

Does Python “feel high level”?

Mojo is a proper superset of Python.

Particular functions may deal with low-level machine features, that is unavoidable when extracting maximum performance from hardware. Mojo is pursuing some innovative ideas there, such as autotuning and adaptive compilation.

As I said in a different post, I don’t think Mojo’s main audience is the general Python community, it’s the AI, real time, embedded, safety critical, HPC, and yes, gaming, communities that’ll likely benefit the most.

> Mojo is a proper superset of Python.

Isn't it more that the plan is someday Mojo may be a proper superset of Python, but right now it is far from it? I just tried opening up the Mojo playground, copy/pasted the very first little example function from the official Python tutorial (see https://docs.python.org/3/tutorial/controlflow.html#defining...) and Mojo outputs a bunch of errors.

With Cython, our goal was to make it a proper superset of Python, and it was really difficult, but we got close.

> Right. However, this is a comparison versus Python and the GIL, which can’t do that at all.

Single process python does not take advantage of a multicore architecture but neither would single process mojo. Embarrassingly parallel operations like Mandelbrot can trivially be written with multiprocessing (https://github.com/DipanshuSehjal/Mandelbrot-set/blob/master...) or joblib to run in parallel in otherwise vanilla Python. It would be trivial to implement this in JAX and run on a GPU or TPU, but I wouldn't say that JAX is the reason for the speedup.
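A minimal sketch of the multiprocessing approach described above, using only the standard library. It is not the code from the linked repository; the grid size, iteration limit, and function names are hypothetical choices for illustration. Each row of the image is independent, so rows can be handed to separate worker processes, each with its own interpreter and GIL.

```python
from multiprocessing import Pool

# Small hypothetical grid so the example finishes quickly.
WIDTH, HEIGHT, MAX_ITER = 60, 24, 50


def escape_count(c, max_iter=MAX_ITER):
    """Number of iterations before |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def mandelbrot_row(y):
    """Escape counts for one row; no row depends on any other row."""
    imag = -1.2 + 2.4 * y / (HEIGHT - 1)
    return [
        escape_count(complex(-2.0 + 3.0 * x / (WIDTH - 1), imag))
        for x in range(WIDTH)
    ]


def mandelbrot_parallel(processes=None):
    """Distribute rows across worker processes (defaults to one per CPU)."""
    with Pool(processes) as pool:
        return pool.map(mandelbrot_row, range(HEIGHT))


if __name__ == "__main__":
    parallel_rows = mandelbrot_parallel()
    serial_rows = [mandelbrot_row(y) for y in range(HEIGHT)]
    print(parallel_rows == serial_rows)
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter. Note that this parallelizes across processes; the per-pixel loop inside each worker still runs at normal interpreter speed.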

> Single process python does not take advantage of a multicore architecture but neither would single process mojo.

That is exactly not the case. The Mandelbrot demo IS a single process, multi-threaded, SIMD-enabled Mojo program that uses 88 processor cores.

> At the end of the day, mojo's pay to play programming language harkens back to the early 90's Borland days.

I didn’t address this in my other post. Modular is about to release a freely available SDK. Also, the standard library sources will be open sourced shortly. There are hints of additional open source initiatives.

Modular’s main business plan appears to be adding value in the general area of AI, AI training, and AI deployment, including by offering SAAS. That plan in no way conflicts with (and in fact encourages) an open Mojo language ecosystem.

that is good to hear. I read a post on Mojo months ago, signed up to the waitlist and then crickets. It would seem insane to think a non-open source, non-free compiler/interpreter could be successful these days.
Mojo needs to demonstrate Hugging Face's AI libraries with Mojo acceleration. Nothing else will have the kind of impact that would have.

Throw a half dozen engineers at it, develop a deployment plan for SD XL, profit.

You'll get a ton of open source developers working on improving the Mojo versions even further once you release it, researchers developing extensions, etc. GO TO WHERE THE DEVELOPERS ARE.

Stable Diffusion is crazy compute heavy, so if Mojo is what it's purported to be, it should be possible to get speedups.

They lost me with the emoji for file extension. That’s not a world I want to live in.
This, while being an apparently superfluous complaint, would be important for eventual enterprise adoption.

Other languages have failed for less visible reasons.

You don’t have to, “.mojo” is equivalent.
But I use DOS...
You could get a vps on the cloud
How to access it over Netbios?
I don't understand the play here for Modular. If this is a worthwhile improvement that is broadly applicable, won't it at some point make its way into Python, numpy, etc?

In Java land we had a bunch of other JVMs over the years offering better performance. Most important things got absorbed into what is now OpenJDK, and the other JVMs, if they even exist at all, are niche players.

Performance is a huge focus in Python and ML lands right now, so why would this be any different?

They aren't just speeding up existing python code, they are making a superset of python which has additional performance features.

I guess it's possible that these features will be introduced into cpython etc. but I doubt it.

Just based on their website, I think selling Mojo as a faster Python-like language isn't intended to be their main product. They place a lot more emphasis on AI/ML acceleration than on Mojo, and on creating compatibility between different AI hardware acceleration systems.

I have the impression they hope vendors of AI acceleration hardware, clusters and cloud services will be their customers, to provide uniform and heavily backward-compatible cross-accelerator AI/ML APIs to those vendors' customers.

And hope that users of those services and hardware will also pay for high quality well-researched APIs that work reliably with many different AI/ML accelerators, even if Mojo is free. Similar to how RedHat provides value through commercial-grade QA and sustained development for Linux on high-end hardware, that would be complicated and risky to use otherwise.

If they've figured out how to deliver performance that Python might get around to in 5-10y, shouldn't they tout that, for people who might want that now?

Ultimately promoting the possibility for better performance, & current contrast, is good for prodding other languages/runtimes like Python to match these options. The "important things [get] absorbed" process you mention relies on teams making some "play for" alternatives, to create the impetus to get new things integrated.

Totally, just trying to understand why this is a $100MM of VC money investment. Is the market that big for this? (Honest question)
Modular is mainly focused on improving AI related workflows as its business model. That market is easily many $billions, and I think most expect the AI industry to experience explosive growth.
I feel like there’s 100m of VC money here because it’s Chris Lattner’s company and he’s the best compilers person in the world right now.
Most famous in Silicon Valley, maybe?

Kotlin is similar to Swift but arguably compiles much faster despite a suboptimal initial architecture, and avoids weird language/compiler specific problems never before seen, like expressions that time out whilst compiling.

Graal is similar to LLVM but can compile a far larger range of languages, is actually used for both JIT and AOT compilation (does anyone use llvm jit in prod?), and has many innovations LLVM never could have even tried to have.

So it's not really clear that he's the best compiler person in the world. More like, the people doing the other stuff aren't in California so don't get the same level of attention.

One of, yes.

Not the best, and already has failures like Swift for Tensorflow.

This is an investment in the team not the idea.
Plenty of OpenJDK alternatives still exist, just like there are several C and C++ compilers.
Cool, but it has very little to do with Python, except some similar looking syntax.

So for a Python programmer with a performance problem, it doesn't look like a solution.

They are also building in pretty serious Python interop. You should be able to at least somewhat mix the two or migrate gradually, and still use Python libs for less performance critical code (or if the libs do their performance critical stuff in C++ or whatever and are therefore fast enough).
I just want to see real un-hyped benchmarks. Comparing random Python native code makes no sense and seems dishonest, deterring me from actually trying out the tool.

I want a Python that can statically plan underlying GPU allocations, avoids CUDA kernel dispatch overhead and enables a multi-GPU API that isn't some multiprocessing abomination.

A Python with easy-to-use SIMD and multithreading sounds awesome!
Why is this a language superset of python rather than a python library? Genuinely asking and not trying to bash
That sounds intractable.

How would you differentiate mojo code from vanilla python without a ton of boilerplate at language boundaries?

As a high-performance computing person, I'm usually I/O bound, not compute bound. I wish someone would come up with a 10x speed up for disk and network I/O.
So TL;DR: Using SIMD and multithreading is faster than doing no optimization in python. The only real comparison here, without any optimization, is:

> The above code produced a 90x speedup over Python and a 15x speedup over NumPy as shown in the figure below:

Am I missing something?

Getting >10x speed up isn’t exciting enough for many people?

I’ll take it.

This is all pretty impressive if I can take my unmodified (slightly modified?) Python code and get that sort of improvement.

> This is all pretty impressive if I can take my unmodified (slightly modified?) Python code and get that sort of improvement.

it'll never work as smoothly as they advertise. just hands down, beyond a shadow of a doubt, their claims about supporting "unmodified" Python code are startup hype. how do i know? i could give you a bunch of technical reasons about Python as a language and CPython as the de facto implementation (thereby informing tons of code already written, re extensions) but there's a much simpler way to reason about it: because there are already >10 attempts at this and no one has been able to do it. there's no magic here that any number of dollars or brains could pull off. instead each such project picks a point on the pythonic<->performant design-space tradeoff curve and then asks/expects you to live with that choice.

and taking ^ into consideration, mojo is not that special. only thing going for it is chris lattner isn't bad at designing languages so maybe, on its own, it'll be a nice language (but it needs to be open to get any traction on its own).

It's not 10x but GraalPy can speed up unmodified Python by 3.4x on average:

https://www.graalvm.org/python/

And they've not been going at it that long. A few years at most.

graalpy does not fully support C extensions and will have just as hard a time extending support as anyone else. maybe even the hardest because they're plumbing through the JVM which, notoriously, has bad C FFI (at least until recently?).
It's incomplete but it does support C extensions and can run code with NumPy and other science modules.

Their approach is unique which is why it can work (they proved out the idea with ruby already). They compile the modules with LLVM and then extend the Python interpreter/JIT compiler with support for LLVM bitcode. So the JITC compiles both Python and C extensions together as one unit. The interpreter API is then virtualized so that code that looks like a structure read or method call from C is compiled directly down to the optimized machine code being used by the rest of the JITC. In this way the interop overhead can be optimized out.

This is all separate tech that goes well beyond a normal FFI. JNI doesn't even get involved at all.

> i could give you a bunch of technical reasons about Python as a language and CPython as the de facto implementation

Please do. I'm very interested.

> no optimization in python

Well, isn't that most Python? If Mojo can pave over the slow interpreted bits I repeatedly dig up in Python profilers, even well maintained projects, with no code changes, that would be huge.

So does this mean Swift and Metal offer the same if not better performance enhancements? SIMD is very much a first class citizen as a type there
No, Lattner learned from Swift and is avoiding anything except zero-cost abstractions.

Also, Swift isn’t very interesting outside the Apple ecosystem, and Metal doesn’t exist outside the Apple ecosystem. Mojo has a real shot at widespread, general-purpose, language adoption!

Good blog post. I do wonder how it would compare to an implementation in PyCUDA.
nit: The text says 743x but the graph (Figure 3) shows 527x
I don’t understand this from a goals perspective. What is an “AI compiler” - and why aren’t they comparing benchmarks with technologies more commonly used in AI?

I think I should be impressed, but I feel like I’m missing the point.

I guess the point is that getting the same performance in most other languages requires hundreds of lines of code. Here they are ostensibly achieving that performance using very succinct code. That is pretty nice especially if it integrates well with Python.