What statically typed language would you suggest for machine learning and large data pipelines? I don't love Python, but it has by far the largest ecosystem.
It’s still dynamic in nature. But you can tune how much staticity you want. The spectrum goes from Python to C in terms of staticity. And with tools like JETLS.jl maturing you get a lot of the benefits of static analysis.
The data pipeline ecosystem is starting to rival that of R and Python. The fact that you can just use Julia functions while keeping the performance allows you to avoid those weird vectorization gymnastics. The ML ecosystem is also in a great state. JuMP.jl, Turing.jl, the whole SciML ecosystem, autodiff and GPU computing are all close to best in class in terms of quality.
The NN side of ML is a bit weaker, but just for lack of developer time investment into that side of the ecosystem.
I use Julia! I like it a lot, and add type parameters to all application code. But JET.jl does not feel anywhere close to the assurances I can get from a statically typed language (yet)
It is brutal. I can say from first-hand experience: The APIs for Pandas and NumPy are awful and insanely dynamic. As a result, it is frequently difficult to know what is allowed when calling a method. It is exhausting. Since many methods are "hyper-dynamic", many of the error messages are unhelpful.
Well, that's the curse of machine learning: since everyone uses Python you have to deal with Python. Even though Python isn't very nice when things start to get serious and you don't want to spend your time fiddling with noise just to make something work at scale.
I wish the ML/AI/LLM crowd would see that it is in their interest to get better developer ergonomics at scale. (I don't want to have to turn to C++)
The ML/AI ecosystem is a minefield, and pure Rust rewrites (Candle, Burn, ...) are still immature and incomplete. But I'm pretty sure we're eventually going to see the same uptake that's already happening in the data processing world.
The performance is not the (only) issue. The issue is the death by a thousand cuts involved in distributing Python programs without a two-page set of instructions that have to be followed to make it work. It rarely "just works" unless you can make a lot of assumptions about the environment it runs in. It's why I generally steer away from any application written in Python. It is going to be painful.
However my experience might be a bit different since I actually have to deal with Python at scale and in a fairly dynamic environment.
I need to run hundreds of Python programs, written by dozens of programmers, over many years, that speak to custom hardware and run on a remote site, in a production environment that has to work and with new versions of things coming in all the time. Some of these Python programs not only link with C libraries, but run external binaries because developers didn't have time to integrate them as libraries because it takes forever to make it work on all the different os/arch combinations (easier to just run the C code in subprocesses).
This runs on three different CPU architectures (we're trying to eliminate one of them), two different operating systems and a pretty wide mix of hardware and system configurations, much of the hardware being custom-built stuff. I need to insulate Python from all of that, because Python has a lot of exposed surface to the OS compared to a statically linked binary. (Roughly 100x the surface of statically linked binaries that don't link with libc -- which is evident from the insanely bloated OCI images that result from packaging what you need to run)
Modern compiled languages that have sorted toolchains make it pretty easy to produce "production grade" os/arch specific binaries that can survive almost everywhere. You build a statically linked binary for each architecture to overcome the challenges of varied Linux runtime environments (see Linus T's frustrations with Linux and software distribution - it's not like it is easy to begin with). Go and Rust do this well.
So you end up having to containerize everything in ephemeral containers to lock down the execution environment while retaining some speed. But of course it isn't that simple, because if you depend on access to weird hardware and/or you run on custom built machines you have to detect this and ensure the application inside the container gets access to the things it needs from the container. So you have to fix that.
In a way that is almost completely invisible to the developer.
All of this has to be understandable and _reduce_ complexity for developers and operators so at the very least you don't follow the Python philosophy of "just throw another layer of complexity on it and make the instructions another page longer".
40-50 kLOC of code later (in Go and Python, I have lost count), all written to try to make the problems go away, and I have something that is on the verge of actually being usable in a production environment for taming wayward Python code.
The easiest fix? If people would stop using Python just because they don't want to learn a language that can produce something that is easier to distribute to users.
Believe me, I have spent months now trying to make Python work properly in a challenging environment. The only way this "worked" before was by just lowering standards to where the definition of "works" is flexible enough to count daily dumpster fires as "nominal". And of course people don't care. Python fosters an "it works for me" mentality where people don't know and don't care what it is like to be on the receiving end.
90% of problems I have because of Python would just disappear if people used languages that can produce robust binaries with limited exposure to system peculiarities. But that kind of requires people to understand why it is a problem in the first place. And people generally don't bother to know.
No, you are right about it not being limited to Python. But for Python the common courtesies I am used to right out of the box tend to require extra effort on the part of the programmer. And «extra» doesn’t usually happen.
Even C, with its ancient, haphazard, ugly, fragile, awkward toolchain, can often trivially produce binaries that will just work with very little effort.
I have spent decades of my life writing tooling, libraries and infrastructure. And no matter where you go, developers only do the bare minimum if they can get away with it. That doesn’t mean they are bad people. It means tools and infrastructure have to be designed with acute awareness of reality.
Python has been around for 35 years. And it still hasn’t evolved things we should take for granted today despite its increase in adoption. To me that’s pretty fucking awful project governance.
Cython is a niche language for writing perf-critical bits inside your Python codebase. It's like C for people who don't want to learn C. At least that's how I treated it, when I had to write some stuff to make some numpy ops faster.
Cython is not in any real sense a replacement for a modern data/ml stack.
However, I think an ML-family language designed for machine learning would be nice, especially if the type system is extended to multidimensional array shapes. Pattern matching on array shapes would be rather nice. OCaml-style interactive mode for exploration and compiling for performance would be nice too.
LLMs are leveling the developer experience and productivity in a way that makes Python's strengths almost irrelevant, while it's still suffering from bad tooling (even with uv and friends) and poor performance.
AI/ML: interfacing with C++ libraries directly (or in Rust) is now a real option. For everything else, even 5 years ago I wouldn't have used Python, now there are even fewer reasons to do so. As far as I'm concerned the remaining use cases are notebooks and one-shot scripts.
> With writing code in english now, why have it use a slow weak language?
Because the feedback loop of writing a few lines of Python inside a Jupyter cell is much shorter than with your current favorite AI tool. It costs less too.
> What statically typed language would you suggest for machine learning and large data pipelines? I don't love Python, but it has by far the largest ecosystem.
Pay no attention to OP. It's nonsensical to even suggest you should migrate away from a whole tech stack just because you want to run static code analysis, especially when the argument is based on having too many static analysis tools to choose from. Utter nonsense.