by acmiyaguchi 1178 days ago
The language is not a limiting factor here. Python is an excellent scripting language, and it works perfectly well for distributed computation. The Python interface to Spark is a wrapper on the underlying Scala API. You don't lose out on performance when you're building up a lazy chain of computation that's executed by an engine written in a more performant language.
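To make the "lazy chain" point concrete, here is a rough toy sketch in plain Python. It is not Spark's or Fugue's API: `LazyPlan`, `square`, and `calls` are hypothetical names made up for illustration, and a plain loop stands in for the engine. The only point is that the chained calls merely record a plan, and nothing runs until `collect()` hands that plan to whatever executes it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class LazyPlan:
    """Hypothetical stand-in for a DataFrame-style API: it only records steps."""
    source: Iterable[int]
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, fn: Callable) -> "LazyPlan":
        # Nothing is computed here; the step is appended to the plan.
        return LazyPlan(self.source, self.steps + [("map", fn)])

    def filter(self, pred: Callable) -> "LazyPlan":
        return LazyPlan(self.source, self.steps + [("filter", pred)])

    def collect(self) -> list:
        # Assumption: in a real system this is where the plan would be handed
        # to the engine. Here a plain Python loop plays that role.
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every time the user function actually runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


plan = LazyPlan(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_collect = list(calls)  # still empty: building the chain ran nothing
result = plan.collect()             # execution happens only here
print(calls_before_collect, result)
```

Building the plan is cheap bookkeeping, so the speed of the language doing the bookkeeping matters little compared with the speed of whatever executes the plan.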

Fugue is a layer to abstract out these distributed computation backends, and it looks like a nice programming interface.

2 comments

Well said! Python can push down to other languages like Rust and C to speed things up. Python can serve as a great end-user interface.
Trying to wrap my head around Fugue and the comment explaining how Python is a good wrapper.

Does Fugue take advantage of each sublayer that already uses Arrow?

Yes, we do. We try to leverage Arrow as much as we can. And unlike many other computing frameworks, Fugue doesn't invent new data types; it just uses Arrow data types.
Python is not excellent in any domain it's used. But, yes, it's a problem with the language, on which I'll comment later.

First, present-day users of Python need to understand how and why Python came under the spotlight. There was always a standoff between programmers who created unimaginative huge programs full of drool and red tape, and programmers who wanted more freedom of expression, fewer strings attached. The latter group were usually the more savvy ones.

In a way very similar to how an art student might be spending months studying a model, using a whole bunch of pencils starting from 10B and ending in 10H, various chalks, coal sticks and so on... and would still produce a... "study of a model #80907", which is ugly, anatomically incorrect and just boring. And there's an accomplished artist who can just stick her finger into the chimney, grab some soot, and in a matter of minutes make a great drawing, which will be lively, expressive, you name it.

So... Python, and Perl before it, were the soot. The junk languages a more experienced programmer would go to just to show those boring Java programmers "how it's done". But the Python programmers who came in the next wave thought that soot is a good tool for learning how to make good drawings. And, today, we have academies full of students trying very hard to draw models with materials and instruments which are very inappropriate for the task. (In case you don't know anything about art education today: the example isn't that big of a stretch from what happened in it around the 70s-80s.)

---

I don't care if Python is glue code for Scala or C or Rust: it doesn't matter. Python, as a language, is inadequate for dealing with concurrency. It needs to remove a bunch of stuff before it can start adding stuff that can be used to that end. It's a language with a lot of mutation semantics which are hard to interpret / implement correctly (what would that even mean?) in a distributed context. It's a language with a lot of implicit stuff going on that is somewhat useful (but not useful enough) if you want quick and dirty "sketch" quality code, but that will be devastating in a distributed context.
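A small sketch of what the mutation problem can look like, under loud assumptions: there is no real cluster here, and `append_on_worker` is a hypothetical function that just pretends to be a remote worker by going through a `pickle` round-trip, which is one common way state gets shipped between processes. Locally, `list.append` mutates the caller's list; once the list crosses a serialization boundary, the same call mutates a copy and the caller never sees it.

```python
import pickle


def append_on_worker(payload: bytes) -> bytes:
    """Hypothetical 'remote' worker: it only ever sees a serialized copy."""
    items = pickle.loads(payload)
    items.append("added remotely")  # mutates the worker's copy only
    return pickle.dumps(items)


local_items = ["original"]

# Simulate sending the list to a worker and receiving its result back.
remote_result = pickle.loads(append_on_worker(pickle.dumps(local_items)))

print(local_items)    # the caller's list is unchanged
print(remote_result)  # the mutation exists only in the returned copy
```

Code that relies on in-place mutation being visible to the caller silently changes meaning here, and nothing in the language flags it.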

Things like decorators, context managers, imperative loops with break and continue, the error handling mechanism, threading -- all of this must go before Python can start to think about becoming a decent language for distributed systems. And probably more: I would need to research this in much more depth to tell for sure whether things like method calls, for example, would work well.

It's a waste of time to try to fit Python into distributed computation today. You will either have to put a humongous effort into purging the better half of the language (making all of the famed support libraries useless), or you will end up with a defective hodge-podge mess (which is all those famed support libraries are, including those which aim to do distributed programming in Python).

I guess you want to say MPI and C++ are better for distributed computation?
Well... no. MPI is an attempt to fit a saddle on a cow, when it comes to using it with C++. The whole reason it exists is that C or C++ would really, really suck when it comes to distributed programming. So, it tries to save the day with minimal casualties. It's not "the" solution to the problem. Having had to use Slurm / PBS and similar, my impression of this toolset is... well, pain. But, at least, if you work through the pain, you can probably make it go fast. And this is usually what matters to people in this situation.

A much better distributed platform is Erlang. Compared to something like Slurm, the benefit is that the programmer has superior access to sharing information between workers, as well as programmatic (and superior) control over choosing the workers.

The problem with people in the WLM world (that's where Slurm, MPI, OpenMP and friends come from) is that they have never crossed paths with Erlang. They don't even know what they don't know. A lot of the stuff that has to do with the "distributed" part of their computations is perceived as outside of their control, something outsourced to infra people: they just need to put a "machine file" somewhere, or use some HTML dashboard to select the nodes where their program needs to run, and hope for the best.

Also, the WLM world has a very peculiar take on distributed programming: they need it to run batches of jobs. Jobs start, finish, and, hopefully, produce results. They don't need to build interactive systems, or deal with upgrades, or create databases etc. So, it's just not an answer to more general problems associated with distributed programming.