Hacker News new | ask | show | jobs
by cdcfa78156ae5 2830 days ago
> If the code is written in a distributed fashion from the start then it can be designed so it's one python/node/R process per core.

That's a really glib dismissal of how hard the problem is. Python and node have pretty terrible support for building distributed systems. With Python, in practice most systems end up based on Celery, with huge long-running tasks. This configuration basically boils down to using Celery, and whatever queueing system it is running on, as a mainframe-style job control system.

The "shell scripts / xargs -P" mentioned by chubot is a better solution that is much easier to write, more efficient, requires no configuration, and has far fewer failure modes. That is because Unix shell scripting is really a job control language for running Unix processes in parallel and setting up I/O redirection between them.

3 comments

Am I correct in assuming Elixir/Erlang does a much better job at this compared to Node/Python/etc., putting aside (what I understand to be) the rather big problem of their relative weakness for computation?
Erlang can be a good fit -- the concurrency primatives allow for execution on multiple cores, and the in memory database (ets) scales pretty well. Large (approaching 1TB per node) mnesia databases require a good bit of operational know how, and willingness to patch things, so try to work up to it. Mnesia is erlang's optionally persistent, optionally distribution database layer that's included in the OTP distribution. It's got features for consensus if you want that, but I've almost always run it in 'dirty' mode, and used other layers to arrange so all the application level writes for a key are sent to a single process which then writes to mnesia -- this establishes an ordering on the updates and (mostly) eliminates the need for consensus.
I believe with the combination of native functions ("NIFs") in rust and some work on the nif interface (to avoid making it so easy to take down the whole beam VM on errors) - you might get more of best of both worlds today - than you used to. As you say erlang itself is rather slow wrt compute.
Thankfully it's not an issue for me. Elixir/Erlang is pretty much perfect for most of my use-cases :). But I foresee a few projects where NIFs or perhaps using Elixir to 'orchestrate' NumPy stuff might be useful. Most of my work would remain on the Elixir side though.
Yes. Erlang is built around distributed message sending and serialization. Python does not have any such things; even some libraries like Celery punt on it by having you configure different messaging mechanisms ("backends" like RabbitMQ, Redis, etc.) and different serialization mechanisms (because built-in pickle sucks for different uses in different ways). Node.js does not come with distributed message sending and serialization either.
> With Python, in practice most systems end up based on Celery, with huge long-running tasks.

Oh dear... Yeah, that's a terrible distributed system. Interestingly, all the distributed systems I've worked on with Python haven't had Celery as any kind of core component. It's just poorly suited for the job, as it is more of a task queue. A task queue is really not a good spine for a distributed system.

There are a lot of python distributed systems built around in memory stores, like pycos or dask, or built around existing distributed systems like Akka, Casandra, even Redis.

> There are a lot of python distributed systems built around in memory stores, like pycos or dask, or built around existing distributed systems like Akka, Casandra, even Redis.

Cassandra and Redis just mean that you have a database-backed application. How do you schedule Python jobs? Either you build your own scheduler that pulls things out of the database, or you use an existing scheduler. I once worked on a Python system that scheduled tasks using Celery, used Redis for some synchronization flags, and Cassandra for the shared store (also for the main database). Building a custom scheduler for that system would have been a waste of time.

> Cassandra and Redis just mean that you have a database-backed application.

Oh there's a lot more to it than that. CRDT's... for example.

Well, Celery uses Rabbit-MQ and typically Redis underneath. Rabbit-MQ to pass messages, and Redis to store results.

You can scale up web servers to handle more requests, which then uses Celery to offload jobs to different clusters.

Yeah, but fundamentally Celery is a task queue. You don't build a distributed system around that.
I think the intention was that if you're gently coerced into working with a single thread, like with node, then you're also coerced into writing your code in a way that's independent from other threads. In theory, it's easier reasoning about doing parallel work when you start from this point - I've certainly noticed this effect before.

I don't think any reasonable developer would dismiss concurrency/parallelism as easy problems.