Hacker News new | ask | show | jobs
by arcticfox 2721 days ago
> If you need multi core and distributed features (which is generally more common than you think) elixir is truly your friend.

Is it though? At least in my line of work I don't think I've ever run into this. I feel like I've always been able to distribute just fine with workers/queues. If I even suspected it would I'd look into it more, but generally I find distributing across systems to be a software architecture-level and not language-level work; perhaps I'm missing something, however.

4 comments

Because those features are accessible, you end-up using them a lot more frequently and find new and exciting ways to use them. For example, I am really happy that Elixir and its tooling does pretty much everything using all cores: compiling code, running tests, generating documentation, etc. and all of this has a direct impact on the developer experience.

The other part is that you can build more efficient systems by relying on this. If you have a machine with 8 cores, it is more efficient to start a single process that can leverage all 8 cores and multiplex on both IO and CPU accordingly. This impacts everything from database utilization, to metrics, third-party APIs, and so on.

The Phoenix web framework also has great examples of using distribution to provide features like distributed pubsub for messaging and presence without external dependencies.

However, when it comes building systems, then I agree with you and I would probably use a queue, because you get other properties from queues such as persistence and making the systems language agnostic.

I hope this clarifies it a bit!

Exactly right, just the pure fact that it's at your disposal makes you think about problems in a whole new way. For example, I recently just abstracted my solutions away from Redis. I have nothing against redis however removing a dependency makes things more simple, which is important for my setup.

For caching you have SO many options which are already built in, ets, agent, ets + gen_server, or even reaching out for a library like nebulex.

Another example is recurring jobs, you can create a gen_server that will run a job every x hours in roughly 50-100 lines of code depending on how complex your problem is.

I rarely feel the need to reach out for external dependencies. Not that there is anything wrong with that, it's just that you now have a wider array of tools to work with and some problems are just solvable with a couple of lines of code instead of having to reach out for external tooling.

I recently built a distributed work queue, generally in Ruby I would use something like sidekiq, however elixir made me feel like "hey you can write your own" which generally is not recommended since you don't want to re-invent the wheel, but if you are doing something that breaks away from the existing solution, having the ability to craft a custom solution that works for your specific set of problems is extremely powerful, you can get much more creative, and the important thing is it got done fast (I wrote a distributed job scheduler in 2 weeks + 1 week to clean it up and work out the kinks, it's already running stably in production)

Cool, thanks José, that helps indeed. The multi-core DX stuff sounds interesting even on its own, production code aside. I hope to check it out soon!
Until I used Elixir, I thought workers/queues were enough. But after the last nearly-three-years, I've actually fallen into a place where workers/queues are almost always strictly inferior.

Workers/queues in languages like Ruby have problems like,

* Require very specific ergonomics(for example, don't hand the model over, hand over the ID so you can pull over the freshest version and not overwrite)

* They require a separate storage system, like your DB, Redis, etc. This doesn't sound big, but when doing complex things it can turn into hell.

* They have to be run in a separate process, which makes deployment more difficult.

* They're slow. Almost all of them work on polling the receiving tables for work, which means you've got a lag time of 1-5 seconds per job. Furthermore, the worse your system load, the slower they go.

* You can't reliably "resume" from going multi-process. Lets say you're fine with the user waiting 2-3 seconds to have a request finish. With workers/queues, you either have to poll to figure out when something finished(which is not only very slow, but error prone), or you have to just go slow and not multi-process, making it into a 8-10 second request even though you've got the processing power to go faster.

So, you've got all that. Or in Elixir, for a simple case, you replace `Enum`(your generic collection functions) with `Flow` and suddenly the whole thing is parallel. I mean that pretty literally too- when I need free performance on collections, that's usually what I do. Works 95% of the time, and that other 5% is where you need really specific functionality anyway, and for those, Elixir still has the best solution to it I've ever seen.

Erlang/Elixir has some really great advantages in concurrency and parallelism but what you're describing are just badly designed systems.

Shopify, for example, use Resque (Ruby + Redis) to process thousands of background jobs per second.

> * Require very specific ergonomics(for example, don't hand the model over, hand over the ID so you can pull over the freshest version and not overwrite)

This is good practice but certainly not a requirement. You can pass objects in a serialized format like JSON or use Protobuf etc.

> * They require a separate storage system, like your DB, Redis, etc. This doesn't sound big, but when doing complex things it can turn into hell.

ETS and Mnesia aren't production ready job queues, unfortunately: https://news.ycombinator.com/item?id=9828608

> * They have to be run in a separate process, which makes deployment more difficult.

Background tasks have different requirements so this is a good idea regardless.

> * They're slow. Almost all of them work on polling the receiving tables for work, which means you've got a lag time of 1-5 seconds per job. Furthermore, the worse your system load, the slower they go.

Redis queues have millisecond latency and there's no polling. Resque and Sidekiq use the BRPOP to wait for jobs. BRPOP is O(1), so it doesn't slow down as the queue backs up.

PG has LISTEN/NOTIFY to announce new jobs or the state change of an existing job so there's no need to poll. SKIP LOCKED also prevents performance degrading under load.

> * You can't reliably "resume" from going multi-process. Lets say you're fine with the user waiting 2-3 seconds to have a request finish. With workers/queues, you either have to poll to figure out when something finished(which is not only very slow, but error prone), or you have to just go slow and not multi-process, making it into a 8-10 second request even though you've got the processing power to go faster.

There are multiple other options here which are better:

Threads - GIL allows parallel IO anyway and JRuby has no GIL

Pub/Sub - Both Redis and PG have a great basic implementation usable from the Ruby clients

Websockets - Respond early and notify directly from the background jobs

> This is good practice but certainly not a requirement. You can pass objects in a serialized format like JSON or use Protobuf etc.

ie, requiring very specific ergonomics. If you have to change what you're doing, it's a new domain to learn.

> ETS and Mnesia aren't production ready job queues, unfortunately: https://news.ycombinator.com/item?id=9828608

I didn't mention ETS or Mnesia? The OP was talking specifically about using job queues to get concurrency/parallelism, in which case you absolutely don't need job queues. If you need a job queue, you need a job queue.

> Background tasks have different requirements so this is a good idea regardless.

Why? You're just stating this like it's obviously true, and honestly I can't think of a time I significantly wanted a different system doing my jobs than the one handling requests.

> Redis queues have millisecond latency and there's no polling. Resque and Sidekiq use the BRPOP to wait for jobs. BRPOP is O(1), so it doesn't slow down as the queue backs up.

Redis queues have millisecond latency, Ruby using Redis queues does not. That's the part that polls when there's nothing else going on. If you're never running out of jobs to do then your latency is fast, but you're also not accomplishing things as fast as possible(since it's waiting on whatever is in front of it).

If this isn't true anymore, then alright, but last I used Sidekiq(early 2018), the latency to start processing a job was often greater than a second.

> Threads - GIL allows parallel IO anyway and JRuby has no GIL

And are incredibly difficult to use and pass information back and forth(hence why Elixir exists at all- Jose Valim was the person implementing this on the Rails core team).

> Pub/Sub - Both Redis and PG have a great basic implementation usable from the Ruby clients

Can certainly work, to be honest I never tried this because of the complexity of initial setup and how green I was when I needed it.

> Websockets - Respond early and notify directly from the background jobs

Which Ruby has a lot of trouble maintaining performantly. When my original team went to use Rails5 sockets, we found we could barely support 50 sockets per machine.

---

It's worth saying, I'm not saying one shouldn't use Ruby- the place I work right now is a primarily Ruby shop, and my Elixir work is for event processing and systems needing microsecond response times. But, we've also built things in Elixir that normally I would use Ruby or JS for, and not only does it do well, but often it's write it and forget it, with deployment being literally "run a container and set the address in connected apps".

> ie, requiring very specific ergonomics. If you have to change what you're doing, it's a new domain to learn.

Resque & Sidekiq build this in by converting job arguments to JSON. There's nothing extra to learn.

> I didn't mention ETS or Mnesia? The OP was talking specifically about using job queues to get concurrency/parallelism, in which case you absolutely don't need job queues. If you need a job queue, you need a job queue.

Sorry, I thought you were talking about building a background job system in Erlang using out of the box OTP but it sounds like you're actually talking about trying to get parallelism in Ruby by doing RPC over Sidekiq? That's always a bad idea!

> Redis queues have millisecond latency, Ruby using Redis queues does not. That's the part that polls when there's nothing else going on. If you're never running out of jobs to do then your latency is fast, but you're also not accomplishing things as fast as possible(since it's waiting on whatever is in front of it).

Ahhh! When Mike Perham says "Sidekiq Pro cannot reliably handle multiple queues without polling" what this really means is a Redis client can only block on and immediately process from the highest priority queue. The lower priority queues are only checked when blocking timeout expires. There's no "check all queues and sleep" polling loop which adds artificial latency.

> And are incredibly difficult to use and pass information back and forth(hence why Elixir exists at all- Jose Valim was the person implementing this on the Rails core team).

Jose Valim didn't join Rails core until a couple years after Josh Peek (now working for GitHub) made Rails thread-safe.

> And are incredibly difficult to use and pass information back and forth(hence why Elixir exists at all- Jose Valim was the person implementing this on the Rails core team).

It's really not that hard anymore!

results = ['url1','url2'].parallel_map{|url| HTTParty.get(url) }

2012-2013 onwards Ruby got great libraries like concurrent-ruby and parallel that make things a lot easier.

> Which Ruby has a lot of trouble maintaining performantly. When my original team went to use Rails5 sockets, we found we could barely support 50 sockets per machine.

ActionCable is designed for convenience not performance. https://github.com/websocket-rails/websocket-rails will handle thousands of connections per process.

If you need a request/response model (ex: query some data) and not simply queue an operation to execute later without waiting for it. I agree (worker/queue) are the wrong solution. But you should use a multi-language RPC framework like GRPC instead of building a distributed monolith.

With kubernetes you have an endpoint per service that route and load balance to the correct machine.

It seem to me you already got all the benefits from using actor but with less lock-in to a language.

That's even more work? Most developers already have redis/Postgres running, the issue there is the added complexity in complex operations.

Not to mention, microservices are not always the answer. And even if they were, they're still an insane amount of more work than literally changing what functions you call.

I'm not saying you should never use an RPC, but I've significantly reduced the times I'd want to use one. The only reason I even advocate for one now, is because I prefer to empower developers to use whatever language makes them happy, even if I personally greatly prefer Elixir.

Also, building Elixir apps aren’t really the same as traditional monoliths.

It’s really not a distributed monolith as each node can deploy with different code. More importantly actors and supervision trees really help keep projects organized. It’s easy to use PubSub mechanisms or just named actors to communicate between services.

As an example I recently took a project that ran on a single IoT device and moved a chunk of it that managed a piece of hardware to another IoT device connected by Ethernet. It only took moving a handful of files and renaming a few modules. It took longer to figure out why multicast wasn’t working than to refactor the app. There are some limitations with type specs not working as well as I’d like with PubSub style messages (most type checking is done via api functions, not on individual messages).

Calling a remote function is exactly the same as calling a service in both case you are doing an RPC. While doing a Rest service is more work, using something like GRPC or java RMI is not and also support pub-sub mechanism and give you have a clear interface that define which functions can be called remotely which make understanding the cost of the function call and the security implications a lot easier.
> Most developers already have redis/Postgres running

You are joking right? You mean web developers?

Yes, web developers. Elixir is primarily a network systems language, and generally, the Ruby, Javascript, Python, etc communities use Postgres/Redis. Obviously it isn't universal, but there are obvious analogies(MySQL, etc).
I'm a big fan of Erlang, but I think you can acheive similar things in other languages with queues and workers. Erlang's advantage here is that you can do easily do a worker per client connection, for almost any number of client connections; for data processing queues, the lack of data sharing between processes (threads) strongly pushes you towards writing things in a way that easily scales to multiple queue workers -- you can of course write scalable code in other languages, but it's easier to write code with locking on shared state when shared state is easier.

As for distribution, again, this isn't necessarily exclusive, but the right primatives are there and work well to start with. You could have good reasons for a bigger separation between nodes as well.

Erlang has some warts too, of course. For me, the warts are usually about scale, oddly enough. BEAM itself scales very well, but some parts of the OTP don't, often because of the difference in expectations between a telcom environment and a large scale internet service. Two examples:

A) the (OTP) tls session cache is serviced by a single process and in earlier versions, the schema and queries were poorly designed so you could store multiple entries for a single destination, and the query would retrieve all and then discard all but the first. When you were making many connections to a host that issued sessions but didn't resume them, all of the extra data could overwhelm that one process, and resulting in timeouts attempting to connect to any tls host. This was fixed in a release after r18, I believe, to store only one session per cache key, and the cache was plugable before then, but it wasn't fun to find this out in production.

B) reloading /etc/hosts and querying the table it loads into weren't done in an atomic way. I believe this is fixed in upstream as well, but queries satisfied by /etc/hosts were actually two queries on the same table, and reloading the table was done by clearing and then loading, so the second query could fail unexpectedly. This led to the bundled http client getting stuck, despite timeouts.

Workers and queues fail. SQS was down for us for almost two weeks while AWS fixed a bug. We had no choice but to wait or rewrite our implementation... Again! We've already had to rewrite once due to poor visibility and rare occasional problems processing data. Debugging such distributed systems is legendary hell. And that's just for simple async processing so that we can return a response quickly to the user and finish the task in a few seconds. There is simply no comparison between such a complex, failure prone distributed system and the simplicity, reliability, and ease of use of having support built into the language for this, IMO.
I am sorry but I disagree. You are trying to make it sound that your cloud provider downtime has something to do how you manage your workload in your code.

Debugging __any__ distributed system is difficult, this is why monitoring and tracing should be first class citizens in your deployments. It seems they are not for you.

Yeah, monitoring told us it was down and eventually we figured it was an AWS issue we could do nothing about until they patched it. My main point there is actually that for many use cases, this doesn't have to be a distributed computing problem and thus the non-distributed version is superior to the distributed version.