by kyllo 1164 days ago
Elixir runs on the Erlang VM, which is designed for distributed multi-process concurrency, right? Why would this be advantageous for interactive data analysis work, which is typically done on a single node? I don't quite understand the use case, hoping someone can explain.
3 comments

Use cases that come to mind are data distributed across nodes and things like distributed training of machine learning models (which is getting more and more focus as models get bigger).
Hi, maybe I can try to answer your question.

First, to make sure we are on the same page, distribution in Erlang happens across nodes and concurrency happens within a single operating system process. Erlang calls its concurrency primitive "processes" (because they are also isolated and preemptive), but that can cause some confusion (hence this comment).

From now on, when I mention process, I mean the Erlang VM processes, and they are very lightweight and you can create millions of them.
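To make the "lightweight" part concrete, here is a minimal sketch (not taken from Livebook; the count and the squaring work are made up for illustration) that spawns many Erlang VM processes and has each one send a result back to the parent:

```elixir
# Spawn many lightweight processes; each one squares its input
# and sends the result back to the parent as a message.
parent = self()
count = 100_000 # illustrative number, chosen arbitrarily

pids =
  for i <- 1..count do
    spawn(fn -> send(parent, {:done, i, i * i}) end)
  end

# Collect exactly one reply per spawned process.
results =
  for _ <- 1..count do
    receive do
      {:done, _i, square} -> square
    end
  end

total = Enum.sum(results)
IO.puts("#{length(pids)} processes replied, sum of squares = #{total}")
```

All of this happens inside one operating system process; no network or OS-level IPC is involved.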

I can think of a few different ways where concurrency can help interactive data analysis:

1. Livebook supports rich outputs where each output is a process. This means your notebook can communicate with outputs as it executes. For example, it is very easy for you to train a neural network and push data to the graph as it comes in, or to process data and plot it as you go.

2. You can use concurrency to run several experiments at once within the same notebook. We support this in Livebook via "Branched sections". You can prepare the data and then start several branches/processes to digest the data in different ways without a need to start several notebooks.

3. It is also very easy to build applications where multiple users can collaborate and interact with it, which we showed yesterday: https://news.livebook.dev/build-and-deploy-a-whisper-chat-ap...
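As a rough illustration of the second point, here is a sketch using plain `Task` from Elixir's standard library. This is not how Livebook implements branched sections; the data and the three "analyses" are hypothetical. It only shows the underlying idea of preparing data once and digesting it in several ways concurrently:

```elixir
# Shared, already-prepared data (made-up numbers for illustration).
data = [4, 8, 15, 16, 23, 42]

# Each "branch" is a hypothetical analysis that runs in its own process.
branches = [
  mean: fn xs -> Enum.sum(xs) / length(xs) end,
  max: fn xs -> Enum.max(xs) end,
  evens: fn xs -> Enum.count(xs, &(rem(&1, 2) == 0)) end
]

# Start every branch concurrently...
tasks =
  for {name, fun} <- branches do
    {name, Task.async(fn -> fun.(data) end)}
  end

# ...then wait for each result, keeping the original order.
results = for {name, task} <- tasks, do: {name, Task.await(task)}
IO.inspect(results)
```

Each branch receives the same immutable `data`, so the branches cannot interfere with one another.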

When it comes to distribution, it is quite similar to the above, because the concurrency and distribution primitives in the Erlang VM are the same. Here is an example of how easy it is to take an ML model from concurrent to distributed: https://news.livebook.dev/distributed2-machine-learning-note...

Generally speaking, I think we should start from the opposite side: we should try to make everything concurrent by default and fall back to serial only when we cannot. Especially for data analysis, where moving data is expensive, we may end up incurring a lot of overhead if the only form of concurrency is via the network or inter-process communication.

One last note: perhaps the most important aspect of the Erlang VM for data analysis is that it favors a functional style. As a result, Livebook notebooks are strongly reproducible. I expand on this in this video: https://www.youtube.com/watch?v=EhSNXWkji6o

I hope this helps (and feel free to tell me if I missed the mark!).

PS: I know many of the videos above are machine learning related, and that's because we have only just started our data journey, although the principles should generally apply. Hopefully more data videos will come soon! :)