by jallmann 3726 days ago
> the fundamental architecture of Erlang seems to be in conflict with big data and high speed computing

Because they are different problems. Concurrency, meet parallelism.

Concurrency: many smaller tasks that can be multiplexed over one core. The core doesn't need to be particularly fast; it just needs to be able to handle multiple tasks in-flight at once. Web serving, etc.

Parallelism: one big, honking task that can be split over multiple cores (or nodes), and each core needs all the horsepower you can eke out, so it can finish its part of the task quicker. Big Data, etc.

Erlang is not particularly fast (bad for parallelism), but it excels at concurrency.
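To make the concurrency half concrete, here's a minimal sketch in Python rather than Erlang. The names `handle_request` and `serve` are made up for illustration, and the sleep stands in for network I/O. It runs on one thread and one core, with a hundred tasks in flight at once:

```python
import asyncio

async def handle_request(request_id):
    # Stand-in for waiting on I/O; while this task waits,
    # the event loop switches to other tasks on the same thread.
    await asyncio.sleep(0.01)
    return f"response {request_id}"

async def serve(n):
    # Start n tasks and collect their results in submission order.
    return await asyncio.gather(*(handle_request(i) for i in range(n)))

responses = asyncio.run(serve(100))
```

The win is that no task blocks the others while it waits, not that any single task finishes sooner.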

> I know people do some big data stuff with Erlang

Not sure who is doing big data with Erlang, or why anyone would want to -- unless they have some very fundamental misconceptions about the problem at hand and the tools available.

> compromise some of the ideals espoused in this article to make it work

The article was really nonsense. Most of the terms it throws around have nothing to do with Erlang specifically, and proclamations like Erlang being used for a hypothetical "data center on a chip" are... what? The author's gist basically boils down to this: Use supervision trees with unikernels. I have had the same fantasy, for what it's worth.


I know you weren't aiming for maximum accuracy in your definitions, but I think this is worth bringing up.

Parallelism and Concurrency are not mutually exclusive terms. All parallelism is concurrent.

Concurrency is whenever more than one thing is happening at the same time conceptually. Everything from iterators to threads.

Parallelism is whenever more than one thing is happening at the same time physically. From a user-space perspective, this means threads or a coprocessor (like a GPU).
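A rough illustration of the "conceptual" end of that spectrum, assuming plain Python generators as the iterators (the `task` and `round_robin` names are hypothetical): two tasks make interleaved progress, yet only one thing is ever running physically.

```python
from collections import deque

def task(name, steps):
    # Each yield is a point where the task hands control back.
    for i in range(steps):
        yield f"{name}:{i}"

def round_robin(tasks):
    # Advance each task one step at a time, in turn, on a single thread.
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))
        except StopIteration:
            continue  # task finished; do not requeue it
        queue.append(current)
    return trace

trace = round_robin([task("a", 2), task("b", 3)])
# trace == ["a:0", "b:0", "a:1", "b:1", "b:2"]
```

Both tasks are "in progress" at the same time, so this is concurrent, but it is not parallel by the definition above.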

Yes. The intuition is a good one of parallelism being a deliberate way of designing a system to run its parts simultaneously (physically), with concurrency being a property of a system that may or may not have its parts running simultaneously (conceptually, 'overlapping', whether physically or logically).

There are many, many ways to think about this difference, and it's fun (and beneficial!) to do so every once in a while.

> From a user-space perspective, this means threads

Interestingly enough, before SMP and multicore, thread-level "parallelism" was actually time-sharing concurrency in disguise. That remains true for most threading libraries in languages with a GIL, and any time you have more threads than physical cores. In fact, the primary purpose of threads was (and still mostly is) to get a semblance of concurrency... Even the Erlang implementation was single-threaded until 2008 or so.

Well, obviously threads aren't always running at the same time, since each CPU can do only one thing at a time (or a finite number, if we're counting hardware threads) and you can have more threads than CPUs.

It's the fact that they might run at the same time. Also from the perspective of the programmer, there's no difference between two threads running at the same time or by time sharing, since preemptive multitasking is non-deterministic and has the same implications as true parallelism.

I think we're talking about different things? You seem to be referring to parallelism in the literal, general sense of the word, while I'm referring to computational parallelism. Basically, if Amdahl's Law doesn't apply, then you're looking at concurrency. So this statement

> from the perspective of the programmer, there's no difference between two threads running at the same time or by time sharing

is incorrect when you're talking about computational parallelism (as OP was), because you're not going to realize any speedups with time sharing. In that case, you're using threads as a concurrency mechanism -- not for parallelism.
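For reference, Amdahl's Law in its usual form is S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of cores. A quick sketch in Python (the function name and the 0.9 figure are mine, picked only for illustration):

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the work runs on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# Threads time-sharing one core: n = 1, so the bound is 1 (no speedup).
time_shared = amdahl_speedup(0.9, 1)

# The same workload spread over four physical cores.
four_cores = amdahl_speedup(0.9, 4)
```

With n = 1 the bound stays at 1 no matter how many threads you spawn, which is the sense in which time sharing gives you concurrency but not computational parallelism.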

> Not sure who is doing big data with Erlang, or why anyone would want to

Nokia created an open-source Hadoop-replacement called Disco [0] that used Erlang for coordination/orchestration -- an underappreciated strength of the language -- of map-reduce jobs, where the jobs were written in Python (and later OCaml, etc.). They've shown that it can handily outperform Hadoop (at least in the canonical wordcount example shown in this talk[1] -- there may be other examples, I haven't actually watched the talk yet). They've used it to mine terabytes of logs, daily, as described in this talk[2] and others apparently have used it as well.
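For anyone unfamiliar with the wordcount example, here is a minimal sketch of its map and reduce steps in plain Python. This is not Disco's API; `map_words`, `reduce_counts`, and the sample lines are hypothetical, and a real framework would run the map step on many nodes at once.

```python
from collections import Counter
from itertools import chain

def map_words(line):
    # Map step: emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_counts(pairs):
    # Reduce step: sum the counts emitted for each distinct word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be", "be quick"]
counts = reduce_counts(chain.from_iterable(map_words(line) for line in lines))
```

The coordination work (handing out lines, retrying failed nodes, collecting pairs) is the part the comment above attributes to Erlang; the functions themselves are the part written in Python.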

From the abstract[3] describing the first talk, about the project:

> We will describe our experiences using Erlang within Nokia to build Disco, a lean and flexible MapReduce framework for large-scale data analysis that scales to large clusters and is used in production at Nokia. Disco is an open-source project that was started in 2008 when attempts to use Hadoop to analyze data proved to be a painful experience. The MapReduce step formed only a portion of the analytics stack, and it was felt that it would be faster to write a custom implementation that would integrate well, than adapt Hadoop with the amount of internal Hadoop expertise available. Among the crucial tasks of such an implementation would be to deal with cluster monitoring, fault-tolerance, and the management and scheduling of a large number of concurrent and distributed jobs. To keep the implementation simple, the use of a platform that provided first-class support for distribution and concurrency was imperative. This motivated the choice of Erlang/OTP to implement the core control plane of Disco. It bears stressing that this choice was driven primarily by pragmatic concerns, as opposed to any beliefs about the superiority of functional programming languages in general or Erlang in particular.

The project's homepage [0] has information, a link to its GitHub, etc.

[0] http://discoproject.org/

[1] https://youtu.be/IjOGUC-iR_Q

[2] http://vimeo.com/23550705

[3] http://cufp.org/2011/disco-using-erlang-implement-mapreduce-...

> Not sure who is doing big data with Erlang, or why anyone would want to

Riak?

Are people actually running heavy analytic workloads using Riak map-reduce?

This assumes a pedantic qualification of "big data" handling -- does storage count, as opposed to the actual processing of the data? But I think that's an important distinction in this context, as it strikes the heart of the dichotomy between concurrency and parallelism.

(Moreover, I've never really been sure how much Riak is being used for actual 'big data', as opposed to being a master-less, highly available repository for 'regular data' (distinction between 'big' and 'regular' deliberately left vague). More so since Riak is key-value as opposed to columnar, but I suppose that depends on your workload. But that's all beside the point here.)