| HN Mirror


by jfim 1881 days ago

> Parallelizing data processing is trivial, it's literally 2-5 more lines of code in many scenarios. I've tried -- and failed -- to replicate what Elixir does in this regard, in several other programming languages.

Scala's parallel collections handled this cleanly. I believe they've been moved out of the standard library (they're still available as a separate JAR, though), but the idea was that you'd just add a .par call to your data processing chain, and it would use a parallel collection instead of a regular sequential one.

These examples from the documentation (https://docs.scala-lang.org/overviews/parallel-collections/o...) show how to turn a regular sequential computation:

  val list = (1 to 10000).toList
  list.map(_ + 42)

into a parallel one:

  list.par.map(_ + 42)

Oftentimes, for data processing, being able to parallelize parts of the computation makes it fast enough that one does not need to go beyond a single machine. Spark's RDDs are basically a distributed version of Scala's collection library.
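For what it's worth, here is a self-contained sketch of the same comparison. It assumes Scala 2.13 or later with the separate scala-parallel-collections module on the classpath, which is why the CollectionConverters import is there (on 2.12, as far as I know, .par works without it). The object name ParExample is just made up for illustration.

```scala
// Assumption: Scala 2.13+ with the scala-parallel-collections module available.
import scala.collection.parallel.CollectionConverters._

object ParExample {
  val list: List[Int] = (1 to 10000).toList

  // Sequential version: one thread walks the whole list.
  val sequential: List[Int] = list.map(_ + 42)

  // Parallel version: .par converts to a parallel collection,
  // map runs across worker threads, .toList converts back.
  val parallel: List[Int] = list.par.map(_ + 42).toList

  def main(args: Array[String]): Unit = {
    // map on a parallel sequence preserves element order,
    // so the two results can be compared directly.
    println(sequential == parallel)
  }
}
```

Note that .par on a List copies the elements into a parallel structure first, so for a cheap operation like adding 42 the conversion may well cost more than the parallelism saves.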

One of the areas where Elixir and Erlang shine is distributed applications.

For anyone who has written socket code, the ability to easily send regular Elixir data structures as messages to processes that may run locally or remotely is pretty awesome. And once you realize that you can also send closures over the network (i.e., one node can send a snippet of code to another node and it'll be executed there), your mind is blown.

1 comments

pdimitar 1880 days ago

Very glad that Scala offers this!

Do you have any insight on the underlying thread pool? Can you control the size? Is it automatically determined?

Erlang/Elixir default to one scheduler thread per CPU thread, but the number of schedulers can also be changed manually.


tn1 1880 days ago

This talks about how to configure it: https://docs.scala-lang.org/overviews/parallel-collections/c...
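Roughly, and as far as I understand that page, the default is a fork-join pool sized to the number of processors, and you can swap in your own pool through the collection's tasksupport field. A minimal sketch, assuming Scala 2.13+ with the scala-parallel-collections module on the classpath; ParPoolExample and addInParallel are hypothetical names, not part of the library:

```scala
// Assumption: Scala 2.13+ with the scala-parallel-collections module available.
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.CollectionConverters._
import scala.collection.parallel.ForkJoinTaskSupport

object ParPoolExample {
  // Hypothetical helper: runs the map on a pool with the given parallelism level.
  def addInParallel(xs: List[Int], parallelism: Int): List[Int] = {
    val pool = new ForkJoinPool(parallelism)
    try {
      val pc = xs.par
      // Replace the default task support with one backed by our own pool.
      pc.tasksupport = new ForkJoinTaskSupport(pool)
      pc.map(_ + 42).toList
    } finally {
      // Release the worker threads once the computation is done.
      pool.shutdown()
    }
  }
}
```

So ParPoolExample.addInParallel(list, 2) would run the same map on at most two worker threads. The task support is set per collection, so different pipelines can use different pools.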
