by SatvikBeri 3394 days ago
The most direct reason is that the current team enjoys functional programming.

From a business standpoint though, there are a few main reasons:

–Data pipelines are well modeled as functions: they take a few input datasets, return a few outputs at the end, and do a ton of processing in between

–FP idioms generally make parallelization easier, and this is very important for the datasets we're dealing with

–A strong type system like Scala's lets us catch many errors at compile time rather than at runtime, which is quite important when your pipelines can take several hours to run

–It's fairly trivial to wrap a statistical/ML algorithm in a pure functional interface, even if the algorithm itself is imperative
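To make the last point concrete, here is a minimal Scala sketch, not our actual code. The names `PureWrapper`, `Stats`, `summarize`, and `standardize` are made up for illustration. `summarize` runs Welford's online algorithm with local mutable variables, but callers only see a function from input to output, and `standardize` is a pipeline stage built on top of it.

```scala
object PureWrapper {
  // Hypothetical result type, introduced only for this example.
  final case class Stats(count: Long, mean: Double, variance: Double)

  // Pure from the caller's point of view: same input, same output,
  // no visible side effects. The mutable state never leaves the method.
  def summarize(xs: Seq[Double]): Stats = {
    var n = 0L
    var mean = 0.0
    var m2 = 0.0 // running sum of squared deviations from the mean
    xs.foreach { x =>
      n += 1
      val delta = x - mean
      mean += delta / n
      m2 += delta * (x - mean)
    }
    // Sample variance; defined as 0.0 here when there are fewer than two values.
    val variance = if (n > 1) m2 / (n - 1) else 0.0
    Stats(n, mean, variance)
  }

  // A pipeline stage is then just a function from one dataset to another.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val s = summarize(xs)
    val sd = math.sqrt(s.variance)
    if (sd == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - s.mean) / sd)
  }
}
```

Because `standardize` does not touch shared state, it can be applied to separate partitions independently, for example `partitions.map(PureWrapper.standardize)`, which is the property that makes parallelizing easier.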

2 comments

Have you had performance issues getting things to conform to functional paradigms?

For example, I've found that as a pipeline gets optimized for production use, it needs to preallocate all of its output space and then modify things in place at each step (like a one-hot encoder flipping a few bits in specific rows of a zeroed array instead of allocating new arrays and copying them in).
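A rough Scala sketch of the kind of step I mean, with made-up names (`InPlaceOneHot`, `encodeInto`) and assuming labels are already integer class indices. The caller preallocates a zeroed, row-major buffer and the step only flips entries in it:

```scala
object InPlaceOneHot {
  // Writes a one-hot encoding of `labels` into `out`, a row-major buffer of
  // labels.length * numClasses bytes that the caller allocated and zeroed.
  // Returns Unit: the only result is the mutation of `out`.
  def encodeInto(labels: Array[Int], numClasses: Int, out: Array[Byte]): Unit = {
    require(out.length == labels.length * numClasses, "buffer has the wrong size")
    var row = 0
    while (row < labels.length) {
      val label = labels(row)
      require(label >= 0 && label < numClasses, s"label $label is out of range")
      out(row * numClasses + label) = 1 // flip one entry; nothing is copied
      row += 1
    }
  }
}
```

The caller would allocate the buffer once with `new Array[Byte](labels.length * numClasses)` (JVM arrays start zeroed) and pass it in, so the step is a side effect on shared memory rather than a function that returns a new value.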

I find it difficult to reconcile this sort of code with a "pure functions without side effects" philosophy and still have it perform at an acceptable level.

We're mostly doing ETL on large datasets, so the code needs to parallelize well, but beyond that performance isn't really a big concern. We use ML in research, but no models in production, because the costs of increased maintenance/lost transparency generally outweigh the benefits in our use case.

In jobs that were heavy on ML, I would use high-performance tools for the models (imperative code, numeric computing packages, etc.) and functional code for the ETL, which worked pretty well. There's no need to be dogmatic about it: a 70% pure codebase is still generally easier to reason about than a 20% pure codebase.

Interesting. I will have to take a proper look at Scala when I get my baby cluster up and running.