by nikonyrh
2137 days ago
I have used both professionally in a senior data scientist role, so I feel like pitching in. Perhaps because my background is in Matlab, I never got too keen on dataframes (be it Pandas or whatever Clojure has to offer). Instead I use matrices for homogeneous data, or whatever hashmap-of-lists-of-sets describes more complex data.

When your data is already in CSV format and you want to do basic analysis on it or fit mathematical models, I highly recommend the Python / Numpy / Pandas / Scipy combination. It can easily be extended in whichever direction you want to go, be it PySpark or Keras.

Clojure taught me a lot about infinite lazy sequences (kinda like Python's generators) and how to model a program as a pipeline. A good analogy is shell programming. There you have stand-alone programs which handle individual tasks, and you pipe the previous program's stdout into the next program's stdin. In Clojure you'd write stand-alone functions which you "pipe" together via the "->" thread-first and "->>" thread-last macros. It also ships with several handy functions such as "frequencies", "group-by" and "partition-by". I have ported these and several others to my own Python projects thanks to their versatility and a kind of universality.

Oh, and speaking of macros, if you want to get fancy you can design your own domain-specific language and express your problem in that, hiding all of the boilerplate under the hood. But to get the highest performance you sometimes need to think about whether to use Clojure's immutable data structures or resort to Java's mutable ones, which could perform better (or use a library, I guess). Well, at least on the JVM you can do "real" parallel programming, unlike on the CPython interpreter, where the GIL keeps threads from running Python code in parallel.

Clojure is fun and very educational for all kinds of projects, but in a professional data analysis setting I'd start with Python, and if it seems like a bad fit then do a PoC with Clojure. :) What a huge topic.
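To make the "ported to Python" bit concrete, here is a rough sketch of what such ports could look like. These are not the actual helpers from my projects, just a minimal illustration; the names `group_by`, `partition_by` and `pipe` are made up for this example, and `pipe` only loosely imitates what "->>" does.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby


def frequencies(items):
    """Count how often each item occurs (similar to Clojure's frequencies)."""
    return dict(Counter(items))


def group_by(key_fn, items):
    """Map each key to the list of items that produce it (similar to group-by)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def partition_by(key_fn, items):
    """Split items into runs of consecutive elements sharing a key (similar to partition-by)."""
    return [list(run) for _, run in groupby(items, key=key_fn)]


def pipe(value, *steps):
    """Feed value through each step in order; a hypothetical stand-in for ->>."""
    return reduce(lambda acc, step: step(acc), steps, value)


words = ["apple", "avocado", "banana", "blueberry", "apple"]
initial_counts = pipe(
    words,
    lambda ws: (w[0] for w in ws),  # generator: evaluated lazily, like a lazy seq
    frequencies,
)
print(initial_counts)  # {'a': 3, 'b': 2}
```

Each step is a stand-alone function that takes the previous step's output, so you can test the pieces separately and rearrange them, much like programs in a shell pipeline.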