by babahoyo 2864 days ago
Julia is increasingly becoming better than R and Stata for data cleaning. Many of its metaprogramming tools beat `dplyr` in syntax and features. So if the data-cleaning-to-regression stack (which I would guess is different from scientific computing) is your thing, then I would recommend trying Julia out.
2 comments

Early on (as a heavy Stata and Python user) I tried Julia and got quite discouraged by its messy treatment of missing values (and weights, etc). I've also tried R and found lots of inconsistencies there too, so there was not enough reason to switch, except for plotting nice graphs.

But I would say Julia is increasingly getting there. Comparable packages are WAY easier to write in Julia than in Stata/Mata, while being faster, so any gaps will keep disappearing in the next years.

You were somehow satisfied with the way Stata handles missing values???

    gen x = y if z > 4 // headaches abound

Julia's missing value support is great now and is only going to get better. You have to be more careful with how you use them, but you won't get anything like the behavior above in Julia.

* For reference, Stata treats a missing value as larger than any number (effectively +Inf), so any "greater than" comparison is true for missing values. In the line above, rows where z is missing pass `z > 4` and x gets y's value. And yes, there have been research papers retracted due to this behavior.
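To make the contrast with the Stata line concrete, here is a minimal sketch in Base Julia. The vectors `y` and `z` are made up for illustration, and `coalesce(..., false)` is just one possible way to decide what a missing condition should mean.

```julia
# Made-up data standing in for the columns y and z of the Stata example.
z = [1, 5, missing, 7]
y = [10, 20, 30, 40]

# A comparison with `missing` propagates `missing` instead of returning true.
flags = z .> 4

# Using `missing` directly as a condition raises a TypeError, so the
# missing case has to be decided explicitly; here it is treated as false.
x = [coalesce(zi > 4, false) ? yi : missing for (zi, yi) in zip(z, y)]
```

The row where `z` is missing ends up with a missing `x` rather than silently picking up the value of `y`.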

One of my least favorite quirks of Stata.
Did you see how Julia does missing values now?

https://julialang.org/blog/2018/06/missing

Could you give some examples of how dplyr-based data cleaning code would look in modern julia?
Check out DataFramesMeta, which unfortunately isn't working on Julia 1.0 yet. It has basically a 1-to-1 mapping of `dplyr` verbs to Julia versions.

I don't think a standardized and idiomatic data-cleaning process has been established yet, which is for the best right now. There is `JuliaDBMeta` for metaprogramming with JuliaDB tables, and the `Queryverse` for working with a wide array of objects.

One way that Julia's metaprogramming shines is the ability to go into the AST and replace symbols, which lets you build local scopes that read more cleanly than the surrounding code. One workflow I'm excited to experiment with is something like this:

    @as my_long_dataset d begin # make d = my_long_dataset in this scope
        @with d begin
            t = :x1 + :x2 + :x3 # these symbols are arrays inside this @with scope
            d.new_var = t # assign the variable
        end
    end
Of course, with the `@as` macro you probably don't save that many keystrokes if you are just doing `d.x` or `d[:x1, :x2]`... The ecosystem is still evolving but the point is that I like how you can replicate something like `attach` scoping in R without all the headaches. I think it makes a cleaning script feel more like you are only working with the data you care about.
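
For anyone curious what "going into the AST and replacing symbols" can look like, here is a rough sketch in Base Julia. The names `replace_syms` and `@with_cols` are hypothetical and are not the DataFramesMeta implementation; a NamedTuple stands in for a real table.

```julia
# Hypothetical helper: walk an expression and turn every quoted symbol
# such as :x1 into a property lookup on the table named by `d`.
replace_syms(ex, d) = ex                      # literals, plain symbols, line nodes
replace_syms(ex::QuoteNode, d) =
    ex.value isa Symbol ? :(getproperty($d, $ex)) : ex
replace_syms(ex::Expr, d) =
    Expr(ex.head, map(a -> replace_syms(a, d), ex.args)...)

# Hypothetical macro: rewrite the body, then evaluate it in the caller's scope.
macro with_cols(d, body)
    esc(replace_syms(body, d))
end

# A NamedTuple of vectors standing in for a data frame (made-up values).
d = (x1 = [1, 2], x2 = [10, 20], x3 = [100, 200])

# Inside the macro call, :x1 etc. refer to the columns of `d`.
t = @with_cols(d, :x1 .+ :x2 .+ :x3)
```

The rewrite happens before the code runs, so the body is ordinary Julia once the macro has expanded.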