

	by babahoyo 2864 days ago
	Julia is increasingly becoming better than R and Stata for data cleaning. Many of its metaprogramming tools beat `dplyr` in syntax and features. So if the data-cleaning-to-regression stack (which I would guess is different from scientific computing) is your thing, then I would recommend trying Julia out.

2 comments

zzleeper 2864 days ago

Early on (as a heavy Stata and Python user) I tried Julia and got quite discouraged by its messy treatment of missing values (and weights, etc). I've also tried R but found lots of inconsistencies, so there wasn't enough reason to switch, except for plotting nice graphs.

But I would say Julia is increasingly getting there. Comparable packages are WAY easier to write in Julia than in Stata/Mata, while being faster, so any gaps will keep disappearing in the next years.


babahoyo 2864 days ago

you were somehow satisfied with the way stata handles missing values???

    gen x = y if z > 4 // headaches abound

Julia's missing value support is great now and is only going to get better. You have to be more careful with how you use them, but you won't get anything like the behavior above in Julia.

* For reference, Stata treats missing values as larger than any number (effectively +Inf), so any condition with "greater than" is also true for missing values, and the rows with a missing z get assigned too. And yes, there have been research papers retracted due to this behavior.
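
To make the contrast concrete, here is a minimal sketch using only Base Julia. The vectors `z` and `y` are made-up data, not anything from the thread. It shows a comparison against `missing` propagating instead of quietly evaluating to true, so you have to say explicitly what a missing condition should mean.

```julia
# Made-up example data: z has a missing entry, y is complete.
z = [1, 5, missing]
y = [10, 20, 30]

# Comparing against `missing` yields `missing`, not `true`.
cmp = z .> 4

# Decide explicitly how a missing condition is treated (here: as false).
keep = coalesce.(cmp, false)

# Rough analogue of `gen x = y if z > 4`: rows that fail the condition,
# including the row where z is missing, end up missing in x.
x = ifelse.(keep, y, missing)
```

This is the "more careful" part: a bare `missing` used as the condition of an `if` raises a `TypeError` rather than picking a branch, so the choice made by `coalesce` has to be written down.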


cuchoi 2863 days ago

One of my least favorite quirks of Stata


skybrian 2864 days ago

Did you see how Julia does missing values now?

https://julialang.org/blog/2018/06/missing


kirillseva 2864 days ago

Could you give some examples of how dplyr-based data cleaning code would look in modern julia?


babahoyo 2864 days ago

Check out DataFramesMeta, which unfortunately isn't working on Julia 1.0 yet. It has basically a 1-1 matching of `dplyr` verbs to Julia versions.

I don't think a standardized and idiomatic data-cleaning process has been established yet, which is for the best right now. There is `JuliaDBMeta` for metaprogramming with JuliaDB tables, and the `Queryverse` for working with a wide array of objects.

One way that Julia's metaprogramming shines is the ability to go into the AST and replace symbols, which enables local scopes that read more cleanly than spelling out the dataset name everywhere. One workflow I'm excited to experiment with is something like this:

    @as my_long_dataset d begin # make d = my_long_dataset in this scope
        @with d begin
            t = :x1 + :x2 + :x3 # these symbols are arrays inside this @with scope
            d.new_var = t # assign the variable
        end
    end

Of course, with the `@as` macro you probably don't save that many keystrokes if you are just doing `d.x` or `d[:x1, :x2]`... The ecosystem is still evolving but the point is that I like how you can replicate something like `attach` scoping in R without all the headaches. I think it makes a cleaning script feel more like you are only working with the data you care about.
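
For what it's worth, the symbol-replacement idea can be sketched in a few lines of Base Julia. This is not how DataFramesMeta implements `@with`. The macro name `@with_cols`, the helper `replace_syms`, and the NamedTuple standing in for a dataset are all made up for illustration.

```julia
# Hypothetical helper: walk an expression tree and rewrite every quoted
# symbol (e.g. :x1) into a column lookup on the table named by `d`.
replace_syms(ex, d) = ex  # literals, plain symbols, etc. pass through unchanged
replace_syms(ex::QuoteNode, d) =
    ex.value isa Symbol ? :(Base.getproperty($d, $ex)) : ex
replace_syms(ex::Expr, d) =
    Expr(ex.head, map(a -> replace_syms(a, d), ex.args)...)

# Hypothetical macro: inside `body`, :name means "column `name` of d".
macro with_cols(d, body)
    esc(replace_syms(body, d))
end

# A NamedTuple of vectors stands in for a dataset here (an assumption).
d = (x1 = [1, 2, 3], x2 = [10, 20, 30], x3 = [100, 200, 300])

# Expands to getproperty(d, :x1) .+ getproperty(d, :x2) .+ getproperty(d, :x3)
t = @with_cols(d, :x1 .+ :x2 .+ :x3)
```

The sketch is deliberately naive: it rewrites every quoted symbol it finds, so an expression like `d.x1` inside the body (which also contains a quoted symbol in its AST) would be mangled. A real package has to handle those cases.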
