Hacker News
by babahoyo 2796 days ago
Getting people to switch from Excel to R (or Julia) means balancing two competing goals.

First, you need to establish that scripting is far superior to point-and-click interfaces because of its reproducibility and legibility. This is the fight worth having, as it gets to the core of what it means to do science and share results.

Second, you need to convince them that they don't have to sacrifice the tractability and ease of use of Excel. Jupyter notebooks, for all their faults, really sell this idea well. But the first goal should dominate the second. nteract and Shiny are great, but I think they are big tools that are difficult to teach beginners to code up. We shouldn't say "use R because you can still use point-and-click interfaces"; we should argue against point-and-click interfaces altogether in this context.

I also agree with the poster below that running a script again with one parameter changed is super super easy, and you don't need a GUI to explore data like that.
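A minimal sketch of that point, in Python (the same shape works in R or Julia). The data, the `summarize_income` name, and the `threshold` parameter are all made up for illustration; the idea is only that the whole analysis lives in one function, so "exploring" means rerunning it with one argument changed.

```python
import statistics

# Hypothetical data; in practice this would be read from a CSV file.
INCOMES = [18_000, 25_000, 32_000, 47_000, 61_000, 120_000]


def summarize_income(incomes, threshold):
    """Return the count and mean of incomes at or above `threshold`."""
    kept = [x for x in incomes if x >= threshold]
    return {"n": len(kept), "mean": statistics.mean(kept) if kept else None}


if __name__ == "__main__":
    # Exploring the data is just rerunning with a different parameter.
    for threshold in (0, 30_000, 60_000):
        print(threshold, summarize_income(INCOMES, threshold))
```

Because the parameter is an ordinary argument, the exact value used for any reported number is written down in the script rather than remembered.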


It may be useful to take people's actual behaviour into account, instead of starting from the assumption that they must be idiots.

As it happens, no tool has been more successful at making this idea of "everyone being a coder" a reality than the spreadsheet. It's an excellent (pun intended) gateway drug, with low barriers to entry, and a value proposition that was obvious enough to get accountants to spend 5 figures on a Mac when computers were something completely new.

Sure, the language is clunky. But it's apparently enough for most people's needs. I can't immediately think of any reasons for this almost moralistic criticism of Excel.

It seems like it would be a weekend project to convert Excel worksheets to sequential code in a universally-readable text file, plus a separate CSV for the data. That would make sharing and reproducibility just as easy as getting them to switch out their complete toolchain. But I have doubts that unwillingness to share code & data is actually a technical problem, instead of a lack of incentives/embarrassment of sharing one's awful code/dreams of future patents/privacy issues.
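To make the "sequential code" idea concrete, here is a toy sketch in Python. It is not a real converter: the worksheet is a hand-written dictionary rather than an actual Excel file, the `sheet_to_script` name is hypothetical, and only plain arithmetic formulas are handled (no `SUM()`, ranges, or cross-sheet references). It shows just the core step, which is ordering cells so that every input is assigned before it is used.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical worksheet: cell name -> literal value or "=formula".
SHEET = {
    "C1": "=B1*0.2",
    "B1": "=A1+A2",
    "A1": "100",
    "A2": "250",
}

# Matches simple cell references such as A1 or AB12.
CELL_REF = re.compile(r"\b[A-Z]+[0-9]+\b")


def sheet_to_script(sheet):
    """Emit one assignment per cell, ordered so inputs come before uses."""
    deps = {
        cell: set(CELL_REF.findall(src)) if src.startswith("=") else set()
        for cell, src in sheet.items()
    }
    lines = []
    for cell in TopologicalSorter(deps).static_order():
        # Assumption: a referenced but empty cell counts as 0.
        source = sheet.get(cell, "0")
        lines.append(f"{cell} = {source.lstrip('=')}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(sheet_to_script(SHEET))
```

The output is a plain text file in which the order of operations is visible from top to bottom, which is exactly what the grid layout hides. Everything beyond this step (functions, ranges, formatting, macros) is where the "weekend" estimate would be tested.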

I'm also not convinced that Julia as I have experienced it would be any better for long-term reproducibility. Last time I tried Julia, about every second package had a specific version that it required, and very few of them shared their preferences.

I think there are plenty of reasons for moralistic criticism of Excel.

> That would make sharing and reproducibility just as easy as getting them to switch out their complete toolchain

Reproducibility is about reading as much as it is about getting the right number. Excel hides the operations from the reader, so there could be a cell X182 that holds the key operation and that you are never able to trace back.

https://www.bloomberg.com/news/articles/2013-04-18/faq-reinh...

But I totally agree that an excel spreadsheet is the most popular and easiest to learn programming language on the planet. That's why I mentioned ipython notebooks above. Excel is great because you can enter something and immediately see your results in front of you, which is something that jupyter does as well. But we definitely need to draw more from excel's UI and workflow if we want people to switch to those tools.

> But I have doubts that unwillingness to share code & data is actually a technical problem, instead of a lack of incentives/embarrassment of sharing one's awful code/dreams of future patents/privacy issues.

That's true. But at the very least internal use of plain text and git would be immediately helpful for collaboration and project management (branching, code reviews etc.). I think you underestimate how unwieldy a project like this can get very quickly.

> Last time I tried Julia, about every second package had a specific version that it required, and very few of them shared their preferences.

The landscape is still settling down after the 1.0 release. I share the concern about particular version numbers, but notice that basically all packages are on 0.X in SemVer. When the ecosystem makes it to 1.0 as well things will improve dramatically.

Have you ever tried org-mode with org-babel? If not, watch a few videos about org-mode for reproducible research in Emacs on YouTube and the like :-) As for reproducibility, Nix{,OS} and Guix{,SD} offer really nice tools for achieving full-stack reproducibility without having to publish multi-MB/GB archives; a single text file (.nix or Scheme) does the job.

ESS inside Emacs is another nice working environment. Perhaps it is not as "natural at first impact" as a spreadsheet, but I bet NO ONE, after a week of (self) training, will find a spreadsheet better or even usable.

I think this is missing the forest for the trees. The audience for these tools is people who have never written in a text editor in their life! They literally might not even recognize that some fonts are fixed-width, and might not know what TextEdit is.

That is to say, Emacs might be overkill. But literate programming is a pretty powerful concept in general, and R Markdown and RStudio (along with similar tools in Python, Julia, etc.) make this super easy, and I think it's a great way to introduce people to programming.

Getting people to use Git is the fight worth having imo, but it's still hard. I was talking to someone the other day who didn't like git because it was tough to have comments about the work: they didn't use code reviews on GitHub to get consensus, just comments in the code.

Plus there is the whole data-management problem. Quilt, SQL connections, and DataDeps for Julia are good solutions, but there isn't a single answer that we've coalesced around.

These are all really tough problems. I'm actually really interested in MS's acquisition of github for this reason. Maybe they can put some money towards the obstacles that have prevented mass adoption of scripting and git.

IMVHO everyone must know the tools of their trade (I don't know if this makes sense in English, I hope so; it could also be expressed as "the tools used for doing their job"). If statistics is done with computers today, then good computer tools must be known. Newcomers of course can't know them by birth; they have to learn, and universities must teach them, so...

Consider that org-mode itself was designed and written, and is still maintained, by an astronomer, and that one of the oldest and most widespread completion frameworks for Emacs (Helm) was written and is still maintained by a mountain guide who still works as a mountain/climbing guide. Of course they are "exceptions", but they are simply people who encountered "the right tool" at a certain point in time and started to learn it.

As for GitHub: to me it, like any proprietary platform, should be ignored, or at most used to share git repos, and certainly not for PRs and the like, which are proprietary, non-portable features. And that's another thing anyone who uses a computer for more than playing games MUST know, from office workers who keep their email only on webmail, to smartphone users with their "valuable" personal data (photos, videos, etc.), to IT pros.

I disagree that it has to be scripting XOR point-and-click. Why can't we manipulate a dataframe using both approaches?
If the pointing and clicking generates code that can be seen and stored, then fine. The reproducibility problem comes from the fact that most point-and-click tools don't. As a computational biologist who uses R, I am often frustrated by experimental colleagues who use Excel and often can't remember, months later, how they transformed the data. I don't necessarily have a better memory, but I can go into my script and look.
In scripting languages, writing and sharing a new function isn't hard. And, if the language is open source, that encourages the rapid development of a community which implements useful and state-of-the-art tools.
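One way to picture "pointing and clicking that generates code" is a data model that logs every action as a line of script. The Python sketch below is purely illustrative: `RecordingTable`, its two methods, and `replay` are hypothetical names, not the API of any existing GUI tool, and the "buttons" are ordinary method calls standing in for clicks.

```python
import math


class RecordingTable:
    """Hypothetical stand-in for a GUI's data model: each action is logged as code."""

    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]
        self.script = []  # the stored record of what was "clicked"

    def add_log_column(self, column):
        # What a "log-transform" button might do behind the scenes.
        for row in self.rows:
            row[f"log_{column}"] = math.log(row[column])
        self.script.append(f"table.add_log_column({column!r})")

    def drop_below(self, column, minimum):
        # What a "filter" dialog might do behind the scenes.
        self.rows = [r for r in self.rows if r[column] >= minimum]
        self.script.append(f"table.drop_below({column!r}, {minimum!r})")


def replay(rows, script):
    """Re-run a recorded script on fresh data."""
    table = RecordingTable(rows)
    for line in script:
        exec(line, {"table": table})
    return table


if __name__ == "__main__":
    data = [{"income": 10.0}, {"income": 1000.0}]
    table = RecordingTable(data)
    table.drop_below("income", 100)
    table.add_log_column("income")
    print("\n".join(table.script))
```

Months later, the saved `script` lines answer the "how did I transform this?" question, and `replay` applies the same steps to new data.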

If every tool needs to be available through a GUI, then, as far as I can see, that GUI will either be a burden on creators or so generalized it's no better than tab completion.

I'm worried about seeing a script that says

    library(readr)
    library(dplyr)

    df <- read_csv(file)
    df <- df %>% mutate(log_income = log(income))
    # do manipulations in shiny
    Shiny(df)  # hypothetical call: hand the data frame to a point-and-click app
    # click "transform" in the top right hand corner...

Point and click is fine for EDA, but it’s inherently manual.

Repeating those manual steps gets boring very quickly, and is also error prone.

You can try and have the GUI generate code, but automatically generated code is awful.

We have had a few point-and-click text-based systems in the past, notably Plan 9 and the Xerox Alto, and only the latter had a complete "user friendly graphic programming environment"... I think a simpler approach like org-mode/org-babel is the best we have now out of the box.