Hacker News new | ask | show | jobs
An alarming number of scientific papers contain Excel errors (washingtonpost.com)
91 points by pns 3589 days ago
16 comments

MS Excel is absolutely unfit for most scientific and engineering problems.

The spreadsheet GUI, lack of good version tracking/history, and eagerness to coerce data types and "correct" values makes it easy to introduce errors that will go unrecognized and propagated through calculations. Unfortunately this story just keeps repeating itself.

But all of this is just a secondary concern to Excel's real trouble: it's history of incorrectly implementing numerical and statistical procedures. One could plumb the depths of this topic for hours, but here are a few highlights: regression formula accepts illegal/nonsensical inputs (e.g. collinear predictors) and gives illegal/nonsensical outputs [0], variance/standard deviation change incorrectly with sample size [0], output of a paired t-test changes when missing values are included [0], formulas are mislabeled [0], v. 2007 gives very wrong answers to 11 of 27 tests in the NIST test suite used for statistical software benchmarks [1], the random number generator was broken as late as v. 2007 [1], and calculations relying on any of 12 particular floats display an incorrect result [2]. There are plenty of other issues mentioned in the links and elsewhere; if you're interested you'll have no trouble finding them.

Remember, friends don't let friends use Excel for science. :)

[0] http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/excelr...

[1] http://www.pages.drexel.edu/~bdm25/excel2007.pdf

[2] https://blogs.office.com/2007/09/25/calculation-issue-update...

Edit: clarify and add a new issue I became aware of while researching further.

I'll just quote Soustrup here: "There are only two kinds of languages: the ones people complain about and the ones nobody uses." In particular you see the spreadsheet GUI as a downfall, but I think it is a great enabler that has allowed millions of non-programmers to create useful programs.
I would counter with "use the right tool for the job."

Excel is great at churning out fast and dirty estimates for low impact work. The problem is when it's used for large, complex, or important problems because these are just not Excel's domain--something obvious when looking at the kinds of features and bug fixes MS has prioritized over the years.

I think the Excel team disagrees with you. I remember having even seen an Excel demo about big data on channel9 (apparently if it fits in a spreadsheet, it is still big data!).

Based on the evolution of Excel, they clearly see it as a data analysis tool only, with lots of iterations of the pivot tables functionality. And clearly not as a modeling tool.

My guess on Excel usage is that it is 10% of the time used for data analysis (pivot tables, time series analysis), 20% of the time used to create a simple table (planning for the week by non technical people), and 70% of the time used for modelling (business plan, tax calculations, accounting, pricing financial instruments, running an inventory, calculating something scientific/mathematic, etc). Most companies are run on Excel.

For instance one thing where Excel sucks at one of its core functionalities is linking a powerpoint presentation to an Excel model. Any consultant, expert, banker, accountant, salesperson or marketer will do that all day. You can paste a table as a linked image but the formatting will be completely unstable and it is very dodgy. And no way to link a number in a textbox to Excel. Microsoft should focus on these problems rather than yet another pivot table functionality.

Though office seems to be frozen time. I can't think of any major new feature since Office 2007.

It is damning that you cannot use a basic computer tool for anything that is large, complex, or important. Isn't that exactly what computers are for? Call me idealistic but I want a spreadhseet that can handle milions of rows. Maybe add a "strict" mode to keep people out of trouble.
You could have an option to disable any formatting in a tab, and then Excel should be able to handle a billion rows in that tab. But you can't have all the flexibility of Excel on a massive scale. Even a laptop today is insanely fast and has an insane amount of memory.
PowerPivot in Excel does allow you to have millions of rows in tables - mind you that isn't in the standard spreadsheet functionality, but it is there.
Excel is used HEAVILY at all levels of business for large, complex, and important things. I'm curious what this right tool is that they should be using instead.
Python with a good DAG library, e.g. https://github.com/manahl/mdf (but there are plenty of others kicking around).

What this lets you do is in Python build networks of computations then change one value and it will cascade through the rest of your model, exactly like you do in Excel, but in actual code that can easily be versioned, reused, etc.

Java, for example.

Yes, it's true Excel is used for a lot of big important things, but I'd agree with GP it's not very good at them. Moreover, I think most people who use Excel for big important things, would agree it's not good for it--they get stuck in that situation when a small spreadsheet grows big over the years. Lack of version control, limited modularity, and no ability to unit test make large spreadsheets error-prone, and Excel tends to barf on you when you push it too hard (seems to be a number of crash bugs and performance bugs that never get fixed).

Excel is incredibly productive for quick analysis and making charts. But it's really not a development platform and it's really not for heavy lifting.

The primary reason Java (and other programming languages) aren't, and won't be used instead of Excel is that Excel is interactive. About the closest you can get to that with a programming language is Lisp, but it's still half-way there compared to a spreadsheet, when you can see the data all the time as it is placed and updated.

The second big reason is that Excel just works, and all programming languages usually require installing tons of bullshit - editors, toolchains, whatever - to be able to work somewhat conveniently in them, and the moment any link in the bullshit chain crashes, you're SOL if you're not a developer. Yeah, we sometimes underestimate just how much minutiae knowledge we have that allows us to fix random problems in our tools without even thinking for a second about it.

So maybe Excel is unfit for purpose, but it's still the most fit for purpose tool available.

Using Java probably won't help you with the numerical issues though, unless you happen to use a good statistical library.
In fact a lot could be said of the millions of ways to shoot oneself in the foot with programming languages. Excel being visual (the formula is evaluated/tested immediatly after being entered), I'd argue a diligent user will make less mistakes with Excel than while coding.
This is true for small spreadsheets, and one reason why I love Excel for quick analyses. Debugging is nearly instant because you see the results as you go along. But it doesn't scale very well.

--If you want to re-use a formula, you basically have to copy-paste it. Generally there isn't much modularity. Programmers are well aware of the dangers of this.

--It's also not a friendly platform for writing regression tests (possible with VB, but who does that?).

--The two-dimensional nature of the spreadsheets and lack of loops means you often need to do hacky stuff to emulate multidimensional arrays or loops. Easy to make mistakes here.

> Excel being visual (the formula is evaluated/tested immediatly after being entered)

We generally call this "interactive" and not "visual", in order to avoid confusion with graphical programming languages.

I agree that being interactive probably helps rapid iteration, debugging, and experimentation. However, Excel could still be interactive without being so fast and loose with types, accurate display of numbers, and many of the other faults others have pointed out.

Also, if you're trying to make some subtle point about the reduced severity of mistakes, you mean "lesser" rather than "less", otherwise you mean "fewer". https://duckduckgo.com/?q=less+vs+fewer

If the programs contain errors, they are not useful. They appear to be useful to someone who just wants to see some activity instead of actual work.
All programs contain errors, and almost all output of calculations is approximate.

The question is if for most of the use cases the margin of error is just fine...

I can assure you that most "traditional" engineering (thinking mostly mechanical and civil) calculations, for things you use daily, were done in excel. Anything that would be a pit too much of pain to manually bang out on a calculator, but that doesn't really require complex logic (looping etc)? Excel.

Most engineers are more similar to business people in that regards, they see traditional programming languages as too complicated. We learned MATLAB in university, two classes, basic intro to programming and then numerical methods. Many people had to retake the first class at least once.

I'd love to bang out my calculations in IPython Notebooks, but the most important requirement of engineering calculations is that they can be documented and understood by peers. Since none of my peers are interested in learning python, it's useless. Excel is the lowest common denominator, everyone gets it.

I once read this interesting contrast on HN and it stuck with me:

"Traditional programming languages show the program but hide the data. Spreadsheets hide the program but show the data".

At university, we were forbidden from using Excel, and only FORTRAN 77 was allowed. At the time, this seemed strange and anachronistic, but when you consider that it had been in use for 25 years at the time, and was a known quantity that didn't have strange and unknown bugs, it actually seems sensible in retrospect.

I do however recall being mightily annoyed that I was having to calculate ephemera in "arcane bullshit", so I can understand why many shortcut to excel.

Experience makes fools of our past selves.

A company I worked for banned Excel for analysis and made everyone use Ipython notebooks because the analysis was transparent and very easily repeatable.
You can share your Excel files and macros with other people too. Python notebooks are kind of cool but if you need something quickly to sift through the data to have an idea, Excel is an amazingly good tool for it
I'm an Ecology student, and the last few days there's been a lot of (justified) criticism of Excel. I use R and Python for data analysis, but a lot of what I do is gathering data, inputting into Excel and then exporting this as a CSV for use in R/Python for statistical analysis - are there any other tools that can replace Excel for this step?
How do you gather the data? You could learn scraping in python, for example. If you manually input the data, a simple spreadsheet is fine for that.

If you have the data in a non csv format and use excel to transform it into csv, python and R have utilities for that job (python has pandas, I don't use R as much for data cleaning, but I know it can).

Yeah, it's a lot of manual entering of data (eg finding abundance of certain species of lichens, and lat/long, then manually entering all data and exporting this as a csv either for heatmaps with GIS or for more stats analysis and the same with animal species abundance) so it's good to know I'm not missing something fairly obvious!
Using unfit programs is a hard problem. Remember this thread?

- Ask HN: Suggestions for spacewalk procedure writing?

https://news.ycombinator.com/item?id=5585535

(It is asking for alternatives to Word for EVA planning)

.... and then we get into the whole realm of simple user errors - alarmingly easy to commit in an environment that positively encourages the production of a rats' nest of essentially unauditable calculations - then then feed into important financial decisions.
I have to agree, Excel and science just don't go with one another.
First I thought this story was about math and numerical errors. But it's actually about auto- formatting and auto-correction.

"Excel automatically converting gene names to things like calendar dates or random numbers"

In this case, I think what is needed is some kind of rudimentary knowledge of data-types. Or perhaps more simply a scientific template which is actually plain text by default.

But how are people not noticing auto-correction and auto formatting taking place!

The only perfect solution is to hire a developer to build you a data entry system. The developer can build the system which they have no cause to entirely understand the science behind, and thus a human to take the blame for errors instead of excel.

The worst with these formatting is that Excel will behave differently in different regions. Another headache in large international corporations/collaborations.
But that's an advantage too. As a non English-native user, I like when companies care about my language and culture and they do not stick dates to the MM-DD-YYYY format that is used by Americans only. Yes, of course, we should all go for YYYY-MM-DD... fine by me, but now you go telling my mom :¬)
I've found it hard to get Excel to turn off the coercions. In telecom tons of inter-company data is exchanged in Excel or CSV files. And very often numbers show up in scientific notation. Or timestamps (10.2 minutes) get coerced into dates. It's infuriating. Sometimes the manual formatting to no format works and sometimes it doesn't. I feel dumb admitting that I can't seem to figure something so basic out.

Excel should really offer an easy way around this.

My experience is the same, also in telecom/networking. One of our vendors' serial numbers, when entered into Excel, get converted to something else entirely (not sure what exactly), and setting the fields to "text" or "numbers" doesn't seem to do much, consistently.
How people don't notice: you have a list of thousands of gene names. It's impossible to be familiar with all of them, so it's hard to notice if one of them gets auto-corrected to something else.

The SEPT9 gene is problematic enough to be memorable, though. https://en.wikipedia.org/wiki/SEPT9

https://help.libreoffice.org/Calc/Deactivating_Automatic_Cha...

Type apostrophe at beginning of the gene name ('MARCH1) or format the column for gene names as text (click column letter, then Format | Cell and select text)

If people want to use a spreadsheet application for this kind of data collection (and that is a big if I think) then they perhaps need to have some agreed lab protocols for setting up and checking the spreadsheets. This is a known issue in financial circles...

http://www.eusprig.org/basic-research.htm

Ah, you beat me to EUSPRIG. A fine organisation whose research findings give me the heebie-jeebies.
When my Ph.D is finally done (~3 months), I'll post some of the code I've had to work with daily for the past three years.

"Spaghetti" doesn't even begin to describe it. "Ball of yarn under a cat-lady's sofa" comes readily to mind, as does gouging my eyes out and amputating my fingers.

The problem isn't excel. The problem is scientists.

The way I got people in my lab to start complying with some best practices is by linking everything back to bench top work. It would be unacceptable to run in vitro experiments the way people write software in science. You need to run controls (unit tests), you need to write your procedure down in your lab notebook (comments, version control) etc
During grad school, I found that MATLAB, in particular, lends itself to spaghettification.

I can think of a lot of reasons why that is, but the one-function-per-file and single flat directory structure of MATLAB programs is part of it.

Language quirks are another, but I could write an entire book about that.

>I found that MATLAB, in particular, lends itself to spaghettification.

How did you guess? ;)

I've been gently pointing people towards python for this exact reason. The younger generations need little convincing, but the old dogs would rather write the same shitty code.

I suppose change takes time.

Python has definitely been a much better way forward for my work!
Why is the auto-convert 'features' in Excel not opt-in?

When Excel encounters the first cell in a new sheet that it thinks should be auto-converted, why does it not ask if that is desirable for that sheet?

Like: "Do you want Excel to interpret and auto-convert all strings with format <X> into the type <Y> in this sheet?"

At least for conversions where the original data is lost.

I don't think Excel ever destroys the underlying data. It simply layers formatting over it. One can override by specifying custom formatting, or opt out by selecting all cells and setting their format to text.
Once the data is in the cell, changing cell formatting is a nondestructive visual layer. But while inputting data, cell formatting changes the interpretation of the input, which is destructive.

The article describes in some detail how inputting SEPT2 in a cell with default formatting displays 9/2/2016, but is stored as 42615 (which you get if you later change the cell to text formatting).

If you enter a string whose first character is a "0" and Excel decides to change it to a number, the "0" is lost.
Commas are an exception to that. If you paste in "10,100", it will be interpreted as a number, and you lose the comma.
Highly relevant: Felienne[0] Hermans' compsci research on spreadsheets out there in the wild, and how to develop software engineering tools to make them better:

https://www.youtube.com/watch?v=2Cdgew5zvI4

http://www.felienne.com/archives/tag/spreadsheets

[0] pronounced Fay-lee-nuh

Same thing hapened in Economics involving a major figure in Economics.

http://www.bloomberg.com/news/articles/2013-04-18/faq-reinha...

Yes, they did not get what the world "hide" means in English...
It's funny to consider that these errors slipped past the peer review stage. It really highlights the major issue with reviewing source code published as part of an analysis.

If there aren't enough resources / skilled eyes to catch these simple errors, what are the chances they would catch errors in source code too?

How many of the studies are doing anything statistically interesting though [ie new/different in the field of statistics]. In most cases it should be possible to have an analysis program, perhaps specialised to the field, in which you simply load the data attached to the study and look at however many regressions or time series or whatever you want - they're standard tools, surely you can download the data and put it in a GUI and look at the same graph as is in the paper in 2 or 3 clicks?? If not why not?
Anyone doing serious statistics uses SPSS or R, or similar stats programs. If not, you deserve all the bad data you get. Using Excel for that is akin to using a point and shoot camera for a fashion photoshoot, or a crossover car offroad in Death Valley.
Consider that given the poorly conducted statistical analyses, p-hacking etc that goes on in the life sciences, Excel garbling gene names might actually improve the net accuracy of the results by removing false positives.
Is everybody a washington post subscriber? Or have I missed the route around the paywall somehow?
I read all my WP articles via right-click "Open link in incognito window".

(OT now:) If anyone thinks that's cheating them out of money - the choices are not "read for free" and "read for the appropriate price", the choices are "read for free" or "don't read". The reason I read is to procrastinate, so the value I get out of it is actually negative (same with HN...). Even their occasionally excellent articles (like their series on asset forfeiture) is stuff I'm at most mildly interested in (as a foreigner) to distract myself. That's my issue with today's media, I don't actually feel "informed" as in "it this is good for my life that I know these things". I can't do anything about 99.99% of the stuff I read about anyway, nor is it a representative sample of reality but consists almost solely on reporting the outliers.

>The reason I read is to procrastinate, so the value I get out of it is actually negative (same with HN...). //

Entertainment has value doesn't it? It's not a simple financial value like the simplistic use of opportunity cost as being the price you could bill those hours at, but it's a useful and functional part of being human.

I don't have a problem with you reading something people broadcast to the public internet though.

    > Entertainment has value doesn't it?
If it's procrastination the net value is negative. You may put any value you like on the "entertainment" - but what it displaces has higher value.

If I delay my work (which I do, even now) the overall value of writing comments on HN or reading a WP article that talks about issues that don't directly concern me and that I cannot do anything about is negative, even to myself (and don't try to argue they may concern me indirectly because, well, everything does).

It's like being addicted to drugs: Sure you can argue if the drugs (and let's assume those especially crazy and destructive ones) had no value to the person taking them they would not take them, but a more appropriate model than high school economics would be the neuroscience of addiction. But even if you decide to stick to using an economic model you would have to take a very narrow view - like picking exactly the period where a stock was rising to show how great a pick that company is - to argue the person gets a positive value from taking those drugs.

>You may put any value you like on the "entertainment" - but what it displaces has higher value. //

Disagree. This conversation has a value, I [likely] can't derive anything financial from it, one might term it "entertaining" even [that wasn't supposed to sound quite so denigrating!]. The value is difficult to define, but it doesn't remove value from my life IMO. It possibly takes some time with which I can argue for opportunity cost, but I see the conversation as a generally positive thing.

I'm not sure if a mere conversation can be equated so easily to the value positions involved in drug addiction. However, I would say that it's a mixed bag. Some aspects that come out of drug addiction can have positive value - I'm thinking the progression of the arts: some great works of literature, paintings, dramatic performances, appear to have at least some relationship to the artists drug use [and in some cases addiction, it's hard to know where the divide is].

>"This is no more true than to say that Van Gogh was only Van Gogh because of his inner turmoil or than Jean-Michel Basquiat needed heroin to draw or paint. But it is also worth remembering that it killed them both." (http://www.worldcrunch.com/culture-society/under-the-influen...)

Similar ground with a greater focus on musicians - http://blogs.scientificamerican.com/mind-guest-blog/creativi....

You can disagree all you like - on what basis??? How do YOU know what I'm doing and what my time is worth? This discussion certainly has zero value - more like a negative value to have to encounter so annoying. Ridiculous!

    > This conversation has a value
I covered that!!! Do you actually READ the comments you respond to? I mean, without the filter that removes the things that don't fit your narrative?
There is 'web' link under submission.

32 points by pns 6 hours ago | flag | hide | past | --> web <-- | 17 comments | favorite

https://www.google.com/search?q=An%20alarming%20number%20of%...

Google Search usually gets around paywalls

That usually works for me (frequently getting me to a different AP news site), but not with this particular article. I have access to it now, but I thought I'd pass on the FYI.
I opened the developer tools and added `style="display:none" to the relevant layer, and unchecked the "overflow:hidden" style. Given that this is HN I suspect many other did the same.
Should decison maker's spreadsheets in business, policy, and government be peer-reviewed in the same way as scientific papers?

Disclaimers: 1) Yes, scientific peer review needs improvement. 2) Yes, spreadsheets are not ideal for science... what makes business less important?

I wonder if that also applies to DNA tests used in criminal investigations...
An alarming number of business spreadsheets contain Excel errors. But it's the linga fraca of businesses, departments & teams everywhere.
*Excel mutations