Hacker News new | ask | show | jobs
by jasode 4297 days ago
>There is no excuse for a spreadsheet quietly taking a never-assigned cell as zero, but indeed it does. WHAT THE HELL WERE THESE PEOPLE SMOKING?

I think the head scratching about Excel's (and probably other spreadsheets) behavior on empty cells is misguided. Yes, if you have a computer scientist mindset then it might seem very mathematically satisfying to have strict rigorous logic around empty cells but we're getting into 3-valued NULL handling like databases. That type of extra logic checking is misplaced considering the origins of spreadsheets (e.g. Dan Bricklin's VisiCalc) as a business computation tool instead of a GUI for an E.F. Codd relational db.

If a particular power user finds the typical "null" cell behavior to be "wrong", he can add formulas to his cells such as:

=IF(ISBLANK(C3), NA(), C3)

... such that #NA# (Excel's quasi "null") propagates throughout any SUM() and other calculations.

However, imagine if the situation was reversed and we had an alternate universe where Excel propagated empty cells as #NA# instead of zero to satisfy the logic of computer scientists. We'd inevitably have:

1) Microsoft adds a button wizard on the toolbar to bulk fill "null" cells with 0.

2) User forums with power-users trading VBA macros to fill in zeros of empty cells. "Ugh! How do I get rid of all these useless #NA# littering my spreadsheet???"

3) Microsoft adds an user option checkbox to "treat empty cells as zero instead of #NA#" -- which everyone ends up activating as a defacto setting. This cancels out the strict null handling the computer scientists were imposing on the spreadsheet!

I contend this alternate universe scenario is much worse for business users of Excel. In other words, you can't "force fit" the concept of NULL as a default where it is not natural. The end users will just work their way around it.

4 comments

Yeah, hate to break it to the OP, because God knows I speak against the use of spreadsheets whenever possible...but the most common use-case for a spreadsheet is not as part of a reproducible, automated data pipeline. Many times, it's hand-entered data entry, with the goal of making one of the pre-baked Excel visualizations. For users who want type-checking, there's Access, which is more attuned to handling non-trivial datasets.
> Many times, it's hand-entered data entry

Yes, many companies pays good money for people to update those Excel sheets.

Me? I see almost every single Excel sheet that is updated over more than a week as a possible sale. But I'm not replacing Excel spreadsheets full time, at least not yet : )

It's just a good thing no one ever makes important decisions based on spreadsheets.
I don't understand why its usual to assume that the only two choices are

  * show N/A everywhere
  * silently ignore sources of potential errors.
Why not make the calculation while adding a visual cue (e.g. a yellow or red indicator in the corner) that marks the field as tainted? Now the users get to decide whether they want to investigate or ignore that.
You mean like Excel actually does? http://i.imgur.com/kMPKnqO.png
Does that propagate, or is it only one hop?
Yes.
> if you have a computer scientist mindset...

Even things written by computer scientists sometimes follow the empty-is-zero model for convenience. awk's arrays behave very similarly to Excel's cell grid in this particular manner: they are conceptually of infinite size, and any element not yet set is implicitly 0. Hence you can calculate word-counts with something along the lines of count[$i]++ without first checking whether count[$i] exists; and you can also sum the counts of a predefined list of words, without throwing in if-count[$i]-is-defined checks.

You can't accuse Alfred Aho (the 'A' in 'awk') of not being a computer scientist. Yet in awk, nonexistent varaibles serve as zero or empty strings, and strings that look like numbers can be used for arithmetic. This is a matter of pragmatism which depends on the goal of the tool and the target users.

No doubt, it poorly serves users who want to create huge, complicated spreadsheets, or large programs.

Spreadsheets could address the problem by having this as a cell property. Suppose you could highlight a rectangular region of the spreadsheet and mark the whole region as having "strict initialization": any cells in the region which are empty will trigger a diagnostic if they are accessed.

What if instead of "propagating empty cells as #NA everywhere", instead the functions operating on cell values returned #NA by default if an empty cell was present in the input, but also included an optional parameter which was either a true/false flag to treat the value as the appropriate identity (makes sense for things like SUM, etc.) or which provides a default to use in the case of an empty cell (which may make more sense to use for functions that aren't the append function of a monoid.)

(Though I think the failure to use exact arbitrary precision numeric values except where an imprecise operation forces fallback to limit precision binary floating-point approximations -- and then visually distinguishing when the latter has occurred -- is a bigger problem for common spreadsheet use cases than the bad null handling.)