by fiatmoney 4694 days ago
It's not a terrible idea to support the absolute basics like mean & variance, but anything beyond that (particularly things like models or tests) is not a good idea for a standard library. Once you hit even something simple like a linear regression you have issues of how to represent missing or discrete variables, how to handle collinearity, and whether to use online or batch modes, which can give different results. Tests in particular are fraught because if you're going to make them available for general consumption they need a good explanation of when they're appropriate, which is basically a semester course in statistics and well out of scope for standard library docs.

Basically, the idea of "batteries included" should also mean that if something looks like you can put a D-cell in there, you're unlikely to blow your arm off.

4 comments

The PEP suggests that the functionality would be comparable to that in a high school calculator or in Excel/LibreOffice/Gnumeric. The existence of that functionality suggests to me that a stats package can be useful even if it doesn't handle things like missing data.

Similarly, Excel/etc. support these functions without a "semester course in statistics." Instead, you'll find that there are many web pages from semester courses in statistics which end up teaching how to use Excel. The same would no doubt happen with Python.

I don't see why a statistics standard library module needs to provide a "good explanation of when they're appropriate" to a higher standard than any other module. Python provides trigonometric and hyperbolic functions without teaching trigonometry. It provides complex numbers and cmath without teaching people about complex numbers. It provides several different random distribution functions without teaching anything about Pareto, Weibull, or von Mises distributions.

For that matter, data structures is a semester course as well, but the Python documentation doesn't teach those differences in its documentation of deque, stack, hash table, etc., nor describe algorithms like heapq and bisect.

"whether to do online or batch modes which can give different results". The PEP says it will prefer batch mode:

      Concentrate on data in sequences, allowing two-passes over the data,
      rather than potentially compromise on accuracy for the sake of a one-pass
      algorithm
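
To make the one-pass vs. two-pass accuracy concern concrete, here is a minimal sketch. The function names and the sample data are illustrative assumptions, not anything taken from the PEP, and the "one-pass" version is the naive textbook formula rather than a careful streaming algorithm such as Welford's.

```python
def variance_one_pass(data):
    """Naive single-pass population variance: E[x^2] - (E[x])^2."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def variance_two_pass(data):
    """Two-pass population variance: find the mean first, then sum squared deviations."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


# Hypothetical data: small spread on top of a large offset.
# The exact population variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

one_pass = variance_one_pass(data)
two_pass = variance_two_pass(data)

print("one-pass:", one_pass)
print("two-pass:", two_pass)
```

With 64-bit floats the two-pass version recovers 22.5 on this input, while the naive one-pass formula loses the answer to cancellation because the squared values are too large to be stored exactly.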
By that logic, you should get rid of just about everything in standard libraries. A simple RNG? Anything related to floating point? Multi-threading? Character encodings? Unicode normalization? Time and date handling? All fraught with danger for those who do not know, and "basically a semester course and well out of scope for standard library docs".

Surely, it would be better to supply good implementations of algorithms rather than refrain from doing that, and letting programmers write and use bad ones instead?

IMO, the discussion should be about what could/should end up in the _standard_ library, and what is better put in a separate product/download.

Agreed. I'm studying statistics at the moment and I'm continually reminded of how easy it is to choose the wrong model / distribution and be incorrect because of some non-obvious and technical reason. For example, just the other day, I wanted to use the binomial distribution to solve a problem. To use this distribution, the trials must be independent of one another. In that particular problem, there was a subtle condition that made the trials non-independent. I arrived at correct-appearing answers (0 <= P <= 1) that were actually all wrong. Statistics is way too easy to break to be used naively.
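
A rough sketch of this kind of mistake, using a made-up scenario (not the commenter's actual problem): drawing 5 items without replacement from a batch of 10 that contains 4 defective ones. The draws are not independent, so the binomial formula gives a plausible-looking but wrong probability; the hypergeometric formula is the appropriate one here. Function names and numbers are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def hypergeometric_pmf(k, population, successes, draws):
    """P(exactly k successes when drawing without replacement)."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


# Hypothetical batch: 10 items, 4 defective, draw 5 without replacement.
# Question: probability that exactly 4 of the drawn items are defective.
wrong = binomial_pmf(4, 5, 4 / 10)         # pretends the draws are independent
right = hypergeometric_pmf(4, 10, 4, 5)    # accounts for the dependence

print("binomial (wrong model):", wrong)
print("hypergeometric:", right)
```

Both results fall between 0 and 1, so nothing looks broken, yet the binomial answer is roughly three times too large.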
> Statistics is way too easy to break to be used naively.

Fair enough, but the same argument could be made about using an unskewed standard distribution on non-symmetrical datasets, a common error even among people who should know better.
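
As a small illustration of how a summary built for symmetric data misleads on a skewed dataset, here is a sketch using Python's `statistics` module (the module that eventually came out of this PEP, available since Python 3.4). The data are invented for the example.

```python
import statistics

# Hypothetical right-skewed sample: mostly small values plus one large outlier.
data = [1, 2, 2, 3, 3, 3, 4, 100]

mean = statistics.mean(data)      # pulled far to the right by the outlier
median = statistics.median(data)  # stays with the bulk of the data

print("mean:", mean)
print("median:", median)
```

On symmetric data the two numbers would roughly agree; here the mean sits far above every value except the outlier.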

I think binomial functions should be included, on the ground that they're very useful and their probability of misuse is only equal to the continuous statistical forms, not more so.

Hell, sometimes they use a dictionary when they should be using a list. Almost everything can be used wrongly by a beginner, which doesn't mean it shouldn't be there.

I think having a basic stats module always handy would be very convenient.

> I think having a basic stats module always handy would be very convenient.

I absolutely agree. My only point was that these tools are sometimes misapplied, not at all to argue that they shouldn't be readily available. They should be.

While you are correct, it sounds like his point was that even to someone moderately skilled it is easy to make a mistake that makes your work __completely__ invalid, rather than merely inefficient or too-complicated.

  ...how to represent missing or discrete variables...
Don't. Just say no. Just give me the simple easy stuff. Most of us will be fine, and everyone else will know they need something better and won't bother.