Hacker News new | ask | show | jobs
by infimum 1734 days ago
scikit-learn (next to numpy) is the one library I use in every single project at work. Every time I consider switching away from python I am faced with the fact that I'd lose access to this workhorse of a library. Of course it's not all sunshine and rainbows - I had my fair share of rummaging through its internals - but its API design is a de-facto standard for a reason. My only recurring gripe is that the serialization story (basically just pickling everything) is not optimal.
3 comments

I recently ran into this issue as well. Serialization of sklearn random forests results in absolutely massive files. I had to switch to lightgbm, which is 100x faster to load from a save file and about 20x smaller.
What's a typical task you do with sklearn? Just trying to get inspired about what it can do
There is so much wrong with the api design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper api; many of them without a CS background.[1]

[1] https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...

These seem like minor gripes (reading your link) - and I don't even agree with them, seems like an ok use of mutable state (otherwise a separate object would be needed for hyperparameter state?). Maybe my expectations are low, but they way sklearn unifies the API across different estimators all across the library - that's already way above what you can expect - especially if you consider it to be "written by a bunch of phd students".
I didn't want to bag on sklearn (I've already bagged on pandas enough here), but for what it's worth I agree with you. It's, ahh, not the API I would've come up with. It's what everybody has standardized on, though, and maybe there's some value in that.