by rlayton2 1162 days ago
Great to hear. Scikit-learn also typically uses joblib for model persistence: https://scikit-learn.org/0.18/modules/model_persistence.html

Joblib is also used internally for models to be run in parallel, e.g. https://github.com/scikit-learn/scikit-learn/blob/1834cd6b76...

(End users typically just pass n_jobs a value other than 0 or 1 to use it, e.g. n_jobs=-1 for all available cores.)

1 comments

Note that joblib serialization is pickle-based and therefore has the same security implications as any pickle file: treat loading a joblib or pickle file like running a compiled executable, and never do it if you do not trust the source.
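
To make the risk concrete, here is a minimal stdlib-only sketch. The Payload class is hypothetical and the eval call is a harmless stand-in for whatever an attacker would put there; the point is only that the loader calls whatever callable the file names.

```python
import pickle


class Payload:
    # pickle calls __reduce__ when dumping. The (callable, args) pair it
    # returns is what the loader invokes to "rebuild" the object.
    def __reduce__(self):
        # Harmless stand-in for attacker-controlled code.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not give back a Payload: it calls eval("6 * 7").
result = pickle.loads(blob)
print(result)
```

Nothing in the loading code looks dangerous, which is why the only safe rule is to not load files from sources you do not trust.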

A newer, safer alternative for scikit-learn model persistence is skops:

- https://skops.readthedocs.io/en/stable/persistence.html

It lets you declare a list of Python object types that you trust to be safe to load, and it refuses to load skops files that contain untrusted types.
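
The allow-list idea can be illustrated with the standard library alone. This is not how skops is implemented and it is not its API; it is a rough sketch that assumes a hypothetical ALLOWED set and overrides pickle.Unpickler.find_class, the hook the pickle documentation suggests for restricting globals.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allow-list of (module, name) pairs we choose to trust.
ALLOWED = {("collections", "OrderedDict")}


class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"untrusted type: {module}.{name}")


def restricted_loads(data):
    return AllowListUnpickler(io.BytesIO(data)).load()


# A trusted type loads normally.
trusted = restricted_loads(pickle.dumps(OrderedDict(a=1)))

# A stream that references builtins.eval is refused.
try:
    restricted_loads(pickle.dumps(eval))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

A plain allow-list like this is only as safe as the types on it, so it should stay as short as possible.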

Also note that with Python 3.8+ and pickle protocol 5, it is now just as efficient to do:

  import pickle

  with open("model.pkl", mode="wb") as f:
      pickle.dump(trained_model, f, protocol=pickle.HIGHEST_PROTOCOL)

  with open("model.pkl", mode="rb") as f:
      trained_model = pickle.load(f)

With protocol 5, pickle from the standard library can store and load the large data buffers often found as attributes of scikit-learn models (typically large numpy arrays) without extra memory copies. joblib.dump and joblib.load were designed to do the same, but with a few hacks that violate the official pickle protocol.

For reference, pickle protocol 5 was specified and implemented as part of:

- https://peps.python.org/pep-0574/

Protocol 5 also provides an extra API to handle large data buffers externally ("out-of-band") via custom callbacks. This comes in addition to the no-copy memory optimization that applies when such arrays are stored or loaded "in-band", without custom callbacks.
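
As a rough sketch of the out-of-band API, using only the standard library: a bytearray stands in for a large numpy array here (an assumption for the sake of a dependency-free example), wrapped in pickle.PickleBuffer so that the buffer_callback can take it out of the pickle stream.

```python
import pickle

# Stand-in for a large array attribute of a model (assumption: 1 MB of zeros).
raw = bytearray(1_000_000)
model = {"weights": pickle.PickleBuffer(raw)}

# In-band: the buffer contents are written into the pickle stream itself.
in_band = pickle.dumps(model, protocol=5)

# Out-of-band: the callback receives each PickleBuffer instead, and the
# stream only records a placeholder for it.
buffers = []
out_of_band = pickle.dumps(model, protocol=5, buffer_callback=buffers.append)

# The same buffers must be supplied again, in order, at load time.
restored = pickle.loads(out_of_band, buffers=buffers)

# The restored entry still points at the original memory: no copy was made.
same_memory = memoryview(restored["weights"]).obj is raw
print(len(in_band), len(out_of_band), same_memory)
```

How the collected buffers are transported (shared memory, a separate file, a socket) is left to the caller, which is the point of the out-of-band design.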