
by rlayton2 1162 days ago

Great to hear. Scikit-learn also typically uses joblib for model persistence: https://scikit-learn.org/0.18/modules/model_persistence.html

Joblib is also used internally for models to be run in parallel, e.g. https://github.com/scikit-learn/scikit-learn/blob/1834cd6b76...

(End users typically just pass a value for n_jobs other than 1, such as -1 for all cores, to use it.)


ogrisel 1161 days ago

Note that joblib serialization is pickle-based and therefore has the same security implications as any pickle file. Treat loading a joblib or pickle file like running a compiled executable: never do it if you do not trust the source.
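To make the risk concrete, here is a minimal sketch using only the standard library. The `Payload` class is hypothetical and its payload is deliberately harmless (it only evaluates an arithmetic expression), but any importable callable could be put in its place:

```python
import pickle


class Payload:
    """Hypothetical class: __reduce__ tells pickle what to call on load."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" in the byte stream.
        # An attacker would reference something like os.system instead.
        return (eval, ("6 * 7",))


malicious_bytes = pickle.dumps(Payload())

# Loading does not need the Payload class at all: the unpickler simply
# calls the callable recorded in the stream.
result = pickle.loads(malicious_bytes)
print(result)  # 42, computed by code that ran inside pickle.loads
```

The same applies to joblib.load, since it goes through the pickle machinery.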

A new safer alternative for scikit-learn model persistence is skops:

- https://skops.readthedocs.io/en/stable/persistence.html

It lets you specify a list of Python object types that you trust as safe to load, and it refuses to load skops files that contain untrusted types.
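The allow-list idea can be sketched with the standard library alone. To be clear, this is not the skops API or its implementation; it is the "restricting globals" pattern from the Python pickle documentation, and the `TRUSTED` set, `AllowListUnpickler`, `safe_loads` and `Payload` names are all hypothetical:

```python
import io
import pickle

# Hypothetical allow-list: the only globals this loader will resolve.
TRUSTED = {("builtins", "range"), ("builtins", "complex")}


class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream references a class or function.
        if (module, name) in TRUSTED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"untrusted type: {module}.{name}")


def safe_loads(data):
    """Unpickle bytes, refusing any global that is not in TRUSTED."""
    return AllowListUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for malicious code


# Plain containers and trusted types load normally.
print(safe_loads(pickle.dumps({"coef": [0.5, 1.5], "span": range(3)})))

# A stream that references an untrusted callable is rejected before the call.
try:
    safe_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as exc:
    print("refused:", exc)  # refused: untrusted type: builtins.eval
```

For real scikit-learn models, the skops documentation linked above is the place to look for the actual loading functions and how trusted types are passed.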


ogrisel 1161 days ago

Also note that with Python 3.8+ and pickle protocol 5, it is now as efficient as joblib to do:

  import pickle

  with open("model.pkl", mode="wb") as f:
      pickle.dump(trained_model, f, protocol=pickle.HIGHEST_PROTOCOL)

  with open("model.pkl", mode="rb") as f:
      trained_model = pickle.load(f)

With protocol 5, pickle from the standard library can store and load the large data buffers often found as attributes of scikit-learn models (typically large numpy arrays) without extra memory copies. joblib.dump and joblib.load were designed to do the same, but with a few hacks that violate the official pickle protocol.


ogrisel 1161 days ago

For reference, pickle protocol 5 was specified and implemented as part of PEP 574:

- https://peps.python.org/pep-0574/

It also provides an extra API for handling large data buffers externally ("out-of-band") via custom callbacks. This is in addition to the no-copy memory optimization that applies when loading or storing such arrays "in-band", without custom callbacks.
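A small sketch of the in-band versus out-of-band difference, using only the standard library. A `bytearray` wrapped in `pickle.PickleBuffer` stands in for a large numpy array here (an assumption made for illustration; real arrays are expected to hand their data to pickle in a similar way), and the `model_state` dict is hypothetical:

```python
import pickle

# Stand-in for a large array attribute of a trained model.
weights = bytearray(1_000_000)
model_state = {"name": "demo", "weights": pickle.PickleBuffer(weights)}

# In-band: the buffer's bytes are embedded in the pickle stream itself.
in_band = pickle.dumps(model_state, protocol=5)

# Out-of-band: the callback receives each PickleBuffer, and the stream
# only records a placeholder for it.
buffers = []
out_of_band = pickle.dumps(
    model_state, protocol=5, buffer_callback=buffers.append
)

print(len(in_band) > len(weights))  # True: payload is inside the stream
print(len(out_of_band) < 200)       # True: payload travels separately

# The same buffers must be supplied, in the same order, when loading.
restored = pickle.loads(out_of_band, buffers=buffers)
print(bytes(restored["weights"]) == bytes(weights))  # True
```

The callback is free to ship each buffer however it likes (shared memory, a separate file, a network transport), which is what makes the out-of-band path useful for large arrays.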
