by bunderbunder 1003 days ago
Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.

In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
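For anyone wondering why pickle gets this reputation beyond the maintainability complaints: loading a pickle can run arbitrary code. Here is a minimal, deliberately harmless sketch using only the standard library. The `Payload` class and the string passed to `eval` are made up for illustration; a real attack would swap in something far less friendly.

```python
import pickle


class Payload:
    # pickle calls __reduce__ to learn how to rebuild an object.
    # It may return ANY callable plus arguments, and the loader will call it.
    def __reduce__(self):
        # Harmless stand-in for attacker-controlled code.
        return (eval, ("'code ran during load: ' + str(6 * 7)",))


blob = pickle.dumps(Payload())

# The loader never rebuilds a Payload; it calls eval(...) and returns the result.
result = pickle.loads(blob)
print(result)
```

The point is that `pickle.loads` is effectively "execute whatever the file says", which is why untrusted model files in this format are a problem.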

3 comments

You should check out safetensors. They are used widely in diffusion models and LLMs https://huggingface.co/blog/safetensors-security-audit
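To give a feel for why this kind of format sidesteps the pickle problem, here is a rough stdlib-only sketch of the layout as I understand it from the published format description: an 8-byte little-endian header length, a JSON header describing each tensor, then one raw byte buffer. The helpers `dump_tensors` and `load_tensors` are hypothetical names for this example and are not the safetensors library API; the real library also validates offsets and handles details this skips.

```python
import json
import struct


def dump_tensors(tensors):
    """tensors maps name -> (dtype, shape, raw_bytes). Returns the file bytes."""
    header, chunks, offset = {}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            # Offsets are relative to the start of the data buffer.
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian unsigned length, then the JSON header, then the data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(chunks)


def load_tensors(blob):
    """Parse the bytes back into name -> (dtype, shape, raw_bytes)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data = blob[8 + header_len:]
    out = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        out[name] = (info["dtype"], tuple(info["shape"]), data[begin:end])
    return out


weights = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # a 2x2 float32 tensor
blob = dump_tensors({"layer.weight": ("F32", (2, 2), weights)})
restored = load_tensors(blob)
```

Loading is just JSON parsing plus byte slicing, so nothing in the file gets executed. The tradeoff is that it only stores tensors, not arbitrary Python objects.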
ONNX[0], model-as-protobufs, continuing to gain adoption will hopefully solve this issue.

[0] https://github.com/onnx/onnx

ONNX is cool, but it still only supports a minority of scikit-learn components. Some of them simply aren't compatible with ONNX's basic design.
at work we use the ONNX serialisation format for all of our prod models. Those get loaded by the ONNX runtime for inference. works great.

perhaps it'd be viable to add support for the ONNX format even for use cases like model checkpointing during training, etc?