Hacker News new | ask | show | jobs
by rfoo 763 days ago
.pkl files were, are, and will still be a a common way of sharing data among Python users. Despite it is known to be unsafe since forever and nobody claimed a CVE for this fact.

A few years back I have heard from a lot of people working in ML communities that they are surprised that `numpy.load` is able to execute arbitrary code.

6 comments

There are lots of Python pickle remote code execution CVEs https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=pickle
> A few years back I have heard from a lot of people working in ML communities that they are surprised that `numpy.load` is able to execute arbitrary code.

This is correct, before version 1.16.3 (April 2019) `numpy.load` was unsafe by default, unless explicitly specifying `allow_pickle=False`. However, to be clear, that unsafe default was then fortunately changed. Loading numpy arrays with `numpy.load` should now be safe (unless there are yet-to-be-found bugs in that code).

> Despite it is known to be unsafe since forever and nobody claimed a CVE for this fact.

There have been dozens, if not _hundreds_, of CVEs filed on issues related to pickle and RCE.

Here is a small sample:

https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=pickle

Those are CVEs on other software that use pickle in insecure ways. Not on pickle itself.
In applications using pickle on untrusted data, that's a big distinction. There are a huge number of similar java and c# object serializationg bugs as well.
There aren't in C#. Neither Newtonsoft.JSON (by default) nor System.Text.Json (at all) allow uncontrolled deserialization. Pretty much no code ever defaulted to Newtonsoft's TypeNameHandling.Auto and community has always been aware of its dangers, espcially in light of the incidents like Log4J.

And BinaryFormatter has been long ago deprecated (and now it got completely removed, in the form of a breaking change, something that pretty much never happens otherwise), and even when it was in use (more than a decade ago, popularity-wise), the use of type binding was heavily encouraged.

C# is pretty hard-nosed about serialization.

E.g. My discovery the other day that out of the box C# System.Text.Json can't serialize System.Exception without writing a custom serializer [0] (since 2020, because .NET fix speed...). NewtonSoft handles it fine. (Had wanted a quick-and-dirty debugging dump of properties)

[0] https://github.com/dotnet/runtime/issues/43026

A serializer cannot make a reasonable assumption about how an exception should be serialized on the user's behalf, let alone deserialized. Newtonsoft had and still has quite a few problematic defaults where people can inadvertently weaken privacy and security of the implementation, which System.Text.Json is opinionated in solving.

If you are okay with risks that come from including exception's message in data sent over the network (e.g. not publicly exposed), then defining a custom converter is trivial (it's like 10-15 lines and adding it to serializer options), or you could simply .ToString/.Message it and include that in the payload instead. It's a minute thing.

As for exception deserialization, that's a gross feature misuse and not something that should be done.

To me, it feels like an abuse of "you shouldn't be doing that"-ism.

A serializer should generate a sane serialization of whatever I throw at it. Or at least have an option that allows me to force that.

If I then choose to send that serialization somewhere unreasonable, that's on me.

In my case, I was hacking in C#-on-top-of-another-environment, so didn't have full access to reimplement stuff, without jumping through additional hoops.

That said, absolutely agreed on de-serialization, as larger opportunities for footguns abound.

I was thinking of BinaryFormatter and NetDataContractSerializer, etc. unsafe .NET object deserialization. I'm sure the default JSON serializer in C# is safe (lmao language fanboys)

https://github.com/pwntester/ysoserial.net

Yes but the fact that R was apparently able to fix the issue at all is a bit strange then. You can't "fix" pickle code execution.
Weird. I don't think I've ever relied on pickle for sharing data. It's too version specific. I always dump to json, or similar.
See it all the time.
CVE-2019-6446 seems to be in the right ballpark.