Hacker News new | ask | show | jobs
by cmeacham98 763 days ago
This is uncharitable.

From what I can tell, these RDS files are a common way of sharing data among R users. I would be relatively surprised if reading someone else's dataset was able to execute arbitrary code.

I think this is more like if reading a CSV via numpy could execute code.

3 comments

RDS files are a common way of sharing serialized R objects. Promises are valid R objects and supported by this serialization format. They always have been and I believe it is an intentional feature. The problem is that some people may think of RDS files as more convenient CSV files, but they are not.
CSV is CSV. A serialized object is a serialized object. The main concern they cite, are supply chain attacks. So it’s like saying loading a package can… load a package. Supply chain attacks will always be a thing. I’m grateful for the work of the researchers in question but don’t feel this is much of a blemish when it comes to R itself being insecure.
I think the researchers didn’t identify the main vulnerability. They should have talked about the risk of remote code execution from reading serialized objects from untrusted sources, when the R programmer thinks they are reading data but they are actually running code. This mistake has led to huge numbers of remote code execution vulnerabilities in all sorts of object deserialization libraries; it’s a much more common threat than supply chain attacks.
It’s true that it’s always been that way, but there are other common but unsafe ways of doing things that people eventually stopped using. Some pressure to deprecate and migrate away from unsafe API’s seems good.
Is there another way to load a saved dataset in R though, so that it can't execute anything?
Save it in the usual text-based formats, like a CSV or JSON. Outside of packages, which use serialized data by default for good reasons, I haven't seen many people loading strangers' RDS or RData files.

If an attacker can control a package's rdb and rdx files, it's game over. They could just stick an `.onAttach` function in that does whatever they want when the package is loaded directly or imported by another package.

The fact that they had to mess with unbounded promises, and that the bug got fixed suggests you normally can't run any code from load().
.pkl files were, are, and will still be a a common way of sharing data among Python users. Despite it is known to be unsafe since forever and nobody claimed a CVE for this fact.

A few years back I have heard from a lot of people working in ML communities that they are surprised that `numpy.load` is able to execute arbitrary code.

There are lots of Python pickle remote code execution CVEs https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=pickle
> A few years back I have heard from a lot of people working in ML communities that they are surprised that `numpy.load` is able to execute arbitrary code.

This is correct, before version 1.16.3 (April 2019) `numpy.load` was unsafe by default, unless explicitly specifying `allow_pickle=False`. However, to be clear, that unsafe default was then fortunately changed. Loading numpy arrays with `numpy.load` should now be safe (unless there are yet-to-be-found bugs in that code).

> Despite it is known to be unsafe since forever and nobody claimed a CVE for this fact.

There have been dozens, if not _hundreds_, of CVEs filed on issues related to pickle and RCE.

Here is a small sample:

https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=pickle

Those are CVEs on other software that use pickle in insecure ways. Not on pickle itself.
In applications using pickle on untrusted data, that's a big distinction. There are a huge number of similar java and c# object serializationg bugs as well.
There aren't in C#. Neither Newtonsoft.JSON (by default) nor System.Text.Json (at all) allow uncontrolled deserialization. Pretty much no code ever defaulted to Newtonsoft's TypeNameHandling.Auto and community has always been aware of its dangers, espcially in light of the incidents like Log4J.

And BinaryFormatter has been long ago deprecated (and now it got completely removed, in the form of a breaking change, something that pretty much never happens otherwise), and even when it was in use (more than a decade ago, popularity-wise), the use of type binding was heavily encouraged.

C# is pretty hard-nosed about serialization.

E.g. My discovery the other day that out of the box C# System.Text.Json can't serialize System.Exception without writing a custom serializer [0] (since 2020, because .NET fix speed...). NewtonSoft handles it fine. (Had wanted a quick-and-dirty debugging dump of properties)

[0] https://github.com/dotnet/runtime/issues/43026

A serializer cannot make a reasonable assumption about how an exception should be serialized on the user's behalf, let alone deserialized. Newtonsoft had and still has quite a few problematic defaults where people can inadvertently weaken privacy and security of the implementation, which System.Text.Json is opinionated in solving.

If you are okay with risks that come from including exception's message in data sent over the network (e.g. not publicly exposed), then defining a custom converter is trivial (it's like 10-15 lines and adding it to serializer options), or you could simply .ToString/.Message it and include that in the payload instead. It's a minute thing.

As for exception deserialization, that's a gross feature misuse and not something that should be done.

I was thinking of BinaryFormatter and NetDataContractSerializer, etc. unsafe .NET object deserialization. I'm sure the default JSON serializer in C# is safe (lmao language fanboys)

https://github.com/pwntester/ysoserial.net

Yes but the fact that R was apparently able to fix the issue at all is a bit strange then. You can't "fix" pickle code execution.
Weird. I don't think I've ever relied on pickle for sharing data. It's too version specific. I always dump to json, or similar.
See it all the time.
CVE-2019-6446 seems to be in the right ballpark.