From what I can tell, these RDS files are a common way of sharing data among R users. I would be relatively surprised if reading someone else's dataset was able to execute arbitrary code.
I think this is more like if reading a CSV via numpy could execute code.
RDS files are a common way of sharing serialized R objects. Promises are valid R objects and supported by this serialization format. They always have been and I believe it is an intentional feature. The problem is that some people may think of RDS files as more convenient CSV files, but they are not.
CSV is CSV. A serialized object is a serialized object. The main concern they cite is supply chain attacks. So it’s like saying loading a package can… load a package. Supply chain attacks will always be a thing. I’m grateful for the work of the researchers in question but don’t feel this does much to show that R itself is insecure.
I think the researchers didn’t identify the main vulnerability. They should have talked about the risk of remote code execution from reading serialized objects from untrusted sources, when the R programmer thinks they are reading data but they are actually running code. This mistake has led to huge numbers of remote code execution vulnerabilities in all sorts of object deserialization libraries; it’s a much more common threat than supply chain attacks.
It’s true that it’s always been that way, but there are other common but unsafe ways of doing things that people eventually stopped using. Some pressure to deprecate and migrate away from unsafe APIs seems good.
Save it in the usual text-based formats, like a CSV or JSON. Outside of packages, which use serialized data by default for good reasons, I haven't seen many people loading strangers' RDS or RData files.
If an attacker can control a package's rdb and rdx files, it's game over. They could just stick an `.onAttach` function in that does whatever they want when the package is loaded directly or imported by another package.
.pkl files were, are, and will remain a common way of sharing data among Python users, even though pickle has been known to be unsafe forever and nobody has claimed a CVE for that fact.
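For anyone who hasn't seen why unpickling counts as running code, here is a minimal sketch using only the standard library. The `Payload` class is a made-up name for illustration, and the expression it evaluates is deliberately harmless; a hostile file could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle form is a function call, not data."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" in the byte stream.
        # The call happens on the loading side, inside pickle.loads.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The reader thinks this is "loading data", but eval() runs here.
result = pickle.loads(blob)
print(result)        # 42
print(type(result))  # <class 'int'>, not Payload
```

Note that the loader gets back whatever the callable returned, so nothing about the result hints that code was executed along the way.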
A few years back I heard from a lot of people in ML communities who were surprised that `numpy.load` could execute arbitrary code.
> A few years back I heard from a lot of people in ML communities who were surprised that `numpy.load` could execute arbitrary code.
This is correct: before version 1.16.3 (April 2019), `numpy.load` was unsafe by default unless you explicitly passed `allow_pickle=False`. However, to be clear, that unsafe default was then fortunately changed. Loading numpy arrays with `numpy.load` should now be safe (unless there are yet-to-be-found bugs in that code).
In applications using pickle on untrusted data, that's a big distinction. There are a huge number of similar Java and C# object serialization bugs as well.
There aren't in C#. Neither Newtonsoft.JSON (by default) nor System.Text.Json (at all) allows uncontrolled deserialization. Pretty much no code ever defaulted to Newtonsoft's TypeNameHandling.Auto, and the community has always been aware of its dangers, especially in light of incidents like Log4J.
And BinaryFormatter was deprecated long ago (and has now been removed completely, as a breaking change, something that pretty much never happens otherwise), and even when it was in use (more than a decade ago, popularity-wise), the use of type binding was heavily encouraged.
E.g. my discovery the other day that out of the box C# System.Text.Json can't serialize System.Exception without writing a custom serializer [0] (since 2020, because .NET fix speed...). Newtonsoft handles it fine. (I had wanted a quick-and-dirty debugging dump of properties.)
I was thinking of unsafe .NET object deserialization: BinaryFormatter, NetDataContractSerializer, etc. I'm sure the default JSON serializer in C# is safe (lmao language fanboys)
I think a good response from the R authors should:
• Make clear the bug is due to unsafe deserialization (not serialization as their statement says). This is important because unsafe deserialization is a major source of remote code execution vulnerabilities.
• Update the documentation to make it clear that R’s serialization and deserialization functions are not safe to use for sharing data across the network. Serialized objects should be treated as code, not data.
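To make "treat serialized objects as code" concrete, here is a rough sketch in Python rather than R, following the allow-list pattern described in the Python `pickle` documentation. The names `SAFE_BUILTINS`, `RestrictedUnpickler`, and `restricted_loads` are illustrative, and the allow-list is only an assumption about what a given application might need.

```python
import builtins
import io
import pickle

# Assumed allow-list: the only globals this loader agrees to resolve.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream references a global (class or function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain data and allow-listed types still load.
print(restricted_loads(pickle.dumps([1, 2, range(3)])))

# A stream that references eval is rejected before anything is called.
try:
    restricted_loads(pickle.dumps(eval))
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

This narrows what a stream can reach but is not a guarantee of safety; a plain data format such as CSV or JSON is still the simpler choice for untrusted input.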
>and a blog post for bragging, thankfully they didn't do a name and a logo.
I am still amazed at how many people on HN seem to get worked up over vulnerability names. God forbid someone also slaps a piece of clip art or whatever on the blog post. Worse yet, if they buy a $5 domain... the horror!
Maybe it's just me, but I'd much rather remember "Heartbleed" over "CVE-2014-0160".
It's fine when your bugs are (unanimously) cool, be it Heartbleed, Meltdown, Spectre or Load Value Injection (this one even gets a hilarious video).
For less cool bugs a logo and a name seem rather... strange, because such bugs happen all the time and it's not clear why this one is special. Imagine a coworker fixed a random JIRA ticket, say "switching to night mode does not work on a certain page", and then gave it the name "Nightfall", a logo, a landing page, and a lot of bragging at the next periodic meeting.
> Imagine a coworker fixed a random JIRA ticket, say "switching to night mode does not work on a certain page", and then gave it the name "Nightfall", a logo, a landing page, and a lot of bragging at the next periodic meeting.