Hacker News new | ask | show | jobs
PyPI Blog: Project Quarantine (blog.pypi.org)
92 points by miketheman 527 days ago
9 comments

Its always an interesting dynamic: assuming a high trust society pays dividends - Python would be nowhere close the success it has been without PyPI.

But then success attracts trust abusers and forces raising the fences (which comes with higher costs, both direct and indirect).

Direct costs in the people and infrastructure that must be dedicated to the task. Indirect costs in the frictions generated by complicating workflows.

It all points to the need for open source ecosystems to be taken more seriously by the economically able users who most benefit from this amazing development.

They won't pay anything unless they are forced to do so. Basic capitalism brings to externalise costs to society
Perhaps, but can you explain how an alternative to capitalism wouldn’t result in people no paying for a service they don’t have to pay for?
People in more communally oriented societies pay for things they "don't have to" pay for because there's a social obligation.
In an alternative system you can get a salary from the government to work on open source software, and the companies pay for that in taxes. Of course you must embargo Malta, Netherlands and all the other countries that thrive on grabbing taxes from other countries.
> The one project cleared was a project containing obfuscated code, in violation of the PyPI Acceptable Use Policy.

Interesting, I didn’t know that. While I haven’t released anything obfuscated on PyPI, I’ve certainly written Python projects that include obfuscated code by necessity, namely scrapers packing duktape (embedded JS interpreter) and third party obfuscated JS blobs to generate signatures and stuff. I know for a fact there are projects like that on PyPI. I wonder if those are allowed.

(Come to think of it, those probably can be DMCAed if the targeted service provider is sufficiently motivated.)

They also allow binary packages if you want an easy way of hiding malware.
The still don't even have a way to avoid dependency confusion attacks when using private package repos (other than also registering every single private package name you use on pypi.org). Blows my mind.
Who is "they"? PyPI is an index; it doesn't control your installing client.

(This is a larger issue - or feature, depending on your perspective - with Python packaging. But it's important to understand that PyPI itself can't force `pip` or any other client to pick any particular resolution order between indices.)

For all intents and purposes "pip" is the official client. It is referenced in the official documentation https://docs.python.org/3/installing/index.html
The fact that pip is the official client isn’t in dispute. The point was that pip and PyPI are different entities, per a larger pattern of devolved ownership/control/standards-over-tools in Python packaging. PyPI has little to no say over how pip and other tools choose to handle resolutions across multiple indices.
The PSF has a saying in which is the default installer and how pypi is run.
They have a say insofar as they can participate in the same standards process as everyone else. But no, the PSF has no unique say in how PyPI is run, or how pip behaves. This is a pretty fundamental aspect of how Python-qua-ecosystem works.
PSF has little control over anything. The Python ecosystem is consensus-based.
> Who is "they"?

The PyPI and Pip developers of course.

Those are largely disjoint sets, and the post in question is about PyPI.
So? The issue requires coordination between Pip and PyPI. I don't see what point you're trying to make.
The issue does not require coordination; that's the point. It's a behavioral aspect of `pip` that's completely opaque to PyPI, because all PyPI does is serve index responses to installers. It doesn't know how many indices the installer contacts, or the order in which it contacts them (and it has no good reason to know those things, ever).
If you're concerned about dependency confusion attacks you should host your own index and vet what goes on to it.

But there is a better solution coming, PEP 708 was developed for this and is in prototype on pypi.org, so it's an overstatement to say "don't even have a way to avoid dependency confusion attacks ".

It is, however, a non-trivial problem, and more solutions will likely come over the years, many Python packaging tools like uv and poetry (and likely others) have way to name indexes and pin specific packages to indexes, which appears to be a promising UX.

Awesome work, kudos to the PyPI team. Will it be possible to receive notifications of projects quarantine as a member of the public?
Your comment also has me dreaming about a Dependabot-esque utility that opens Github issues on repositories that have quarantined projects in their requirements.txt.

Quarantining would prevent anyone from building / installing new copies of the compromised software, so this utility would only help people who were a) monitoring the project, and b) had a local version installed pre-quarantine. That's a pretty narrow scope of users, so now that I type all this out, I'm realizing that the juice is likely not worth the squeeze.

One of my responsibilities is software supply chain security in a financial services org, so this signal would be valuable for vulnerability management of dependencies. I wouldn't call it "threat hunting" per se, but ground truth around threat actor patterns helps us build better defensive systems in this regard. Keeping the bad bits out is way easier than remediating once they've been ingested into systems.

> Your comment also has me dreaming about a Dependabot-esque utility that opens Github issues on repositories that have quarantined projects in their requirements.txt.

It's not a bad idea, let Github know! Their security team is very good from my interactions with them.

That sounds quite daunting, Python and supply chain security are almost at odds with each other these days.

Lowkey surprised that any well-resourced org would use it given the outsized risk profile and poor performance.

It’s not used in the core or for anything load bearing, but has some ancillary uses, and we strive for total coverage (as much as practical). If we use something, we want to secure it as best we can.
Given how widespread PyPI usage is, I'm surprised they only have one full time security staff. I mean I guess it makes sense, usage doesn't always mean they get more donations/money, but damn.
companies that actually care about security have a more secure solution and don't allow devs to use pypi
You’d be surprised by the amount of companies handling critical infrastructure that are OK with using PyPI directly
He said companies that care, not companies that should care but do not.
really depends on the company. my company cares a lot about security because it's a huge fortune 50 company with sensitive data and a lot of reputation could be lost with a security scandal
That is somewhat terrifying
For example we have it behind a kind of transparent proxy, where you get only packages which were tested and scan by a team of experts.
Could you give some examples of more secure solutions?
jfrog is the one my company uses
How do you decide what externally available packages to store/cache in artifactory?

I’m curious, as I also deal with this tension. What (human and automated) processes do you have for the following scenarios?

1. Application developer wants to test (locally or in a development environment) and then use a net new third party package in their application at runtime.

2. Application developer wants to bump the version used of an existing application dependency.

3. Application developer wants to experiment with a large list of several third party dependencies in their application CI system (e.g. build tools) or a pre-production environment. The experimentation may or may not yield a smaller set of packages that they want to permanently incorporate into the application or CI system.

How, if at all, do you go about giving developers access via jfrog to the packages they need for those scenarios? Is it as simple as “you can pull anything you want, so long as X-ray scans it”, or is there some other process needed to get a package mirrored for developer use?

Where i am, every package repo - docker, pypi, rpm, deb, npm, and more - all go through artifactory and are scanned. Packages are autopulled into artifactory when a user requests the package and scanned by xray. Artifactory has a remote pull through process that downloads once from the remote, and then never again unless you nuke the content. Vulnerable packages must have exceptions made in order to get used. Sadly, we put the burden of allowances on the person requesting the package, but it at least makes them stop and think before they approve it. Granting access to new external repos is easy, and we make requesting them painfree, just making sure that we enable xray. Artifactory also supports local repos so users can upload their packages and pull them down later.
the fact that `pip install` just runs whatever is in `setup.py` is still mind baffling, even if the author weren't mallicious the `setup.py` can still do harm (say delete a file by mistake), there really needs to be an official way of sandbox its running.
Note that it's possible to disable that behavior with `pip install --only-binary :all:`.

This way, pip will fail if a dependency does not provide a `.whl` package, instead of automatically falling back to the "build from source" mode that can lead to arbitrary code execution at install time (via setuptools' `setup.py` or any other build backend mechanism).

However, installing from wheels just protects from arbitrary code execution at install time. If you do not trust the source and integrity of the package you install, you would still be subject to arbitrary code execution at import time.

Therefore, tools and processes to improve package provenance tracing and integrity checking are useful for both kinds of installations.

I think sometimes the problem is coming from accidental typos instead of not trusting, say if one accidentally typed `pip install requests` into `pip install requestss` and if `requestss` is malacious then by the time one noticed the typo the setup.py could have already run to do the harm
It's not good, but it should also not be baffling: it's the exact same thing other ecosystems do (npm with install hooks/scripts, Rust with build.rs, Ruby with gemspecs, etc).
I know other ecosystems do the same and those are baffling too, especially for the newer created languages like rust, which is why https://internals.rust-lang.org/t/pre-rfc-sandboxed-determin... exists
Sandboxing is a great idea. But the fact that this is a near-universal feature of language packaging reveals a preference that's going to be hard to counter: users do want effectively-arbitrary system access at build time, because that's the paradigm that's supported by the million-and-one different ways in which a build environment can be valid.
Notably also common lisp (quicklisp)
I don't think that makes much of a difference from the risk of bugs in the rest of the package when it's run.
https://socket.dev/ does a good job in detecting malicious packages in npm.

In their FAQ[1], they mention that they have plans to expand to PyPI as well.

[1]: https://docs.socket.dev/docs/faq

Quarantining projects is just a band-aid. If you’re worried about malware, maybe stop letting random people upload code to the official package index. Or just write better docs so people stop using random packages in the first place.
I see some comments about the lack of security of Pypi. And they are totally right, I’m also concerned. But to be fair, many other languages don’t fare better in that arena. I don’t want to give examples, but everyone knows horror histories with other languages.

Again, is not that because others are worse, is ok, but I would cut a little slack. Specially for the fact that having all packages somehow signed/audited would be a titanic task. And I guess I’m not willing to pay for it.