It's always an interesting dynamic: assuming a high-trust society pays dividends - Python would be nowhere close to the success it has been without PyPI.
But then success attracts trust abusers and forces raising the fences (which comes with higher costs, both direct and indirect).
Direct costs in the people and infrastructure that must be dedicated to the task. Indirect costs in the frictions generated by complicating workflows.
It all points to the need for open source ecosystems to be taken more seriously by the economically able users who most benefit from this amazing development.
In an alternative system you can get a salary from the government to work on open source software, and the companies pay for that in taxes. Of course you must embargo Malta, the Netherlands and all the other countries that thrive on grabbing taxes from other countries.
> The one project cleared was a project containing obfuscated code, in violation of the PyPI Acceptable Use Policy.
Interesting, I didn’t know that. While I haven’t released anything obfuscated on PyPI, I’ve certainly written Python projects that include obfuscated code by necessity, namely scrapers packing duktape (embedded JS interpreter) and third party obfuscated JS blobs to generate signatures and stuff. I know for a fact there are projects like that on PyPI. I wonder if those are allowed.
(Come to think of it, those probably can be DMCAed if the targeted service provider is sufficiently motivated.)
They still don't even have a way to avoid dependency confusion attacks when using private package repos (other than also registering every single private package name you use on pypi.org). Blows my mind.
Who is "they"? PyPI is an index; it doesn't control your installing client.
(This is a larger issue - or feature, depending on your perspective - with Python packaging. But it's important to understand that PyPI itself can't force `pip` or any other client to pick any particular resolution order between indices.)
The fact that pip is the official client isn’t in dispute. The point was that pip and PyPI are different entities, per a larger pattern of devolved ownership/control/standards-over-tools in Python packaging. PyPI has little to no say over how pip and other tools choose to handle resolutions across multiple indices.
They have a say insofar as they can participate in the same standards process as everyone else. But no, the PSF has no unique say in how PyPI is run, or how pip behaves. This is a pretty fundamental aspect of how Python-qua-ecosystem works.
The issue does not require coordination; that's the point. It's a behavioral aspect of `pip` that's completely opaque to PyPI, because all PyPI does is serve index responses to installers. It doesn't know how many indices the installer contacts, or the order in which it contacts them (and it has no good reason to know those things, ever).
If you're concerned about dependency confusion attacks you should host your own index and vet what goes on to it.
But there is a better solution coming: PEP 708 was developed for this and is in prototype on pypi.org, so it's an overstatement to say "don't even have a way to avoid dependency confusion attacks".
It is, however, a non-trivial problem, and more solutions will likely come over the years. Many Python packaging tools like uv and poetry (and likely others) have ways to name indexes and pin specific packages to indexes, which appears to be a promising UX.
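In the meantime, one cheap audit is to check whether any of your private package names also exist on a public index. Here is a minimal sketch, assuming you already have both name lists in hand (fetching the public list is left out, and `find_name_collisions` is a hypothetical helper, not part of any tool). Names are compared after PEP 503 normalization, since installers treat `My_Pkg` and `my-pkg` as the same project. Any hit is worth a look: either you registered the name defensively or someone else got there first.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_name_collisions(private_names, public_names):
    """Return private names whose normalized form also appears publicly."""
    public = {normalize(n) for n in public_names}
    return sorted(n for n in private_names if normalize(n) in public)
```

For example, `find_name_collisions(["Acme_Utils", "acme-billing"], ["acme.utils", "requests"])` flags `Acme_Utils` only.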
Your comment also has me dreaming about a Dependabot-esque utility that opens Github issues on repositories that have quarantined projects in their requirements.txt.
Quarantining would prevent anyone from building / installing new copies of the compromised software, so this utility would only help people who were a) monitoring the project, and b) had a local version installed pre-quarantine. That's a pretty narrow scope of users, so now that I type all this out, I'm realizing that the juice is likely not worth the squeeze.
One of my responsibilities is software supply chain security in a financial services org, so this signal would be valuable for vulnerability management of dependencies. I wouldn't call it "threat hunting" per se, but ground truth around threat actor patterns helps us build better defensive systems in this regard. Keeping the bad bits out is way easier than remediating once they've been ingested into systems.
> Your comment also has me dreaming about a Dependabot-esque utility that opens Github issues on repositories that have quarantined projects in their requirements.txt.
It's not a bad idea, let Github know! Their security team is very good from my interactions with them.
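For what it's worth, the matching half of such a utility is small. A rough sketch, assuming you can get a set of quarantined project names from somewhere (that source is an assumption, as is the helper name `quarantined_requirements`). It only handles plain `name==version` style lines and skips options like `-r other.txt`; a real tool would want a proper requirements parser.

```python
import re

# Matches the leading project name of a requirement line, e.g. "requests>=2.0".
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def quarantined_requirements(requirements_text, quarantined):
    """Return (line_number, name) for each requirement naming a quarantined project."""
    flagged = {normalize(n) for n in quarantined}
    hits = []
    for lineno, line in enumerate(requirements_text.splitlines(), start=1):
        line = line.split("#", 1)[0]  # drop trailing comments
        if not line.strip() or line.lstrip().startswith("-"):
            continue  # skip blank lines and options such as "-r other.txt"
        match = _NAME_RE.match(line)
        if match and normalize(match.group(1)) in flagged:
            hits.append((lineno, match.group(1)))
    return hits
```

Opening the issue itself would be a separate step on top of the returned line numbers.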
It’s not used in the core or for anything load bearing, but has some ancillary uses, and we strive for total coverage (as much as practical). If we use something, we want to secure it as best we can.
Given how widespread PyPI usage is, I'm surprised they only have one full time security staff. I mean I guess it makes sense, usage doesn't always mean they get more donations/money, but damn.
Really depends on the company. My company cares a lot about security because it's a huge Fortune 50 company with sensitive data, and a lot of reputation could be lost in a security scandal.
How do you decide what externally available packages to store/cache in artifactory?
I’m curious, as I also deal with this tension. What (human and automated) processes do you have for the following scenarios?
1. Application developer wants to test (locally or in a development environment) and then use a net new third party package in their application at runtime.
2. Application developer wants to bump the version used of an existing application dependency.
3. Application developer wants to experiment with a large list of several third party dependencies in their application CI system (e.g. build tools) or a pre-production environment. The experimentation may or may not yield a smaller set of packages that they want to permanently incorporate into the application or CI system.
How, if at all, do you go about giving developers access via jfrog to the packages they need for those scenarios? Is it as simple as “you can pull anything you want, so long as X-ray scans it”, or is there some other process needed to get a package mirrored for developer use?
Where I am, every package repo - docker, pypi, rpm, deb, npm, and more - goes through Artifactory and is scanned. Packages are auto-pulled into Artifactory when a user requests them and are scanned by Xray. Artifactory has a remote pull-through process that downloads once from the remote, and then never again unless you nuke the content. Vulnerable packages must have exceptions made in order to get used. Sadly, we put the burden of allowances on the person requesting the package, but it at least makes them stop and think before they approve it. Granting access to new external repos is easy, and we make requesting them pain-free, just making sure that we enable Xray. Artifactory also supports local repos so users can upload their packages and pull them down later.
The fact that `pip install` just runs whatever is in `setup.py` is still baffling. Even if the author isn't malicious, the `setup.py` can still do harm (say, delete a file by mistake). There really needs to be an official way of sandboxing its execution.
Note that it's possible to disable that behavior with `pip install --only-binary :all:`.
This way, pip will fail if a dependency does not provide a `.whl` package, instead of automatically falling back to the "build from source" mode that can lead to arbitrary code execution at install time (via setuptools' `setup.py` or any other build backend mechanism).
However, installing from wheels just protects from arbitrary code execution at install time. If you do not trust the source and integrity of the package you install, you would still be subject to arbitrary code execution at import time.
Therefore, tools and processes to improve package provenance tracing and integrity checking are useful for both kinds of installations.
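If you drive installs from a script, the wheel-only mode can be baked in so nobody forgets the flag. A small sketch, assuming pip is available to the current interpreter; the function names are made up for illustration. It only covers the install-time half: import-time execution of an untrusted wheel is still possible, as noted above.

```python
import subprocess
import sys


def wheel_only_install_command(requirements):
    """Build a pip command that refuses to fall back to source distributions."""
    return [
        sys.executable, "-m", "pip", "install",
        "--only-binary", ":all:",
        *requirements,
    ]


def install_wheels_only(requirements):
    """Run the install; raises CalledProcessError if pip exits non-zero."""
    subprocess.run(wheel_only_install_command(requirements), check=True)
```

With this, `install_wheels_only(["requests"])` fails outright if any dependency lacks a `.whl`, rather than building it from source.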
I think sometimes the problem comes from accidental typos rather than misplaced trust: say one accidentally types `pip install requestss` instead of `pip install requests`. If `requestss` is malicious, then by the time one notices the typo, its `setup.py` could have already run and done the harm.
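A wrapper could at least warn before installing a name that looks like a near miss of something you already depend on. A minimal sketch using the standard library's `difflib`; the helper name, the cutoff of 0.85, and the idea of keeping a list of known-good names are all assumptions for illustration, not something pip does today.

```python
import difflib


def possible_typo(requested, known_names, cutoff=0.85):
    """Return the known name that `requested` closely resembles.

    Returns None when the name is an exact match or resembles nothing known.
    """
    if requested in known_names:
        return None
    matches = difflib.get_close_matches(requested, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Here `possible_typo("requestss", ["requests", "numpy", "flask"])` returns `"requests"`, which a wrapper could turn into a "did you mean?" prompt before anything gets downloaded.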
It's not good, but it should also not be baffling: it's the exact same thing other ecosystems do (npm with install hooks/scripts, Rust with build.rs, Ruby with gemspecs, etc).
Sandboxing is a great idea. But the fact that this is a near-universal feature of language packaging reveals a preference that's going to be hard to counter: users do want effectively-arbitrary system access at build time, because that's the paradigm that's supported by the million-and-one different ways in which a build environment can be valid.
Quarantining projects is just a band-aid. If you’re worried about malware, maybe stop letting random people upload code to the official package index. Or just write better docs so people stop using random packages in the first place.
I see some comments about the lack of security of PyPI. And they are totally right; I’m also concerned. But to be fair, many other languages don’t fare better in that arena. I don’t want to give examples, but everyone knows horror stories from other languages.
Again, it’s not that it’s OK because others are worse, but I would cut them a little slack, especially given that having all packages somehow signed/audited would be a titanic task. And I guess I’m not willing to pay for it.