by pudo 1310 days ago
After reading this I’m fully confused by how they define dark matter. Stuff that doesn’t come from the distro package manager? Everything installed via other mechanisms? Assets copied into the container as part of the build mechanism?

Wouldn’t it make more sense to define dark matter as all the stuff that is installed in a container but never activated (unless exploited)?

5 comments

That's their explicit definition: "Software dark matter refers to files that are not tracked by operating system (OS) package managers (like `apt` or `apk`), which renders these files and the packages they represent invisible—or at least complicated to find—to software composition analysis and security scanning tools."

That seems to specifically exclude software installed by, say, language-specific package managers (Cargo, Rubygems, npm and derivatives) -- which on the whole seems pretty perverse. Dealing with those does indeed complicate SBOM maintenance -- but people use them anyway for very good reasons (which sometimes include getting more secure versions of the packaged code!), and having tools that work in the real world requires dealing with that complexity, not wishing it away.
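The quoted definition boils down to a set difference. Here is a minimal Python sketch of it, not the article's actual tooling. It assumes you already have two path lists: every file in the image, and every file the OS package manager claims (gathered, for example, from `dpkg -L` output). The function names and sample paths are hypothetical.

```python
def dark_matter(image_files, tracked_files):
    """Return the files present in the image that no OS package claims."""
    tracked = set(tracked_files)
    return sorted(path for path in set(image_files) if path not in tracked)


def dark_matter_share(image_files, tracked_files):
    """Fraction of files (by count) that the OS package manager does not track."""
    files = set(image_files)
    if not files:
        return 0.0
    return len(dark_matter(files, tracked_files)) / len(files)


# Hypothetical file listing, for illustration only.
image_files = ["/bin/sh", "/usr/lib/libssl.so", "/app/server", "/app/vendor/lib.js"]
tracked_files = ["/bin/sh", "/usr/lib/libssl.so"]

untracked = dark_matter(image_files, tracked_files)
share = dark_matter_share(image_files, tracked_files)
```

Under this definition, anything installed by npm, Cargo, or a plain copy lands in `untracked`, whether or not another tool could identify it.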

This is good to underline, but I found it confusing or contradictory as well.

Also because of how they write about containers. In a container, all files are tracked. That's the container.

Different meaning of "tracked." This is about static-analysis systems that seek to understand the "provenance" of the files that go into the container-image, so that they can alert you to vulnerabilities in the container's dependencies.

"Dark matter" here is anything these tools can't see / notice vulnerabilities in.

So any DB container would by definition have a massively high percentage, just because the DB app itself is a few tens of MB while the database data is in the tens of gigabytes?

Seems like a really useless metric for containers.

I can get it for OSes (some packages there do manage DB data, and even have an option to remove it when removing the package), but for a container it does seem a bit pointless.

No...? Again, we're talking about container images, not containers. Specifically, public container images sitting in registries like Docker Hub. People aren't burning their Postgres data into a container image and then pushing it, public-readable, to an image registry.

(But also, even ignoring that, I believe the metric used by the article is number-of-files, not byte-size. A DB might be large in byte-size, but is usually relatively negligible in number-of-files, usually holding individual table chunk files of 1GB or larger.)
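To illustrate why the choice of metric matters, here is a small Python sketch that computes the untracked share both ways. The numbers are made up for the example (1,000 small tracked files plus 10 untracked 1 GiB data files), and `untracked_share` is a hypothetical helper, not something from the article.

```python
def untracked_share(entries):
    """entries: iterable of (size_bytes, is_tracked) pairs.

    Returns (share_by_file_count, share_by_bytes) of untracked files.
    """
    entries = list(entries)
    total_count = len(entries)
    total_bytes = sum(size for size, _ in entries)
    dark_count = sum(1 for _, tracked in entries if not tracked)
    dark_bytes = sum(size for size, tracked in entries if not tracked)
    by_count = dark_count / total_count if total_count else 0.0
    by_bytes = dark_bytes / total_bytes if total_bytes else 0.0
    return by_count, by_bytes


# Made-up image: 1,000 tracked files of 10 kB each, 10 untracked files of 1 GiB each.
entries = [(10_000, True)] * 1000 + [(2**30, False)] * 10
by_count, by_bytes = untracked_share(entries)
```

With these inputs the untracked share is under 1% by file count but over 99% by byte size, which is the gap between the two readings of the metric.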

As the container is the result of a build process, unless the tools are the build tools themselves, the whole container should be treated as dark matter and just rebuilt. It's a process, not state.

It's the build process for the container-image (i.e. the Dockerfile or equivalent) that the tooling being discussed here is analyzing; not the resultant container image, nor containers spawned from said image.

The goal is, presumably, to figure out when a given docker image was created in such a way that it burns in a vulnerable version of some library; so that the author can be alerted that they need to (update their Dockerfile and) rebuild their image.

"Dark matter", under this definition, is anything that gets injected during the build process of the image, that is not itself traceable to some other versioned package management system with vulnerable-version deprecation. Without such information, an automated agent like the one described in the article cannot then propagate deprecations from consumed package-versions to produced image-tags.

A good example of such "dark matter" would be a static binary built outside the Dockerfile using a CI system, where the CI then creates a docker image by running a Dockerfile that simply injects the expected prebuilt binary into an image with an ADD stanza. Does that binary contain vulnerable versions of embedded static libraries? Who knows?
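As a rough illustration of spotting that pattern, here is a Python sketch that lists the sources of ADD and COPY instructions in a Dockerfile, since those are where prebuilt artifacts typically enter an image. It is a naive heuristic under stated assumptions: it only handles the space-separated shell form on a single line, and ignores line continuations and the JSON array form. The helper name and the sample Dockerfile are hypothetical.

```python
def injected_sources(dockerfile_text):
    """Return (instruction, source) pairs for ADD/COPY lines in shell form."""
    found = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        instruction = parts[0].upper()  # Dockerfile instructions are case-insensitive
        if instruction not in ("ADD", "COPY"):
            continue
        # Drop flags such as --chown=...; the last remaining argument is the destination.
        args = [p for p in parts[1:] if not p.startswith("--")]
        found.extend((instruction, src) for src in args[:-1])
    return found


# Hypothetical Dockerfile that injects a binary built elsewhere by CI.
sample = """\
FROM scratch
# binary produced outside this Dockerfile
ADD build/server /usr/local/bin/server
"""

sources = injected_sources(sample)
```

This only tells you that something was injected. It says nothing about which library versions were statically linked into `build/server`, which is exactly the blind spot described above.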

Not sure it is that easy. The Docker API provides introspection for those as well. Also, nothing becomes "light matter" just because the example project no longer uses an ADD stanza and instead takes the Dockerfile context from a tarball that the project creates as a reproducible build artefact.

This is basically the definition we used. It's practically important because scanners really do miss software copied in via other mechanisms, and most of them give zero indication about it. For a few basic examples, try running your favorite scanner on the wordpress, node, or busybox images on DockerHub and see what the scanner finds.

For Wordpress, most scanners will miss that PHP or Wordpress are even installed in the image. The scanners spit out lots of data, but it's only about what they can find, offering the illusion of completeness or transparency.

Well then I guess scanners need to improve... I mean, the current version of Wordpress (and other software) is being made available as a Docker image because this is faster and more convenient than making it available via the package system, so it kinda makes sense that they are not available (or available much later) via apt/apk/whatever. Calling all other methods of distribution (pulling software from Github or via the various language-specific package managers) "dark matter" expresses a desire not to deal with that stuff, but surely won't make the "problem" go away.

I guess the point is you could have an open source program in the package manager that then downloads a closed source binary blob component, which could be doing something undesirable.

I have the exact same confusions and questions as you. I think maybe they consider "dark matter" to be anything for which the source is not publicly available and so cannot be analyzed by security tools that don't have access to the private sources.

I also agree with your "wouldn't it make more sense" definition. From the perspective of a developer concerned about the security and robustness of their own deployment, "dark matter" would be anything that ends up in my container that I don't actually need to run the app in the container.

I also had trouble learning more about that. For me, the article creates more confusion than it offers theoretical or practical value.