I scanned 2,500 Hugging Face models for malware/issues. Here is the data

24 points by arseniibr 146 days ago

Hi HN,

I built a CLI tool called Veritensor for scanning AI models, because I found out that downloading model weights from 3rd party websites and loading them with torch.load() can lead to RCE. At the same time, simple regex scanners are easy to bypass.
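For context on why loading untrusted pickle data is risky: unpickling calls whatever callable an object's `__reduce__` method names. Below is a minimal, harmless sketch of that mechanism. The `Payload` class is hypothetical, and `eval` on a constant expression stands in for whatever a real attack would run.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a malicious object embedded in a model file."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the recorded call and returns its result,
# not a Payload instance.
result = pickle.loads(blob)
```

Here `result` holds the value of the evaluated expression. A hostile file could name `os.system` in the same position.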

To test my tool, I ran it against 2500 new and trending models on Hugging Face.

Here is what I found: 86 failed models.

Broken files: 16 models. These were actually Git LFS text pointers (several hundred bytes), not binaries. If you try to load them, your code crashes.

Hidden Licenses: 5 models. I found models with Non-Commercial licenses hidden inside the .safetensors headers, even if the repo looked open source.

Shadow Dependencies: 49 models. Many models tried to import libraries I didn't have (like ultralytics or deepspeed). My tool blocked them because I use a strict allowlist of libraries.

Suspicious Code: 11 files. These used STACK_GLOBAL to build function names dynamically. This is a common way for RCE malware to hide, though in my case it was mostly old numpy files.

Scan Errors: 5 models. These failed because of missing local dependencies (like h5py for old Keras files).
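The Git LFS pointer case can be told apart from a real binary with a cheap check. This is a rough sketch, not Veritensor's actual code: the function name `parse_lfs_pointer` is made up for illustration, and the 1024-byte cap is my reading of the Git LFS pointer spec.

```python
from typing import Dict, Optional

# First line of every Git LFS pointer file.
LFS_VERSION_LINE = b"version https://git-lfs.github.com/spec/v1"

# Assumption: pointer files are small text files, under 1024 bytes.
MAX_POINTER_SIZE = 1024


def parse_lfs_pointer(data: bytes) -> Optional[Dict[str, str]]:
    """Return the pointer's key/value fields, or None if data is not a pointer."""
    if len(data) >= MAX_POINTER_SIZE or not data.startswith(LFS_VERSION_LINE):
        return None
    fields: Dict[str, str] = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        # Each line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value
    return fields
```

A file that parses as a pointer has not been fetched yet. Its `oid` and `size` fields describe the blob that `git lfs pull` would download.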

I was able to detect some threats because under the hood, Veritensor works differently from common regex scanners. Instead of searching for suspicious text, it simulates how Pickle loads data, which helps it find hidden payloads without running any code. It also checks that the model file is real by hashing it and comparing it with the version from Hugging Face, so fake or changed models can be detected. Veritensor also looks at model metadata in formats like Safetensors and GGUF to spot license restrictions. If everything looks safe, it can sign the container using Sigstore Cosign.
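To illustrate inspecting a pickle without running it, here is a simplified sketch built on the standard library's `pickletools.genops`. It is not Veritensor's implementation. The names `find_imports`, `disallowed` and `ALLOWED_MODULES` are hypothetical, and the STACK_GLOBAL handling is a heuristic: it assumes the module and attribute names are the two most recently pushed strings, and it does not follow names fetched from the memo.

```python
import pickletools
from typing import List, Tuple

# Hypothetical allowlist of top-level modules a model file may import.
ALLOWED_MODULES = {"collections", "numpy", "torch"}

# Opcodes that push a string onto the pickle stack.
STRING_OPCODES = {
    "UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8",
    "STRING", "BINSTRING", "SHORT_BINSTRING",
}


def find_imports(data: bytes) -> List[Tuple[str, str]]:
    """List (module, name) pairs the pickle would import, without loading it."""
    imports: List[Tuple[str, str]] = []
    recent_strings: List[str] = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("GLOBAL", "INST"):
            # pickletools reports the argument as "module name".
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Heuristic: module and name are the last two strings pushed.
            if len(recent_strings) >= 2:
                imports.append((recent_strings[-2], recent_strings[-1]))
            else:
                imports.append(("<unknown>", "<unknown>"))
        elif opcode.name in STRING_OPCODES:
            recent_strings.append(arg)
    return imports


def disallowed(imports: List[Tuple[str, str]],
               allowed=ALLOWED_MODULES) -> List[Tuple[str, str]]:
    """Keep only imports whose top-level module is not on the allowlist."""
    return [(m, n) for m, n in imports if m.split(".")[0] not in allowed]
```

Feeding it the bytes of a pickle and passing the result to `disallowed` gives the imports that a strict allowlist would block, such as a reference to `builtins.eval`.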

It supports PyTorch, Keras, and GGUF. Free to use — Apache 2.0.

Repo: https://github.com/ArseniiBrazhnyk/Veritensor

Data of the scan [CSV/JSON]: https://drive.google.com/drive/folders/1G-Bq063zk8szx9fAQ3NN...

PyPI: pip install veritensor

Let me know if you have any feedback, whether you have ever faced similar threats, and whether this tool could be useful for you.

4 comments

embedding-shape 142 days ago

> Broken files — 16 models were actually Git LFS text pointers (several hundred bytes), not binaries. If you try to load them, your code crashes.

Yeah, if you don't know how to use the repositories, they might look broken :) Pointers are fine: the blobs are downloaded after you fetch the git repository itself, and then it's perfectly loadable. Seems like a really basic thing to misunderstand, given the context.

Please, understand how things typically work in the ecosystem before claiming something is broken.

That whatever LLM you used couldn't import some specific libraries also doesn't mean the repository itself has issues.

I think you need to go back to the drawing board here, fully understand how things work, before you set out to analyze what's "broken".