by quickthrower2 1004 days ago
Two of the things that make me cringe are mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

SOC2 type auditing should have been done here so I am surprised by the reach. Having the SAS with no expiry and then the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.

My view is burn all secrets. Burn all environment variables. I think most systems can work based on roles. Important humans access via username password and other factors.

If you are working in one cloud you don’t in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adaptors to convert them into RBAC too. But I am not a security expert, just paranoid lol.

4 comments

Many SOC2 audits are a joke. We were audited this year and were asked to provide screenshots of various categories (but most being of our own choosing in the end). Only requirement was screenshots needed to show date of the computer on which the screenshot had been taken, as if it couldn't be forged as well as the file/exif data.
If you forge your SOC2 evidence you will legitimately wish you were never born once caught
We aren't doing that. I just mention the laziness of the auditors and that asking for screenshots is just dumb. At this point you can just ask a simple question: do you comply or not?
Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.

In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
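For anyone wondering why pickle files get this reaction, here is a minimal sketch using only the standard library. It shows that unpickling can call an arbitrary callable, and that a subclass of `pickle.Unpickler` overriding `find_class` can refuse unknown globals. The names `Payload`, `SafeUnpickler`, `safe_loads` and the allow-list contents are made up for illustration, and an allow-list like this reduces risk but is not a full sandbox.

```python
import io
import pickle


class Payload:
    """Stand-in for a malicious object hidden inside a model file."""

    def __reduce__(self):
        # pickle will call eval("1 + 1") at load time; a real attacker
        # would put something far less harmless here.
        return (eval, ("1 + 1",))


class SafeUnpickler(pickle.Unpickler):
    # Hypothetical allow-list: only these globals may be resolved.
    ALLOWED = {("builtins", "list"), ("builtins", "dict")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Unpickle bytes while refusing any global not on the allow-list."""
    return SafeUnpickler(io.BytesIO(data)).load()


malicious = pickle.dumps(Payload())

# Plain pickle.loads runs the embedded call and returns its result.
result = pickle.loads(malicious)

# The restricted loader rejects the same bytes.
try:
    safe_loads(malicious)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Plain containers of primitives never go through `find_class`, so `safe_loads(pickle.dumps({"a": [1, 2]}))` still works, but most real ML objects would need a much longer allow-list, which is part of why this is hard to retrofit.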

You should check out safetensors. They are used widely in diffusion models and LLMs https://huggingface.co/blog/safetensors-security-audit
ONNX[0], model-as-protobufs, continuing to gain adoption will hopefully solve this issue.

[0] https://github.com/onnx/onnx

ONNX is cool, but it still only supports a minority of scikit-learn components. Some of them simply aren't compatible with ONNX's basic design.
at work we use the ONNX serialisation format for all of our prod models. Those get loaded by the ONNX runtime for inference. works great.

perhaps it'd be viable to add support for the ONNX format even for use cases like model checkpointing during training, etc.?

Absolutely, RBAC should be the default. I would also advocate separate storage accounts for public-facing data, so that any misconfiguration doesn't affect your sensitive data. Just typical "security in layers" thinking that apparently this department in MSFT didn't have.
So SAS tokens are worse than some admin setting up "FileDownloaderAccount" and then sharing its password with multiple users or using the same for different applications?

I'd take SAS tokens with expiration over people setting up a shared RBAC account and sharing the password for it.

Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks does not solve the issue, SAS tokens, while still not perfect, help quite a lot.
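To make the "with expiration" part concrete, here is a rough sketch of how a signed, expiring access token works in general, using only the standard library. This is not Azure's actual SAS format or API; `SECRET`, `make_token` and `verify` are hypothetical names, and a real signing key would live in a vault rather than in source.

```python
import hashlib
import hmac
import time

SECRET = b"demo-signing-key"  # hypothetical key, for illustration only


def sign(path: str, expires_at: int, key: bytes = SECRET) -> str:
    """HMAC over the resource path and expiry, so neither can be altered."""
    msg = f"{path}\n{expires_at}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def make_token(path: str, ttl_seconds: int, now=None) -> dict:
    """Issue a token for one path that stops working after ttl_seconds."""
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    return {"path": path, "exp": expires_at, "sig": sign(path, expires_at)}


def verify(token: dict, now=None) -> bool:
    """Accept only an untampered token that has not yet expired."""
    now = int(time.time()) if now is None else now
    expected = sign(token["path"], token["exp"])
    if not hmac.compare_digest(expected, token["sig"]):
        return False
    return now < token["exp"]


token = make_token("/models/a.bin", ttl_seconds=3600, now=1000)
```

The token is still a copy-pastable secret, which is the objection raised elsewhere in this thread, but the expiry and the single-path scope bound the damage if it leaks. A token with no expiry and a broad scope has neither limit.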

FileDownloaderAccount had no copy-pastable secret that can be leaked. Shared passwords are unnecessary of course and not good. If people are going to do that, just use OneDrive/Dropbox rather than letting people use advanced things.