So now instead of sending the data to OpenAI we send it to Cape?
I know that you "promise to keep it secure", but I can only trust you, right? Something like this should IMO be done on-premise
Yes, this has been my #1 issue with all the VC-backed startup dollars flowing lately. They are all 100% reliant on OpenAI and are just shuttling private information and pretending OpenAI's terms are good enough protection.
So far, most of the ones we have spoken to are literally SHOCKED that we require SOC3 (one company even told me they'd never even heard of SOC3) and that everything needs to be hashed before it goes out and mapped back to the actual values on our end. They think we're being too cautious and are really trying to get to a sale without understanding that it's literally NOT something we can do, and NO ONE else should be doing it either.
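For anyone wondering what "hash before it goes out, map back on our end" can look like, here is a minimal sketch, not a description of any vendor's product. Everything in it is an assumption made for illustration: the names `pseudonymize` and `restore`, the `<PII_...>` token format, and the email-only regex. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, since plain hashes of low-entropy values such as emails can be brute-forced; the key and the mapping never leave your side.

```python
import hashlib
import hmac
import re

# Illustrative only: a real deployment would detect far more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text: str, key: bytes, mapping: dict) -> str:
    """Replace each email with a keyed-hash token; record token -> original locally."""

    def _swap(match: re.Match) -> str:
        original = match.group(0)
        digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        token = f"<PII_{digest}>"  # hypothetical token format
        mapping[token] = original  # stays on our side, never sent out
        return token

    return EMAIL_RE.sub(_swap, text)


def restore(text: str, mapping: dict) -> str:
    """Map tokens in a response back to the actual values."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


if __name__ == "__main__":
    local_mapping = {}
    outbound = pseudonymize("Contact alice@example.com today.", b"local-secret-key", local_mapping)
    print(outbound)                          # what a third party would see
    print(restore(outbound, local_mapping))  # what we see after mapping back
```

The same input always yields the same token under a given key, so the third party can still tell that two mentions refer to the same entity without learning who it is.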
Good points. I think the rabbit hole of OpenAI sub-processors is not commonly understood.
The humans at TaskUS are moderating prompts, and then you have Azure, CloudFlare, and Snowflake as sub-processors, each with their own list of sub-processors and on and on.
Yep! The more you can do locally the better. An entirely local LLM is the best for data privacy and security. Any time data leaves it poses some risk.
The de-identification itself requires a complex language model, which has its own complexity and costs to operate. At Cape we're going as far as we can to offer a secure API that's self-serve and easy to use, to make these features accessible to developers, but it does require trust in Cape and the underlying AWS Nitro Enclaves that we use. Client-side attestation is a security feature that can help give the client cryptographic verification of the secure enclave. But local is always better when possible!
I will add that running your own private LLM is complicated and costly, and that a private LLM (at this point) will not be as capable as GPT-4. So while running a private LLM will certainly be the right solution for some, Cape's offering makes improved privacy available to many.
I want fewer parties involved with secure data, not more. This should be an on-prem solution with no external network access and no direct calls to OpenAI. A call is made to this service to obfuscate, then another call to OpenAI, all managed by a coordinating mechanism that is open source / trusted.
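A rough sketch of that coordinating mechanism, under stated assumptions: the redaction service and the model client are passed in as plain functions, so the coordinator itself holds no network code and can be audited on its own. `run_redacted`, `toy_redact`, and the `[NAME_0]` placeholder format are hypothetical names invented here, not Cape's or OpenAI's API.

```python
from typing import Callable, Dict, List, Tuple

# A redactor returns the obfuscated prompt plus a placeholder -> original mapping.
Redactor = Callable[[str], Tuple[str, Dict[str, str]]]
# A model call takes the obfuscated prompt and returns the model's reply.
ModelCall = Callable[[str], str]


def run_redacted(prompt: str, redact: Redactor, call_model: ModelCall) -> str:
    """Obfuscate locally, send only the obfuscated prompt out, then restore the reply."""
    safe_prompt, mapping = redact(prompt)  # step 1: on-prem obfuscation
    reply = call_model(safe_prompt)        # step 2: the only outbound call
    for placeholder, original in mapping.items():
        reply = reply.replace(placeholder, original)  # step 3: map back locally
    return reply


def toy_redact(prompt: str, secrets: Tuple[str, ...] = ("Jane Doe",)) -> Tuple[str, Dict[str, str]]:
    """Stand-in redactor that swaps a fixed list of known strings for placeholders."""
    mapping: Dict[str, str] = {}
    for i, secret in enumerate(secrets):
        if secret in prompt:
            placeholder = f"[NAME_{i}]"
            prompt = prompt.replace(secret, placeholder)
            mapping[placeholder] = secret
    return prompt, mapping


if __name__ == "__main__":
    seen_by_model: List[str] = []

    def fake_model(safe_prompt: str) -> str:
        # Stand-in for the external LLM; records what actually left the building.
        seen_by_model.append(safe_prompt)
        return "Noted: " + safe_prompt

    print(run_redacted("Summarize the file on Jane Doe.", toy_redact, fake_model))
    print(seen_by_model)
```

Because both dependencies are injected, the same coordinator works whether the redactor is a local model or a self-hosted service, and the outbound call is the single place to enforce network policy.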
Better yet, maybe LLMs should be required to have weights released considering they are trained on the collective of human knowledge. Seems strange to use a significant sum of human knowledge that is publicly available then deny everyone access to the weights.
They are using AWS Nitro instances for their enclaves. These can absolutely be run on-prem with self-hosted licensed software to perform the computational redacting.