Show HN: Cape API – Keep your sensitive data private while using GPT-4 (capeprivacy.com)
29 points by gavinuhma 1085 days ago
We’ve built the Cape API so developers can keep sensitive data private while prompting LLMs like GPT-4 and GPT 3.5 Turbo.

With Cape, you can easily de-identify sensitive data before sending it to OpenAI. In addition, you can create embeddings from sensitive text and documents and perform vector searches to improve your prompt context all while keeping the data confidential.

Developers are using Cape with data like financial statements, legal contracts, and internal/proprietary knowledge that would otherwise be too sensitive to process with the ChatGPT API.

You can try CapeChat, our playground for the API at https://chat.capeprivacy.com/

The Cape API is self-serve, and has a free tier. The main features of the API are:

De-identification — Redacts sensitive data like PII, PCI, and PHI from your text and documents.

Re-identification — Reverts de-identified data back to the original form.

Upload documents — Converts sensitive documents to embeddings (supports PDF, Excel, Word, CSV, TXT, PowerPoint, and Markdown).

Vector Search — Performs a vector search on your embeddings to augment your prompts with context.

To do all this, we work with a number of privacy and security techniques.

First of all, we process data within a secure enclave, which is an isolated VM with in-memory encryption. The data remains confidential. No human, including our team at Cape or the underlying cloud provider, can see the data.

Secondly, within the secure enclave, Cape de-identifies your data by removing PII, PCI, and PHI before it is processed by OpenAI. As GPT-4 generates and streams back the response tokens, we re-identify the data so it becomes readable again.
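The de-identify and re-identify round trip described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names `PATTERNS`, `deidentify`, and `reidentify` are hypothetical and are not part of the Cape API, and the regular expressions stand in for the language-model-based detection that Cape says it uses.

```python
import re

# Assumption: toy regex detectors for illustration only. The real service is
# described as model-based, and these patterns are far from exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CREDIT_CARD_NUMBER": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def deidentify(text):
    """Replace sensitive values with numbered placeholders.

    Returns the redacted text and a placeholder -> original mapping.
    The same value always receives the same placeholder.
    """
    mapping = {}   # placeholder -> original value
    seen = {}      # (label, original value) -> placeholder
    counters = {}  # label -> number of distinct values seen so far

    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            key = (label, value)
            if key not in seen:
                counters[label] = counters.get(label, 0) + 1
                placeholder = f"[{label}_{counters[label]}]"
                seen[key] = placeholder
                mapping[placeholder] = value
            return seen[key]

        text = pattern.sub(substitute, text)
    return text, mapping


def reidentify(text, mapping):
    """Swap placeholders back to the original values."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


original = "Contact alice@example.com about card 4111 1111 1111 1111, then cc alice@example.com again"
redacted, mapping = deidentify(original)
print(redacted)
print(reidentify(redacted, mapping) == original)
```

Only the redacted text would be sent to the model; the mapping stays on the trusted side and is applied to the response as it comes back.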

In addition to de-identification, Cape also has API endpoints for embeddings, vector search, and document uploads, which all operate entirely within the secure enclave (no external calls and no sub-processors).

Why did we build this?

Developers asked us for help! We've been working at the intersection of privacy and AI since 2017, and with the explosion of interest in LLMs we've had a lot of questions from developers.

Privacy and security remain one of the biggest barriers to adopting AI like LLMs, particularly for sensitive data.

We’ve spoken with many companies that have been experimenting with ChatGPT or the GPT-4 API, and they are extremely excited about the potential. However, they find that taking an LLM-powered feature from PoC to production is a major lift, and it’s uncharted territory for many teams. Developers have questions like:

- How do we ensure the privacy of our customers’ data if we’re sending it to OpenAI?

- How can we securely feed large bodies of internal, proprietary data into GPT-4?

- How can we mitigate hallucinations and bias so that we have higher trust in AI generated text?

The features of the Cape API are designed to help solve these problems for developers, and we have a number of early customers using the API in production already.

To get started, check out our docs: https://docs.capeprivacy.com/

View the API reference: https://api.capeprivacy.com/v1/redoc

Join the discussion on our Discord: https://discord.gg/nQW7YxUYjh

And of course try the CapeChat playground at https://chat.capeprivacy.com/

8 comments

So now instead of sending the data to OpenAI we send it to Cape? I know that you "promise to keep it secure", but I can only trust you, right? Something like this should IMO be done on-premise
Yes, this has been my #1 issue with all the VC-backed startup dollars flowing lately. They are all 100% reliant on OpenAI and are just shuttling private information and pretending OpenAI's terms are good enough protection.

So far, most we have spoken to are literally SHOCKED that we require SOC3 (one company even told me they'd never even heard of SOC3) and that everything needs to be hashed before it goes out and mapped back to the actual values on our end. They think we're being too cautious and are really trying to get to a sale without understanding that it's literally NOT something we can do and NO ONE else should be doing it either.
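For readers wondering what "hashed before it goes out and mapped back on our end" might look like, here is a minimal sketch. It is an assumption about the general technique, not a description of this commenter's system: `pseudonymize` and `TokenVault` are hypothetical names, and a keyed HMAC is used because a plain unkeyed hash of low-entropy values (names, phone numbers) can be brute-forced.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Derive a stable, opaque token from a value using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:16]


class TokenVault:
    """Keeps the token -> original lookup on our side; only tokens leave the building."""

    def __init__(self, secret_key):
        self._secret_key = secret_key
        self._lookup = {}

    def tokenize(self, value):
        token = pseudonymize(value, self._secret_key)
        self._lookup[token] = value
        return token

    def resolve(self, token):
        return self._lookup[token]


vault = TokenVault(b"replace-with-a-real-secret")
token = vault.tokenize("Jane Doe")
print(token)                 # opaque token, safe to send out
print(vault.resolve(token))  # mapped back to the actual value locally
```

Because the token is deterministic for a given key, the same customer maps to the same token across requests, which keeps joins and lookups working on the vendor side without exposing the value.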

Good points. I think the rabbit hole of OpenAI sub-processors is not commonly understood.

The humans at TaskUS are moderating prompts, and then you have Azure, CloudFlare, and Snowflake as sub-processors, each with their own list of sub-processors and on and on.

https://platform.openai.com/subprocessors

Data breaches can happen, so any data that you throw over the wall to OpenAI you must be willing to accept that it could become public.

Yep! The more you can do locally the better. An entirely local LLM is the best for data privacy and security. Any time data leaves it poses some risk.

The de-identification itself requires a complex language model, which has its own complexity and costs to operate. At Cape we're going as far as we can to offer a secure API that's self-serve and easy to use to make these features accessible to developers, but it does require trust in Cape and the underlying AWS Nitro Enclaves that we use. Client-side attestation is a security feature that can give the client cryptographic verification of the secure enclave. But local is always better when possible!

I will add that running your own private LLM is complicated and costly; and that private LLM (at this point) will not be as capable as GPT-4. So while running a private LLM will certainly be the right solution for some, Cape's offering makes improved privacy available to many.
Right...

I want fewer parties involved with secure data, not more. This should be an on-prem solution with no external network access and no direct calls to OpenAI. A call is made to this service to obfuscate, then another call to OpenAI, all managed by a coordinating mechanism that is open source / trusted.

Better yet, maybe LLMs should be required to have weights released considering they are trained on the collective of human knowledge. Seems strange to use a significant sum of human knowledge that is publicly available then deny everyone access to the weights.

Entirely local and 0 sub-processors is the ideal! I hope we are trending that way as an industry
They are using AWS Nitro instances for their enclaves. These can absolutely be run on-prem with self-hosted licensed software to perform the computational redacting.
Well let me ask the obvious question: Won't this redact data that is obviously crucial to getting the given task done?

Let's say in the case of financial statements, if it can't read credit card numbers and names, then it can't tell you which days some credit card was used and by whom. Maybe that's not the typical use case, but I would imagine it being very annoying, given the already high typical LLM failure rate.

It's a great question. Redaction limits the LLM's ability to draw on the underlying training data on the subject. This can work to the developer's benefit in many cases, like asking questions about your own provided context.

Many developers have gotten away from relying on LLMs for facts, toward providing LLMs with facts and having those facts repurposed.

For example, if you ask an LLM about a famous person, like Wayne Gretzky, it may give you a good answer but there is a chance it may hallucinate key details like the number of points he had in his NHL career.

To combat this, you can provide the LLM with a biography of Wayne Gretzky and you may get more factual answers, but the LLM may still hallucinate if you probe for facts that were not provided.

If you redact his name instead, for example asking “Who is [Name1]?” the LLM will be unable to answer the question without further context. But now, if you provide the redacted biography the LLM can answer the question while relying only on the provided context (the biography will contain information about [Name1]). If the question falls outside of the context the LLM will be unable to answer, which is often the desired result. In other words, the LLM is unable to rely on the training data about Wayne Gretzky because it is only dealing with [Name1] along with redacted locations, organizations, occupations, etc. from the biography about [Name1]. You force the model to rely on the provided facts.

The use cases we see are people providing legal contracts and financial statements where names and currencies get redacted, and the LLM must work with the redacted values and any other context provided.

that's actually pretty brilliant. I can imagine this also being useful for adding a chatbot for a website's content and really trying to limit the responses to only the content from the website as much as possible.
Damn, that is actually a really cool approach.

I suppose most LLMs are not smart enough to make the connection and can be probably told to avoid doing it, but I would imagine that it's not impossible for it to figure out that Name1 is likely Wayne Gretzky from context?

Edit: Yep, it's definitely a problem unless the facts are also anonymized I guess: https://chat.openai.com/share/84dbe124-dca7-46e3-be73-79b194...

I redacted the full wikipedia paragraph with the API. Like, the nickname "The Great One" is a pretty major tell!

[NAME_GIVEN_1] [NAME_FAMILY_1] CC ([NAME_GIVEN_2] [NAME_FAMILY_2]; born [DOB_1]) is a [ORIGIN_1] [OCCUPATION_1] and [OCCUPATION_2]. He played 20 seasons in the [ORGANIZATION_1] ([ORGANIZATION_2]) for four teams from [DATE_INTERVAL_1] to [DATE_INTERVAL_2]. Nicknamed "the Great One",[1] he has been called the greatest [OCCUPATION_3] ever by many [OCCUPATION_4], [OCCUPATION_5], The Hockey News, and by the [ORGANIZATION_2] itself,[2] based on extensive surveys of [OCCUPATION_6], [OCCUPATION_7], [OCCUPATION_8] and [OCCUPATION_9].[3] [NAME_FAMILY_1] is the leading goal scorer, assist producer and point scorer in [ORGANIZATION_2] history,[4] and has more career assists than any other [OCCUPATION_10] has total points. He is the only [ORGANIZATION_2] [OCCUPATION_10] to total over 200 points in one season, a feat he accomplished four times. In addition, [NAME_FAMILY_1] tallied over 100 points in 15 professional seasons, 13 of them consecutive. At the time of his retirement in [DATE_INTERVAL_2], he held 61 [ORGANIZATION_2] records: 40 regular season records, 15 playoff records, and 6 All-Star records.[2]

> Based on the information provided, NAME_GIVEN_1 NAME_FAMILY_1, also known as NAME_GIVEN_2 NAME_FAMILY_2, played in the ORGANIZATION_1, which is also referred to as ORGANIZATION_2. He played for four teams within this organization over the course of 20 seasons, from DATE_INTERVAL_1 to DATE_INTERVAL_2.

Hey that's actually pretty good.

You can use CapeChat UI to mess around with it: https://chat.capeprivacy.com/

Or you can also create a free API key here: https://app.capeprivacy.com/api-keys to use the interactive API directly: https://api.capeprivacy.com/v1/docs#/Privacy/DeidentifyText

Click "Authorize" on the top right to add the key, and then click "Try it out" on any of the endpoints.

Exactly. A super famous person like Wayne Gretzky is really hard to protect.

For fun, you can try to tease ChatGPT with information like "Who is [Name1]?" It won't know, but then add "[Name1] is considered the greatest [Occupation1] in the history of the [Organization1]". "Greatest" is now a big clue. Add "[Name1] has the most points in history". "Points" is a big clue: it's some kind of game or sport, etc. It will eventually figure it out, but I've seen it guess wrong, with Michael Jordan instead.

> De-identification

> Re-identification

Wouldn't these two features address your concern? ChatGPT gets a generated unique ID that is still a consistent value for each card, just not the number itself. Then when the results are returned, that generated ID is turned back into the real card number.

This only becomes a problem when the de-identified data itself is needed to answer a question, like tell me how many Visa cards were used in these transactions by checking the card numbers.

That's right. So in the case of credit card numbers we redact them as [CREDIT_CARD_NUMBER_1], [CREDIT_CARD_NUMBER_2], etc. so the LLM can still answer prompts like "how many", but it can't answer prompts like "sort". But you can use the OpenAI function calling API to do the sort, where your function re-identifies, sorts, and then de-identifies again.
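As a rough sketch of that last idea, the function the model calls could look like the following. The name `sort_redacted` and the shape of `mapping` are assumptions for illustration; the wiring into the function calling API (declaring the function schema and dispatching the call) is omitted here.

```python
def sort_redacted(placeholders, mapping):
    """Sort placeholders by the real values they stand for.

    `mapping` is the placeholder -> original lookup kept on the trusted side.
    Only placeholders are returned, so the model never sees a real card number.
    """
    # Re-identify: pair each placeholder with its real value.
    pairs = [(mapping[placeholder], placeholder) for placeholder in placeholders]
    # Sort on the real values.
    pairs.sort()
    # De-identify again: hand back placeholders only, now in sorted order.
    return [placeholder for _, placeholder in pairs]


mapping = {
    "[CREDIT_CARD_NUMBER_1]": "5500000000000004",
    "[CREDIT_CARD_NUMBER_2]": "4111111111111111",
}
print(sort_redacted(["[CREDIT_CARD_NUMBER_1]", "[CREDIT_CARD_NUMBER_2]"], mapping))
```

The model asks for the sort by passing placeholders as arguments, and the result it receives reveals the ordering but not the underlying numbers.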
I checked out GitHub for more info as suggested, and it seems the main ingredient, https://github.com/capeprivacy/private-ai, is forked from udacity/private-ai. Hmmm. I was expecting to find a clever and useful repo that neatly identifies and strips out personal info, but that's not what it is.

I do think stripping and adding personal info back only when needed is in principle a good idea for some situations. But I have big doubts at the injection of another party into the mix.

Yikes, we'll have to remove that. It's a really old course on privacy-preserving machine learning from 4 years ago and has nothing to do with this product despite the generic name.

Please see https://api.capeprivacy.com/v1/docs#/ for more info.

I can confirm it was forked in the Capeprivacy GitHub repos list. The name is the same as the PII remover mentioned in the link, which I wanted to see how it worked!
Thanks for checking it out! That repo is not related to this project. Did you see it on the main list on https://github.com/capeprivacy or somewhere else? We will try to avoid the confusion in the future.
I wonder what SOTA open-source PII stripping libraries there are? Something like https://github.com/microsoft/presidio for stripping out PII might fill the role I expected https://github.com/capeprivacy/private-ai to do.
Cool concept! I do have a concern about the healthcare aspects of the product as advertised. Do you provide a signed BAA for healthcare organizations? Without one, all the healthcare use cases listed on the site are basically a non-starter. Having said that, if you are using AWS as described, you already have a BAA with them, and providing one to clients should not be a huge deal.
Neat idea, but making this a cloud based SaaS makes it useless for us. The docs claim that your company wouldn't see the data, but we'd still be sending unencrypted data to your own team's black box endpoint. We would have to blindly trust your company, but this isn't any better than just blindly trusting OpenAI.
If I were a user or integrator, how would I know that the de-identification step is actually working? Is there a way to test (and keep testing) that your regex patterns, or whatever mechanism is used, continue to accurately strip my sensitive information before it goes to OpenAI?
Good question. Some developers implement a manual approval step, so you can review the redacted prompt before you submit it rather than making it automatic. It depends on their product requirements.

Re mechanism, the redactions themselves are powered by a language model.
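One way an integrator could keep testing this on their own side is a canary check: feed text containing known synthetic values through the redaction step and confirm none of them survive. The sketch below is an assumption about how such a check might look, with hypothetical names `find_leaks` and `safe_to_send`; it only catches verbatim leaks of values you already know, so it is a regression test, not proof that redaction is complete.

```python
def find_leaks(redacted_text, known_sensitive_values):
    """Return the known sensitive values that still appear verbatim in the redacted text."""
    lowered = redacted_text.lower()
    return [value for value in known_sensitive_values if value.lower() in lowered]


def safe_to_send(redacted_text, known_sensitive_values):
    """Gate for an automatic pipeline: block the prompt if any canary value leaked."""
    return not find_leaks(redacted_text, known_sensitive_values)


canaries = ["Jane Doe", "555-0100"]
print(find_leaks("Call [NAME_1] at 555-0100", canaries))
print(safe_to_send("Call [NAME_1] at [PHONE_NUMBER_1]", canaries))
```

Running a check like this on every release, or on a sample of live prompts with planted canaries, gives an ongoing signal without requiring a human to approve each prompt.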

How do you deal with data persistence for storing the documents/vectorDB inside a Nitro Enclave? I would assume that you as the SaaS vendor are unable to decrypt the sensitive documents inside the enclave or see a user's chat history?
Good question! Uploaded documents get converted to embeddings within the Nitro Enclave (NE), and then the embeddings are encrypted with a key that only the NE has access to.

When the search endpoint is called the encrypted embeddings are pulled into the NE and decrypted. They are then loaded into a vector db in-memory and the search is executed all within the NE. This adds some latency but it’s more secure because embeddings are only accessible to the enclave.
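The search step itself can be illustrated with a small in-memory example. This is a sketch under assumptions, not Cape's implementation: `cosine_similarity` and `search` are hypothetical names, the index is a plain dict of already-decrypted embeddings, and the decryption that would happen inside the enclave before this point is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(index, query_vector, top_k=3):
    """Return the ids of the top_k stored embeddings most similar to the query.

    `index` maps a chunk id to its embedding and lives only in memory.
    """
    scored = [(cosine_similarity(vector, query_vector), chunk_id)
              for chunk_id, vector in index.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Toy two-dimensional embeddings; real ones have hundreds of dimensions or more.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(search(index, [1.0, 0.1], top_k=2))
```

Rebuilding the index in memory on each call is what adds the latency mentioned above; the trade-off is that plaintext embeddings never exist outside the enclave.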

In the case of chat history it is never stored by the API. The developer can develop their own client side. With CapeChat we keep chat history local on the device.