Hacker News
by freeamz 417 days ago
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

I think it is NOT just you. Most companies with decent management also would not want their data going to anything outside the physical server they have in control of. But yeah, most people just use an app and a hosted server. But this is HN, there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

2 comments

"Most companies with decent management also would not want their data going to anything outside the physical server they have in control of."

I don't think that's been true for over a decade: AWS wouldn't be a trillion-dollar business if most companies still wanted to stay on-premise.

Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?

I assume it's because of the worry that the AI companies will train on your data, causing it to leak? But I thought all these services had enterprise tiers where they'll promise not to do that?

Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)

It is due to the risk of a leak.

Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship if the entire cargo is filled with lifeboats - just because they promised they would.
How can you be sure that AWS will not use your data to train their models? They have an enormous amount of data, probably the most data in the world.
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) - but it seems possible.

In Scandinavia, finance-related servers must be in the country! That always sounded like a sane approach. The whole putting your data on SaaS or AWS just seems like the same "Let's shift the responsibility to a big player".

Any important data should NOT be on devices that are NOT physically within our jurisdiction.

Or GitHub. I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub. All big LLM providers offer no-training-on-your-data business plans.
> I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub

What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.

Concern about platform risk in regard to Microsoft is historically justified.
Unlikely they think Microsoft or GitHub wants to steal it.

With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.

But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.

* https://en.wikipedia.org/wiki/Blinkenlights

Regulations sometimes matter. Stupid "security" rules sometimes matter too.
While none of that is false, I think there's a big difference between shipping your data to an external LLM API and using AWS.

Using AWS is basically a "physical server they have control of".

That's why AWS Bedrock and Google Vertex AI and Azure AI model inference exist - they're all hosted LLM services that offer the same compliance guarantees that you get from regular AWS-style hosting agreements.
As in AWS is a much bigger security concern?
AWS has a strong track record, a clear business model that isn’t predicated on gathering as much data as possible, and an awful lot to lose if they break their promises.

Lots of AI companies have some of these, but not to the same extent.

> "Most companies with decent management also would not want their data going to anything outside the physical server they have in control of."

Most companies' physical and digital security controls are so much worse than anything from AWS or Google. Note I don't include Azure... but "a physical server they have control of" is a phrase that screams vulnerability.