
by supriyo-biswas 385 days ago

I wonder whether OpenAI legal can make the case for storing fuzzy hashes of the content, in the form of ssdeep[1] hashes or content-defined chunks[2] of said data, instead of the actual conversations themselves.

After all, since the NYT has a very limited corpus of information, and supposedly people are generating infringing content using their APIs, said hashes can be used to compare whether such content has been generated.

I'd rather have them store nothing, but given the overly broad court order I think this may be the best middle ground. Of course, I haven't read the lawsuit documents and don't know if NYT is requesting far more, or alleging some indirect form of infringement which would invalidate my proposal.

[1] https://ssdeep-project.github.io/ssdeep/index.html

[2] https://joshleeb.com/posts/content-defined-chunking.html
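To make the proposal above concrete, here is a minimal sketch of content-defined chunking plus per-chunk hashing, assuming a simple gear-style rolling hash. This is not ssdeep's algorithm and not the one from the linked post; the names `cdc_chunks`, `fingerprint`, `containment` and the constants are hypothetical, and the chunk size is kept tiny so short strings split into several chunks. Only the chunk hashes would need to be stored, and a reference article could later be checked against them.

```python
import hashlib

# Hypothetical parameters, kept small for illustration only.
BOUNDARY_BITS = 4        # a cut fires on roughly 1 in 16 bytes
HASH_MASK = 0xFFFFFFFF   # keep the rolling hash at 32 bits

# Deterministic pseudo-random 32-bit value per byte value (a "gear" table).
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data wherever a rolling hash of the last 32 bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each left shift pushes older bytes out; after 32 steps a byte has no effect.
        h = ((h << 1) + GEAR[byte]) & HASH_MASK
        if h >> (32 - BOUNDARY_BITS) == 0:  # top bits all zero: cut after this byte
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # trailing bytes form the last chunk
    return chunks


def fingerprint(data: bytes) -> set[str]:
    """Set of SHA-256 digests of the chunks; this is all that would be retained."""
    return {hashlib.sha256(chunk).hexdigest() for chunk in cdc_chunks(data)}


def containment(reference: bytes, candidate: bytes) -> float:
    """Fraction of the reference's chunk hashes that also occur in the candidate."""
    ref = fingerprint(reference)
    if not ref:
        return 0.0
    return len(ref & fingerprint(candidate)) / len(ref)
```

Because cut points depend only on nearby bytes, text quoted verbatim inside a longer output tends to produce the same chunks as the original, so `containment(article, output)` rises with the amount of verbatim overlap. Exact chunk hashes do not survive paraphrasing or small edits inside a chunk, which is one reason this may fall short of what a court would want.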

4 comments

delusional 385 days ago

I haven't been able to find any of the supporting documents, but the court order makes it seem like OpenAI has been unhelpful in producing any alternative during the conversation.

For example, the judge seems to have asked if it would be possible to segregate data that the users wanted deleted from other data, but OpenAI has failed to answer. Not just denied the request, but simply ignored it.

I think it's quite likely that OpenAI has taken the PR route instead of seriously engaging with any way to constructively honor the request for retention of data.

paxys 385 days ago

Yeah, try explaining any of these words to a lawyer or judge.

sthatipamala 385 days ago

The judges in these technical cases can be quite sophisticated and absolutely do learn terms of art. See Oracle v. Google (Java API case)

anshumankmr 385 days ago

I looked up the judge for that one (https://en.wikipedia.org/wiki/William_Alsup), who was a hobbyist BASIC programmer. If that is the bar, one would need a judge who coded MNIST as a pastime.

king_magic 385 days ago

a smart judge who is minimally tech savvy could learn to train a model to predict MNIST in a day or two

fc417fc802 385 days ago

I thought that's what GPT was for.

m463 385 days ago

"you are a helpful law assistant."

landl0rd 385 days ago

"You are a long-suffering clerk speaking to a judge who's sat the same federal bench for two decades and who believes 'everything is computer' constitutes a deep technical insight."

LandoCalrissian 385 days ago

Trying to actively circumvent the intention of a judge's order is a pretty bad idea.

Aeolun 385 days ago

That’s not circumvention though. The intent of the order is to be able to prove that ChatGPT regurgitates NYT content, not to read the personal communications of all ChatGPT users.

girvo 385 days ago

Deeply, deeply so. In fact, so much so that people who suggest such things show they've (luckily) not had to interact with the legal system much. Judges take an incredibly dim view of that kind of thing, haha.

bigyabai 385 days ago

All of that does fit on a real spiffy whitepaper. Let's not fool around though, every ChatGPT session is sent directly into an S3 bucket that some three-letter spook backs up onto their tapes every month. It's a database of candid, timestamped text interactions from a bunch of rubes that logged in with their Google account - you couldn't ask for a juicier target unless you reinvented email. Of course it's backdoored, you can't even begin to try proving me wrong.

Maybe I'm alone, but a pinkie-promise from Sam Altman does not confer any assurances about my data to me. It's about equally as reassuring as a singing telegram from Mark Zuckerberg dancing to a song about how secure WhatsApp is.

landl0rd 385 days ago

Of course I can't even begin trying to prove you wrong. You're making an unfalsifiable statement. You're pointing to the Russell's Teapot of sigint.

It's well-established that the American IC, primarily NSA, collects a lot of metadata about internet traffic. There are some justifications for this and it's less bad in the age of ubiquitous TLS, but it generally sucks. However, legal protections against directly spying on the actual decrypted content of Americans are at least in theory stronger.

Snowden's leaks mentioned the NSA tapping inter-DC links of Google and Yahoo, so if they had to tap links, I doubt there's a ton of voluntary cooperation.

I'd also point out that trying to parse the unabridged prodigious output of the SlopGenerator9000 is a really hard task unless you also use LLMs to do it.

tdeck 385 days ago

> Snowden's leaks mentioned the NSA tapping inter-DC links of Google and Yahoo, so if they had to tap links, I doubt there's a ton of voluntary cooperation.

The laws have changed since then and it's not for the better:

https://www.aclu.org/press-releases/congress-passing-bill-th...

tuckerman 385 days ago

Even if the laws give them this power, I believe it would be extremely difficult for an operation like this to go unnoticed (and therefore unreported) at most of these companies. MUSCULAR [1] was able to be pulled off because of the cleartext inter-datacenter traffic which was subsequently encrypted. It's hard to see how they could pull off a similar operation without the cooperation of Google which would also entail a tremendous internal cover up.

[1] https://en.wikipedia.org/wiki/MUSCULAR

onli 385 days ago

Warrantlessly installed backdoors in the log system combined with a gag order, combined with secret courts, all "perfectly legal". Not really hard to imagine.

tuckerman 385 days ago

You would have to gag a huge chunk of the engineers and I just don’t think that would work without leaks. Google’s infrastructure would not make something like that easy to do clandestinely (trying to avoid saying impossible but it gets close).

I was an SRE and SWE on technical infra at Google, specifically the logging infrastructure. I am under no gag order.

dmurray 385 days ago

> You're pointing to the Russell's Teapot of sigint.

If there were multiple agencies with billion dollar budgets and a belief that they had an absolute national security mandate to get a teapot into solar orbit, and to lie about it, I would believe there was enough porcelain up there to make a second asteroid belt.

cwillu 385 days ago

> I'd also point out that trying to parse the unabridged prodigious output of the SlopGenerator9000 is a really hard task unless you also use LLMs to do it.

The input is what's interesting.

Aeolun 385 days ago

It doesn’t change the monumental scope of the problem though.

Though I’m inclined to believe the US gov can if OpenAI can.

Yizahi 385 days ago

Metadata is spying (c) Bruce Schneier

If a CIA spook is stalking you everywhere, documenting your every visible move or interaction, you probably would call that spying. Same applies to digital.

Also, the teapot argument can be applied in reverse. We have all these documented open digital network systems everywhere, and you want to say that one of the most unprofitable and certainly the most expensive-to-run systems is somehow protecting all user data? That belief is based on what? At least selling data is based on evidence of the industry and on actual ToS'es of other similar corpos.

jstanley 385 days ago

The comment you replied to isn't saying that metadata isn't spying. It's saying that the spies generally don't have free access to content data.

rl3 385 days ago

>However, legal protections against directly spying on the actual decrypted content of Americans are at least in theory stronger.

Yeah, because collection was redefined to mean accessing the full content already stored on their systems, post-interception. It wasn't considered collected until an analyst viewed it. Metadata was a laughable dog and pony show that was part of the same legal shell games at the time, over a decade ago now.

That said, from an outsider's perspective it sounded like the IC did collectively erect robust guard rails such that access to information was generally controlled and audited. I felt like this broke down a bit once sharing 702 data with other federal agencies was expanded around the same time period.

These days, those guard rails might be the only thing standing in the way of democracy as we know it ending in the US. AI processing applied to full-take collection is terrifying, just ask the Chinese.

zer00eyz 385 days ago

> However, legal protections against directly spying on the actual decrypted content of Americans are at least in theory stronger.

This was the point of a lot of the Five Eyes programs. It's not legal for the US to spy on its own citizens, but it isn't against the law for us to do it to the Australians... who are all too happy to reciprocate.

> Snowden's leaks mentioned the NSA tapping inter-DC links of Google and Yahoo...

Snowden's info wasn't really news for many of us who were paying attention in the aftermath of 9/11: https://en.wikipedia.org/wiki/Room_641A (This was huge on slashdot at the time... )

komali2 385 days ago

There's no way to know, but it's safer to assume.

Workaccount2 385 days ago

My choice conspiracy is that the three letter agencies actively support their omnipresent, omniknowing conspiracies because it ultimately plays into their hand. Sorta like a Santa Claus for citizens.

bigyabai 384 days ago

> because it ultimately plays into their hand.

How? Scared criminals aren't going to make themselves easy to find. Three-letter spooks would almost certainly prefer to smoke-test a docile population than a paranoid one.

In fact, it kinda overwhelmingly seems like the opposite happens. Remember the 2015 San Bernardino shooting that was pushed into the national news for no reason? Remember how the FBI bloviated about how hard it was to get information from an iPhone, 3 years after Tim Cook's assent to the PRISM program?

Stuff like this is almost certainly theater. If OpenAI perceived retention as a life-or-death issue, they would be screaming about this case at the top of their lungs. If the FBI perceived it as a life-or-death issue, we would never hear about it in our lifetimes. The dramatic and protracted public fights suggest to me that OpenAI simply wants an alibi. Some sort of user-story that smells like secure and private technology, but in actuality is very obviously neither.

7speter 385 days ago

Maybe I’m wrong, and maybe this was discussed previously, but of course openai keeps our data, they use it for training!

nl 385 days ago

As the linked page points out you can turn this off in settings if you are an end user or choose zero retention if you are an API user.

justacrow 385 days ago

I mean, they already stole and used all copyrighted material they could find to train the thing, am I supposed to believe that they won't use my data just because I tick a checkbox?

stock_toaster 385 days ago

Agreed, I have a hard time believing anything the eye-scanning crypto coin (worldcoin or whatever) guy says at this point.

Jackpillar 384 days ago

I wish I could test drive your brain to experience a world where one believes that would stop them from stealing your data.

rl3 385 days ago

>Of course it's backdoored, you can't even begin to try proving me wrong.

On the contrary.

>Maybe I'm alone, but a pinkie-promise from Sam Altman does not confer any assurances about my data to me.

I think you're being unduly paranoid. /s

https://www.theverge.com/2024/6/13/24178079/openai-board-pau...

https://www.wsj.com/tech/ai/the-real-story-behind-sam-altman...

farts_mckensy 385 days ago

Think of all the complete garbage interactions you'd have to sift through to find anything useful from a national security standpoint. The data is practically obfuscated by virtue of its banality.

artursapek 385 days ago

I’ve done my part cluttering it with my requests for the same banana bread recipe like 5 separate times.

refuser 385 days ago

It was that good?

baobun 385 days ago

gief

bigyabai 385 days ago

"We kill people based on metadata." - National Security Agency Gen. Michael Hayden

Raw data with time-series significance is their absolute favorite. You might argue something like Google Maps data is "obfuscated by virtue of its banality" until you catch the right person in the wrong place. ChatGPT sessions are the same way, and it's going to be fed into aggregate surveillance systems in the way modern telecom and advertiser data is.

farts_mckensy 385 days ago

This is mostly security theater, and generally not worth the lift when you consider the steps needed to unlock the value of that data in the context of investigations.

bigyabai 385 days ago

Citation?

farts_mckensy 384 days ago

-The Privacy and Civil Liberties Oversight Board’s 2014 review of the NSA “Section 215” phone-record program found no instance in which the dragnet produced a counter-terror lead that couldn’t have been obtained with targeted subpoenas. https://en.m.wikipedia.org/wiki/Privacy_and_Civil_Liberties_...

-After Boston, Paris, Manchester, and other attacks, post-mortems showed the perpetrators were already in government databases. Analysts simply didn’t connect the dots amid the flood of benign hits. https://www.newyorker.com/magazine/2015/01/26/whole-haystack

-Independent tallies suggest dozens of civilians killed for every intended high-value target in Yemen and Pakistan, largely because metadata mis-identifies phones that change pockets. https://committees.parliament.uk/writtenevidence/36962/pdf

brigandish 385 days ago

Search engines have been doing this since the mid 90s and have only improved. To think that any data is obfuscated by its being part of some huge volume of other data is a fallacy at best.

farts_mckensy 385 days ago

Search engines use our data for completely different purposes.

yunwal 385 days ago

That doesn’t negate the GP's point. It’s easy to make datasets searchable.

farts_mckensy 385 days ago

Searchable? You have to know what to search for, and you have to rule out false positives. How do you discern a person roleplaying some secret agent scenario vs. a person actually plotting something? That's not something a search function can distinguish. It requires a human to sift through that data.