by troops_h8r 1158 days ago
I don't think I see enough discussion about what this means for privacy. There was some protection in the fact that it was prohibitively expensive to get someone to listen to every single one of our phonecalls/read all our emails/etc.

Worrying that this will no longer be the case.

6 comments

The top use case I've been hearing is in legal discovery. Law firms used to play games with diligence by disclosing TBs of email and making it cost-prohibitive to find relevant emails. This task would normally require a $60-100/hr paralegal or lawyer.

GPT-4 can do that task for fractions of a penny per email now. It doesn't have to be perfect if it's competing with nothing. I expect we'll see similar shops for any other high-cost paper-trail business.

Is there a solution to the issue of data stewardship yet? I'd imagine it typically would not be permissible to send a bunch of proprietary legal documents off to OpenAI.

What I'd really love to implement is a way for GPT-4 to answer questions based on a corpus of "all our Confluence pages plus random other sources of documentation." Like with the legal document issue, it's a bit of a nonstarter right now given the proprietary nature of corporate documentation.
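A rough sketch of the retrieval half of that idea, assuming the pages have already been exported as plain text. Everything here is hypothetical (the `PAGES` dict, `top_pages`, `build_prompt`), and the similarity measure is a plain bag-of-words cosine rather than whatever a real deployment would use. The point is only that a small, relevant slice of the corpus gets placed into the prompt, which does not by itself solve the data stewardship problem.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_pages(question, pages, k=2):
    """Return the titles of the k pages whose bodies best match the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(body))), title)
              for title, body in pages.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:k] if score > 0]


def build_prompt(question, pages, k=2):
    """Assemble a prompt containing only the retrieved pages."""
    context = "\n\n".join(f"## {t}\n{pages[t]}"
                          for t in top_pages(question, pages, k))
    return (f"Answer using only the pages below.\n\n{context}\n\n"
            f"Question: {question}")


# Hypothetical stand-in for an exported documentation corpus.
PAGES = {
    "Problem X postmortem": "We worked on Problem X in Q2. Outcome: fixed by caching the lookup.",
    "Onboarding": "How to request a laptop and set up VPN access.",
    "Holiday schedule": "Office closures for the year.",
}

QUESTION = "Has anyone worked on Problem X, and what was the outcome?"
print(build_prompt(QUESTION, PAGES))
```

The resulting string would then be sent to whichever model endpoint is acceptable for the data in question.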

GPT-4-based searchability of my work's Confluence would save me a ton of time.

"Hey, has anyone worked on Problem X, and what was the outcome of their project"

AFAIK the Azure APIs provide suitable data usage requirements. One of the most fascinating aspects of the AI world is that we've made extraordinarily expensive brute force search a valuable tool.
Why do you think there has been mass surveillance of American domestic communications since forever ago, as leaked by Snowden? This technology has been available since then and can effectively summarize millions of pieces of communication.
Yeah but have you seen the leaked slides? It's clear that they have only the ability to analyse 10% or less of the data they are storing.

GPT-like systems will close that gap, and then come all the problems of automated law enforcement: extrapolation from incomplete data, false positives from coincidences, interpretation errors, all that annoying stuff.

Just wait until they go back and have the bandwidth to finally analyse all the historical data they've accumulated...
I’m guessing they are, and have been, for a while.
To add to that, the leaked documents from Snowden described a query language not unlike doing Boolean searches. Nothing close to GPT’s ability to comprehend human asks.
Yes but you can get GPT to write structured queries based on natural language. It works very, very well at turning normal phrases into SQL.
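A minimal sketch of that pattern, with the model call stubbed out. The `messages` schema, `build_prompt`, `is_safe_select`, and `canned_model` are all hypothetical names invented for illustration; `canned_model` just returns a fixed string where a real LLM call would go. The one design point worth showing is that generated SQL should be checked before it is run.

```python
import sqlite3

# Hypothetical schema, used only for illustration.
SCHEMA = (
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, sender TEXT, sent_at TEXT, body TEXT)"
)


def build_prompt(question, schema=SCHEMA):
    """Give the model the schema plus the natural-language question."""
    return (
        "Translate the question into a single SQLite SELECT statement.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\nSQL:"
    )


def is_safe_select(sql, conn):
    """Accept only a single SELECT statement that SQLite can compile.

    Deliberately conservative: any embedded ';' is rejected, even inside
    a string literal.
    """
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped or not stripped.lower().startswith("select"):
        return False
    try:
        # Compiles the statement without running the query itself.
        conn.execute("EXPLAIN QUERY PLAN " + stripped)
    except sqlite3.Error:
        return False
    return True


def run_question(question, model, conn):
    """Ask the model for SQL, validate it, then execute it."""
    sql = model(build_prompt(question))
    if not is_safe_select(sql, conn):
        raise ValueError(f"rejected generated SQL: {sql!r}")
    return conn.execute(sql).fetchall()


def canned_model(prompt):
    """Stand-in for a real LLM call; always returns the same query."""
    return "SELECT sender, COUNT(*) FROM messages GROUP BY sender ORDER BY sender;"


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO messages (sender, sent_at, body) VALUES (?, ?, ?)",
    [
        ("alice", "2023-03-01", "lunch?"),
        ("bob", "2023-03-01", "sure"),
        ("alice", "2023-03-02", "running late"),
    ],
)

rows = run_question("How many messages did each sender send?", canned_model, conn)
print(rows)
```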
Collecting an ocean of data from everyone isn't the same as actually painstakingly tying all the pieces together for everyone.

They've created a huge library of unorganized data. The difference here is they now can spawn a million untiring AI private investigators / librarians to organize this information into coherent "case files".

At least for me, until this point I've had a feeling of anonymity in the idea that, while my data is being slurped up, I'm just one data point in a sea of other 'normal' people. There would be little value in spending government time and effort tying all of the web detritus together for me. The juice would definitely not be worth the squeeze.

However, when the cost of this effort is nearly zero, that now becomes a different story. The balance of power between government and the people it rules is going to radically shift.

Not exactly. Gov had to be selective because its surveillance required a lot of resources per person/call. New technology allows it cheaply and en masse. Voice calls can be recorded, then converted to text, then filtered, and humans will only analyze something of interest. It's like how we had the alphabet, books, and newspapers for hundreds of years, but only with the internet did we get the ability to process them easily.
Not only converted to text; it seems likely that we can document sentiment around the person's speech. For example, if you're a low-priority target that's still on the radar but not high enough on the list to get a human handler, I could see something like: not only what you said, but were you laughing, angry, crying? Does the tone of your voice indicate the likelihood of action in a short time frame?
> Gov had to be selective because its surveillance required a lot of resources

Not exactly. Where do you think all those budget trillions that don't have to be accounted for go? The FBI+NSA (=CIA but for citizens) have infinite resources.

All the overhead they have is to make sure a small subset of the citizens are not impacted. Snowden goes into this in some detail when talking about day to day operations. The norm is to extend the net as wide as possible, until you reach some politician or government agency.

> This technology has been available since then

No, it hasn't lol.

How sure are you about that? The basic theories have been around since the 70s, have been proven at scale in the last decade, and the NSA has more data and compute than anybody else. I’d be shocked if they aren’t very far along in solving many problems.
This isn’t really a “throw money at it” problem like the government is good at.

Take drones for example. The government got really good at those because they made them jet-powered (lol) and blew a bunch of money on server-grade FPGAs in each one of them.

You can’t really just buy a lot of GPUs to make an LLM work, you need iterative development of architecture and training methods.

Like maybe the government invented self-attention before 2017, but if they didn’t, then the constraint is training time, and the government has the same number of seconds as the rest of us.

Did you, like, forget that the military invented the nuclear bomb, semiconductors, coding, AI, NLP, the INTERNET?

Everything you use is from the military.

The government is good at lying, and making themselves ‘appear’ incompetent.

They secretly probably have a much further advanced quantum computer. Your viewpoint is limited to mainstream technology and mainstream science.

This is such an interesting take.

The military invented the nuclear bomb yes. But Fermi did most of his thinking work in Italy before the Manhattan project. He got money thrown at him once he got here.

As for semiconductor devices, it was Bell Labs and TI.

Coding is an ambiguous concept that wasn’t really invented, but if it were, it would have first appeared in programmable looms.

The military likes to take credit for things, but really all they do is throw money at existing inventions.

I’m sure they’re throwing a bunch of money at Transformers now, but who are all these uncredited super geniuses who invent things and then let randos at Google take the credit/earn the money?

I wouldn’t rule out NSA being ahead of the curve, but you have a good point re: GPUs. Likely another factor in the CHIPS act.
Pretty sure.

Even the Manhattan project had nuclear research going on in public universities at the time.

Nothing of the sort here for the attention mechanism which underpins LLMs we know today.

Fundamental research isn't something you just throw money at and acquire. All we had back then were cleverbot and other expert systems.

More my point is they have as big a research budget as a corporate lab.
The military invented AI and NLP which underpins LLMs.

The military is responsible for most of the technology we use and talk about today. The government may appear incompetent, but we’re living off military hand-me-downs; the entire world is.

Your argument reminds me of the discussions about the moon landing.

If today's hardware had been available 20 yrs ago, this would've been possible, just like the moon landing could've been faked if it had taken place 20+ yrs later. The technology wasn't available at the time (GPUs in this case, and generally no experience in doing such advanced trick techniques for movies back then).

These models are having such a strong effect now because we've finally got the hardware to run them.

I'm not so sure about that. I mean, maybe not since the Snowden leaks but how do we know that governments haven't been running their own LLMs for the last five years or so? We know that they're using sockpuppets[1]. We know that they're astroturfing[2]. Integrating LLMs into their toolkits seems like an obvious move, so obvious that they would be stupid not to do it.

[1] https://www.theguardian.com/technology/2011/mar/17/us-spy-op...

[2] https://boingboing.net/2015/06/22/gchqs-psy-ops-squad-target...

>haven't been running their own LLMs for the last five years

Because the hardware has not existed.

That said, by accident I've seen hardware that was brought to a testing company by federal marshals: massively parallel custom hardware that was likely for signal processing a lot of channels at once. So there is plenty of custom hardware out there, but these items have not been produced at the scale needed (from what anyone can tell) and, again from what we can tell, they don't have the general processing capability that GPU/TPU-driven LLMs have.

Yes, it has. Consumer-wise, we've had Dragon NaturallySpeaking since the late 90s. It's pretty simple to have a script read what it outputs text-wise and look for keywords. No AI is even needed to do this.
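A script like that might look roughly like the following. `find_keywords` and the sample transcript are hypothetical; the only assumption is that the dictation software has already written plain text, one utterance per line.

```python
import re


def find_keywords(transcript, keywords):
    """Return (line_number, matched_keywords) for each line with a hit."""
    hits = []
    for lineno, line in enumerate(transcript.splitlines(), start=1):
        # Whole-word matching, so "gunner" does not match "gun".
        words = set(re.findall(r"[a-z']+", line.lower()))
        matched = sorted(words & keywords)
        if matched:
            hits.append((lineno, matched))
    return hits


# Made-up transcript text, for illustration only.
TRANSCRIPT = (
    "see you at the airport\n"
    "bring the invoice tomorrow\n"
    "nothing interesting here"
)

print(find_keywords(TRANSCRIPT, {"airport", "invoice"}))
```

Note that this only reports that a word occurred, not what the line was actually about.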
Gaussian transcription models are old, but they also are AI.

They are not deep learning/neural nets.

Also fun fact as a pedant tax: Symantec is so named because they started out as transcription software, hit a wall, and pivoted to security SW.

I think that was metadata and not actual audio of conversations.
Man, I can't wait for the AI to start hallucinating crimes.
Of all the futurisms in Minority Report I really didn't think this one would show up so early.
Shotspotter is a thing and it already has been doing that for years (both on its own and on law enforcement request.)
Shotspotter isn't an AI though is it? I thought it was just triangulation of gunshot locations using microphones and synchronized clocks?
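The triangulation idea can be sketched as time-difference-of-arrival (TDOA) multilateration. This is a toy under stated assumptions, not a description of what ShotSpotter actually runs: the sensor layout, the 343 m/s speed of sound, and the brute-force grid search are all made up for illustration. Because only differences between arrival times are used, the unknown moment of the shot cancels out.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed constant (roughly 20 C air)


def arrival_times(source, sensors, t0=0.0):
    """Simulate when a sound emitted at time t0 reaches each sensor."""
    return [t0 + math.dist(source, s) / SPEED_OF_SOUND for s in sensors]


def locate(sensors, times, extent=1000.0, step=5.0):
    """Grid-search the square [0, extent]^2 for the best-fitting source.

    For each candidate point, compare the predicted arrival-time
    differences (relative to sensor 0) with the observed ones.
    """
    best, best_err = None, math.inf
    n = int(extent / step)
    for i in range(n + 1):
        for j in range(n + 1):
            p = (i * step, j * step)
            d = [math.dist(p, s) / SPEED_OF_SOUND for s in sensors]
            err = sum(
                ((times[k] - times[0]) - (d[k] - d[0])) ** 2
                for k in range(1, len(sensors))
            )
            if err < best_err:
                best, best_err = p, err
    return best


# Hypothetical layout: four microphones at the corners of a 1 km square.
SENSORS = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0)]
true_source = (400.0, 650.0)

# The emission time (12.5 s here) is unknown to the locator.
observed = arrival_times(true_source, SENSORS, t0=12.5)
estimate = locate(SENSORS, observed)
print(estimate)
```

With noisy clocks or echoes the residual no longer drops to near zero at a single point, which is presumably where the extra signal processing and human review come in.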
> Shotspotter isn’t an AI though is it?

Shotspotter has been billed as a “system of sensors, software, AI and expert human review that accurately detects, locates and alerts police to gunfire”, and the company behind it (formerly “Shotspotter” was the company name, it's recently been renamed “SoundThinking”) has a number of other AI-involved law enforcement products now, as well.

That's a layman's explanation. ShotSpotter is likely a passive radar system. In recent years, you can combine signal processing and supervised learning (neural nets) to get better direction-of-arrival estimations.
Already started. At least two reported cases.
I mean, if we get to the point where AI is pointing the finger at someone, I hope that a human will double-check it at least.
Turns out having humans in the loop doesn't help that much:

https://www.cnn.com/2021/04/29/tech/nijeer-parks-facial-reco...

How long before the AI built in the US figures out its reward function is 80% more likely to be satisfied if it points its digital finger at someone black rather than the most likely subject?
They are proposing regulatory recourse here https://www.whitehouse.gov/ostp/ai-bill-of-rights/human-alte...
You seem to have not heard about the way they do things in the US
> There was some protection in the fact that it was prohibitively expensive to get someone to listen to every single one of our phonecalls/read all our emails/etc.

That's already how it worked on platforms like MTurk and UHRS: lots of the work was transcribing audio dumps from microphones built into computers/phones/smart home devices. UHRS especially had a lot of that (it's owned by MS), as well as search-engine-grading type work. They also certainly do not pay well. I'd imagine that in practice there isn't much cost difference between paying a bunch of bored people to do it and the compute cost of running an AI model to do it, but the AI model will be vastly more accurate and will work 24/7.

Now that is something I hadn't considered. Woah.
Not to sound condescending, but really? How is this not immediately where your mind goes? Every piece of information ever recorded can now be summarized and cross-linked efficiently. Privacy is beyond dead. Soon every authoritarian government (and democratic ones, albeit secretly) will have integrated platforms that track every single one of your movements, known contacts, internet usage, financial data, and correspondence. Big Brother has NEVER EVER been more effective than it will become.
Yeah, I think the NSA is going to get their money's worth for that Utah Datacenter that they started building like 20 years ago.
Looking at this from far away, with the Snowden revelations in mind, I'd think it's not tinfoil hat territory to assume that some of the progress at OpenAI got achieved with some help from well-resourced folks in the USG/Three Letter Agencies.
I don’t think they helped them. Now, did they train off the same data sources? Well, since OpenAI isn’t saying what GPT-4 is trained on, and the NSA can hoover up all kinds of non-public data, it stands to reason they may both be doing something slightly shady with emails, texts, and the like.
Yeah, I've started wondering similar things about that too, like how far ahead is the NSA on this stuff? And how does that tie in to the recent policy of denying China semiconductors?

Perhaps history will show that the NSA made algorithmic breakthroughs a few years ago and realized what was coming, so political policy was crafted to stymie Chinese progress in this field, and what we're seeing in the public sphere from companies like OpenAI is a managed release of the technology into the public, OpenAI at least managing to independently discover the same breakthroughs that the NSA made a few years ago.

You're seeing the government entity as separate from the corporate entity, but quite often in the US it's the other way around. The government entity is a rather hollow shell, and the 'brains' of the operation are contracted out to the corporation. The government entity would almost cease to exist if the corporation under it magically disappeared.
I doubt it. Scraped tweets and Reddit are already huge.
>How is this not immediately where your mind goes?

Most people don't really think about things that don't affect their day-to-day lives. This includes the specifics of how Governments might run a mass surveillance plan.

I know right! It's so obvious.
With regard to privacy, what’s the difference between your email’s text stored on a server, and your email’s text alongside the output of the text processed through an LLM? If “they” can already look at the text, what more privacy is there to lose?
There's a great deal of privacy in simply being a needle in a haystack. Part of the processing that's possible with an LLM is filtering.

Imagine you've sent an email about transporting a friend's daughter across state lines to get a medically-necessary abortion. Or if you prefer, imagine you've arranged via email to "lose" some firearms which don't comply with your state's new assault weapons ban.

Pre-LLMs, trying to find these sorts of emails was very hard. A simple text search for "abortion" or "gun" is going to come up with far more emails where two family members got into a political debate, than emails about lawbreaking. Big Brother will find a few such emails here and there by chance, but the vast majority of such incriminating emails will simply be lost in the pile.

Enter LLMs, and Big Brother can feed some of the incriminating emails found by chance into a training dataset along with a bunch of non-incriminating emails, and teach the AI to find incriminating emails, and then apply the model to the entire list of emails and get a nicely filtered list of only the emails which are incriminating, further tuning the model by adding emails it gets wrong to the training dataset when they are found.
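Mechanically, the loop described above is ordinary supervised text classification with a feedback step. A toy sketch using a hand-rolled naive Bayes classifier rather than an LLM, with deliberately neutral made-up data: every example contains the keyword "refund", so a plain text search cannot separate the two groups, but a trained model can. The `NaiveBayes` class and the "flag"/"ignore" labels are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of tokens
        self.doc_counts = Counter()  # label -> number of examples
        self.vocab = set()

    def add(self, text, label):
        """Add one labelled example; also used for later corrections."""
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes()
# Messages actually requesting something vs. messages merely discussing it.
model.add("please refund my order, it never arrived", "flag")
model.add("I want a refund for the broken item", "flag")
model.add("our refund policy debate at dinner got heated", "ignore")
model.add("uncle keeps arguing about refund policy in politics", "ignore")

first = model.predict("the item arrived broken, refund please")
second = model.predict("heated debate about refund policy")
print(first, second)

# Feedback step: a misclassified message, once spotted, is added back
# with the correct label so later predictions improve.
model.add("still arguing with my uncle about that refund", "ignore")
```

The same structure applies whatever the classifier is; an LLM mostly changes how few labelled examples are needed and how well it handles paraphrase.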