by imh 3886 days ago
I appreciate the privacy standards they used (no humans reading your email to develop this), but am concerned that it's not enough. As I understand it, with language models, overfitting takes the form of returning a sequence of words seen in the training set. If the model is overfitting in any part of the response space, this could happen. Out of a million emails, how many suggested responses are going to substantively resemble another response the original author wouldn't want read by others?
4 comments

Much of this strikes me as a "just because you can, doesn't mean you should" issue. Google clearly loves machine learning and doing cool things but I think lately they've been taking it too far.

For example, after purchasing a book on Amazon recently I happened to do a Google search on that book and the first thing I see is, "Your book is scheduled to be delivered on..." Aside from the creepy factor, I'm left wondering what purpose this serves. I just ordered the book. I KNOW it's on its way.

Turns out they just mined my emails from Gmail to provide it in search results.

I'm sure some developer or product manager thought it would be a cool thing to do without giving any consideration to usefulness much less user privacy. I really don't feel like Google needs to know what I'm buying thankyouverymuch. Gmail account: closed.

> Aside from the creepy factor (...)

One man's "creepy factor" is another's superbly useful feature. The feeling of creepiness probably stems from being surprised and defaulting to negative reaction. Remember that GMail was never supposed to be a dumb mailbox. If you want a dumb mailbox, there are tons of alternatives (i.e. almost every other provider and various open-source UI packages).

Honestly, I really enjoy those "creepy" features and want much more. For me, they can, and definitely should.

And please - like Google cares that you ordered that book from Amazon. Until they do.

Abuses of this technology are inevitable, but we haven't seen them yet. It's the source of the magical "how did they do that" wow factor in technology that touch screens and thin devices used to have.

Maybe I'm weird, but for me none of this (touchscreens too) is "magical", and all of this is "interesting" and then "obvious" when I learn/figure out how they do it. Maybe that's why some people are afraid - because it's more magic to them?

> And please - like Google cares that you ordered that book from Amazon. Until they do.

Well, if Google starts caring that I ordered a book from Amazon (more than they already do - Google Now shows me info on my purchases, including delivery time), then they'll do what exactly? Tell their self-driving cars to kill me because I didn't use Google Play?

Google so far has a stellar history of being helpful, pro-user, quite often pro-bono at it. Please apply these levels of scrutiny to someone else first, like every other SV startup running on the investor-storytime model.

I don't think the risk is Google harming you for the books you bought, but disclosing the information to a government who may.

For example, a government (China?) may pass a law to force Google to disclose the list of nationals who bought certain books (political book criticising the Chinese government?), and Google may choose to comply to stay in that market.

Except China tried something like that and Google abandoned the country. So we have at least one strong data point they're unlikely to do it now.

But basically, you can make such arguments about anything. What if the evil government asks my local bookstore for CCTV recordings and credit card receipts? What if they ask my bank?

If your government wants to be evil, they will find a way to do this, regardless of whether people posted their data all over the Internet or not. The problem is with your government and not with the tools they would use in a hypothetical, unlikely scenario of going batshit insane in the nearby future. It's like a country deciding to destroy all roads and bridges because they can be used by an invasion force to quickly overrun the country. Well, they would be, but since you destroyed them your enemy will airdrop soldiers on you in the extremely unlikely future when they decide to invade. In the meantime, you have no roads and bridges.

The fact that none of the other commenters above made this connection is a bit strange for a community of Tech/Internet/Web people. Is all of this really that shiny?
That's cool that you feel that way, but I really enjoy these features. I'm never home and never sitting down, so being able to see everything I need to pay attention to, including the things I can't remember, is amazing. Google Now reminding me of meetings, flight times, hotel check-out times, package deliveries, etc., are things a personal assistant would do, but without the associated cost of a yuppie's salary.

Just because you don't see it as convenient or even useful doesn't mean no one else does.

I understand why it might seem creepy, at least until you get used to it. But surely you knew that Google's computers were already reading all mail sent to your gmail account. They filter and check for malware and spam based partly on content, and even serve related advertising in the gmail interface (do they still do this?).
Order status and package tracking are extremely popular features of Google Now.
On http://myaccount.google.com you can go into search settings and turn off "Private Results" and it will not return GMail search results.

Edit: It is under Your personal info -> Search settings -> Manage settings

I imagine this could be useful for live demos.
> a "just because you can, doesn't mean you should" issue

To me, this phrase is the essence of much of Google's features. To my discredit, I chuckled when I read the blog's phrase "we've used...deep neural networks to improve...YouTube thumbnails." I am certain this was no easy task, and a resulting technical breakthrough. But doesn't it sound kind of petty?

Of course, what's petty for one is essential for another. I wish every e-mail client had that "undo send" feature, which was just GMail whimsy years ago. Is the line between petty and essential always going to be blurred?

Improving YouTube thumbnails can be a huge usability win. Consider a series of lectures or DIY videos with a common setting. Pulling out a frame that captures something unique about the video (be it the DIY item being worked on in close-up or an important theorem on a title slide or blackboard) makes it easier for users to separate content and find specific items.
Then what isn't 'petty'? It's not like it's zero-sum, there are also people working on using AI for recognizing cancer cells on medical imaging, or to manage climate change risk. And if we don't try, we'll never know what is 'petty' and what is useful. And also, we can learn a lot from the 'exercise' we get in developing small-scale applications of machine learning, which can then be applied later to more 'worthy' applications.
Thumbnails drive engagement -- doesn't seem petty to me.
I love this feature.
But you have a lot more to go off here, and the number of replies is limited to maybe a few thousand at most. It can quickly determine if it's a scheduling email, check your calendar, and generate responses like "I am available" and "I'm busy". For others it can be as simple as "I'll check it out and get back to you". Finally, if you are expected to review the automatically composed response or choose from several options, it's actually not that bad at all. This actually seems a lot like the iOS feature where, if you miss or hang up on an incoming call, you can send a quick SMS reply saying things like "I'll call you back", or automatically add a reminder to call back in an hour.
I'm talking about a slightly different problem. I'm not suggesting that you might accidentally click to send a reply you didn't want to share, so you reviewing it is beside the point. I'm suggesting that by mining all our emails, it might make a suggestion to me based on something you didn't want to share.

E.g. Someone writes me an email about a rare kink you often talk about. You're the main data point on that kink, so it suggests I respond with something you often say when you talk about this topic, maybe including personal details. It's not a totally precise or realistic example, but with large numbers and complex models, unintended things are bound to happen on occasion. Will those things leak information?

As for your comment that the potential replies are limited in number and as structured as you say, I don't get that from the original post, and it doesn't quite fit with my understanding of the model.

You raise a very important point. I'd hope that there is an actual finite (and relatively small) corpus of approved, manually whitelisted answers that are shared across Google accounts. You might get personalized options based on what you write most, but they would not show for other people. Would that be enough to satisfy this concern?
Yes, it would. If curation is too troublesome for gathering a large enough training set, it might be possible to train a smaller curated network with a higher false-negative rate that flags responses that aren't appropriate (personal info, insults, etc) and removes those from the training set.
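
To make the whitelist idea concrete, here is a minimal sketch of how it could look. This is not Google's implementation: `APPROVED_REPLIES`, `suggest`, and the scores are hypothetical names and values made up for illustration. The only point is that whatever the model scores highly, nothing outside the manually approved set can reach another user.

```python
from typing import Iterable

# Hypothetical whitelist: in this sketch, only manually approved replies
# can ever be shown to a user, regardless of what the model proposes.
APPROVED_REPLIES = frozenset({
    "I am available.",
    "I'm busy.",
    "I'll check it out and get back to you.",
})


def suggest(scored_candidates: Iterable[tuple[str, float]], k: int = 3) -> list[str]:
    """Return up to k approved replies, highest model score first."""
    approved = [
        (text, score)
        for text, score in scored_candidates
        if text in APPROVED_REPLIES  # drop anything not on the whitelist
    ]
    approved.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in approved[:k]]


# Made-up model output: the top-scoring candidate echoes personal details
# from someone's mail, so it must never be suggested.
candidates = [
    ("I'm busy.", 0.40),
    ("Sure, and say hi to Dr. Smith about your test results.", 0.95),
    ("I am available.", 0.70),
]
suggestions = suggest(candidates)
print(suggestions)
```

The trade-off is the one raised above: the suggestions are only as varied as the curated list, so curation effort directly limits how useful the feature is.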
You could also do something like googlebombing. Have lots of people send each other the same question, and all of them reply with the same/similar response.
Hehe. I bet they will filter that out :)
Yes, just like they filter out plagiarised copycat websites in PageRank: https://news.ycombinator.com/item?id=10493754
Ah, that makes sense. I think the step where you end up reviewing the reply it composed would prevent most of that.
Overfitting requires "memorizing" the dataset, instead of generalizing it. I think that's very very unlikely. The neural network parameters can only store so many bits of information. But the dataset is millions of times bigger.
That's why I wouldn't worry about how it performs in general, but in edge cases. The question isn't whether it's memorizing the whole dataset, but whether it's "memorizing" any particular points it shouldn't. Kinda like when you do a polynomial regression and the ends go more wild than the middle. The predictions in different parts of the space have different variances, some determined more strongly by single data points.

I have no doubt that in the vast majority of the email space this will do great, but I wonder: will it leak private information anywhere at all?
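
The polynomial regression analogy can be checked numerically. Below is a small sketch using leverage (the diagonal of the hat matrix), which measures how strongly each fitted value depends on its own observation. The degree, the number of points, and the helper name `leverage` are arbitrary choices for illustration and have nothing to do with the email model itself.

```python
import numpy as np


def leverage(x: np.ndarray, degree: int) -> np.ndarray:
    """Diagonal of the hat matrix for a least-squares polynomial fit."""
    X = np.vander(x, degree + 1)   # design matrix, columns x^degree ... x^0
    Q, _ = np.linalg.qr(X)         # reduced QR, so the hat matrix is Q @ Q.T
    return np.sum(Q**2, axis=1)    # row-wise squared norms give diag(Q @ Q.T)


# Assumed setup: 21 evenly spaced points on [-1, 1], degree-5 polynomial.
x = np.linspace(-1.0, 1.0, 21)
h = leverage(x, degree=5)

print("leverage at the end:   ", h[0])
print("leverage in the middle:", h[10])
print("sum of leverages:      ", h.sum())  # equals the number of coefficients
```

With independent noise of equal variance, the variance of each fitted value is proportional to its leverage, so the points at the ends are both noisier and more strongly determined by a single observation than the points in the middle. That is the sense in which a model can be fine on average and still be "memorizing" a handful of points.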

That is a potential problem, especially if there is overfitting, but personal details are not likely to be generalized.