Hacker News
by martius 1176 days ago
Disclaimer: SRE at Google, not on Drive.

I didn't look into this and don't know what really happened, but I can guess.

It's more likely a tradeoff. Either you set this limit and protect your service from an identified scalability limit with the current architecture, or you plan for a (possibly long and expensive) redesign to get rid of this bottleneck.

My guess is that a group of SRE and devs identified the risk, listed their options, evaluated the impact on users (eg: which fraction of users have more than 5M files) and assumed that the change would be mostly unnoticed and that the error message would be enough to push users with more than 5M files in their drive account to do some cleanup.
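To make the "which fraction of users have more than 5M files" step concrete, here is a rough sketch of that kind of check. It assumes you can get a per-account file count from somewhere. The function name `fraction_over_limit` and the sample counts are made up for illustration and say nothing about how Drive actually measures this.

```python
FILE_LIMIT = 5_000_000  # the 5M item cap discussed in this thread


def fraction_over_limit(file_counts, limit=FILE_LIMIT):
    """Return (number of accounts over the limit, fraction of all accounts)."""
    if not file_counts:
        return 0, 0.0
    over = sum(1 for count in file_counts if count > limit)
    return over, over / len(file_counts)


# Hypothetical per-account file counts, not real data.
sample_counts = [1_200, 48_000, 310_000, 5_400_000, 920, 7_100_000, 15_000, 2_000]

over, fraction = fraction_over_limit(sample_counts)
print(f"{over} of {len(sample_counts)} accounts over the limit ({fraction:.1%})")
```

A fraction that looks small can still be a large number of paying customers in absolute terms, which is probably the part that is easy to underestimate.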

This group of people underestimated the impact and maybe chose to not involve product managers. It's also possible that the decision was rushed because of recent layoffs or any other random event which pushed engineers to act quickly.

With the bad press, leadership got involved and the decision to roll back was taken. In my team, we would have a retrospective doc to discuss the issue (not exactly a postmortem, as this process has specific requirements which would not be applicable to this case).

I think this is an easy mistake to make even with very good intentions, and I can see myself doing it.

2 comments

> I think this is an easy mistake to make even with very good intentions, and I can see myself doing it.

Hard disagree. You missed one very important part in your writeup - at no point did they communicate that they were imposing this limit, and this limit appeared, undocumented, overnight.

I was someone who was directly impacted by this change. We're a 40 person company who used (past tense) GDrive as a shared network drive, including for storing builds of our app. We pay $18/person, and as part of that, Google Workspace advertises 5TB per user pooled[0], and nowhere in the Google docs does it mention that this limit would exist [1]. If I had been aware of a limit, we would have cleaned up our old files, but instead we started getting spurious 403s even though, as far as we could tell, we were well within our usage limits. It was only when this post (https://issuetracker.google.com/issues/268606830?pli=1) hit HN that I realised what was wrong.

[0] https://workspace.google.com/intl/en_us/pricing.html

[1] https://support.google.com/a/users/search?q=Drive%20limits

> at no point did they communicate that they were imposing this limit, and this limit appeared, undocumented, overnight.

Not to defend google but I've seen plenty of engineers make such mistakes, and you probably have as well; it's just that it didn't then result in bad press.

When you are an engineer working on a living product, and you identify some performance-related issue, changes you make to the product can easily be classified as bugfixes. For example, you identified an end point that should have a rate limit and didn't; you fixed it, it was a potential security issue, it didn't need communication to end users... as far as you knew, even if you misjudged.

Strong XKCD1172 vibes here. https://xkcd.com/1172/

> When you are an engineer working on a living product, and you identify some performance-related issue, changes you make to the product can easily be classified as bugfixes. For example, you identified an end point that should have a rate limit and didn't; you fixed it, it was a potential security issue, it didn't need communication to end users... as far as you knew, even if you misjudged.

Sure, and the vast majority of companies publicise changes that affect customers. In fact, Google do it quite regularly. If you roll out a customer-impacting change, even if it affects only a small number of customers, you communicate it. Docker's recent (mis)communication is a good example of what's required. If you can't take the heat, get out of the kitchen.

I've yet to see Google acknowledge that they did this, or that they rolled it back. Even this HN topic is about a tweet that says:

> We recently rolled out a system update to Drive item limits to preserve stability and optimize performance. While this impacted only a small number of people, we are rolling back this change as we explore alternate approaches to ensure a great experience for all.

i.e. not that they imposed an undocumented limit.

> was a potential security issue, it didn't need communication to end users...

If there's a security issue in a customer facing part of a product, and you change that part to introduce a limit, you communicate that you've done that. Coming in one day to find out that the rate limits have changed and you've not been notified about it is a sure fire way to piss off a whole bunch of people.

> For example, you identified an end point that should have a rate limit and didn't; you fixed it, it was a potential security issue

That sounds careless. Any such change would need to have an impact analysis (which should be part of the team/org/company's SDLC). In this case, communication should be sent out to the clients of that endpoint, with a reasonable deadline, before enforcing any rate limit.

You see this if you trawl through kernel logs or any piece of enterprise software; pages and pages of warnings like this:

BOBFLANGLE IS DEPRECATED AND MAY BE REMOVED, PLEASE REDUCE THE BOBFLANGLE USAGE BELOW 1.5 MILLIBOBS

Then if it actually becomes an issue, you can pull logs from thousands/millions of systems, determine the impact of actually removing the BOBFLANGLE, and begin mitigation.
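As a minimal sketch of that warn-first, enforce-later pattern: the check below only logs a warning while a grace period is running and starts rejecting once the deadline has passed. The function name `check_item_limit`, the logger name, and the dates are all hypothetical; this is one way to do it, not a description of what any real service does.

```python
import logging
from datetime import date

logger = logging.getLogger("limits")  # hypothetical logger name


def check_item_limit(item_count, limit, enforce_from, today=None):
    """Return True if the operation is allowed.

    Before `enforce_from`, going over the limit only logs a warning.
    From `enforce_from` onwards, going over the limit is rejected.
    """
    today = today or date.today()
    if item_count <= limit:
        return True
    if today < enforce_from:
        # Grace period: allow the operation but leave a trail in the logs.
        logger.warning(
            "item count %d exceeds upcoming limit %d, enforced from %s",
            item_count, limit, enforce_from.isoformat(),
        )
        return True
    return False


enforce_from = date(2030, 1, 1)  # hypothetical deadline

# During the grace period: allowed, but a warning is logged.
print(check_item_limit(6_000_000, 5_000_000, enforce_from, today=date(2029, 6, 1)))

# After the deadline: rejected.
print(check_item_limit(6_000_000, 5_000_000, enforce_from, today=date(2030, 6, 1)))
```

The warnings collected during the grace period are what let you count who would actually be affected before anything starts failing.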

To add to the point: 5M files on 5TB storage would average 1MB per file

So just storing 5TB of average web sized images would hit the limit, let alone smaller documents.
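The arithmetic, spelled out as a quick sketch. It assumes decimal units (1 TB = 10^12 bytes) and uses the 5TB and 5M figures quoted in this thread. The 200 KB average file size is a made-up example, not a measurement.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

quota_bytes = 5 * TB    # pooled storage per user, as quoted in the thread
file_limit = 5_000_000  # the 5M item cap

# Average file size at which the quota and the item cap run out together.
avg_bytes_per_file = quota_bytes / file_limit
print(avg_bytes_per_file / MB)  # 1.0


def quota_used_at_cap(avg_file_bytes):
    """Fraction of the storage quota used when the item cap is reached."""
    return min(1.0, avg_file_bytes * file_limit / quota_bytes)


# Hypothetical average of 200 KB per file.
print(quota_used_at_cap(200_000))
```

Under that 200 KB assumption, the item cap is hit with only about a fifth of the advertised storage in use.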

Is a PM not consulted for changes like this? I feel like this is something that a competent PM would ask eng to pause on while they determined actual user impact and lined up messaging.

Short answer, yes, a PM should have been consulted and given their approval for the change.

In the scenario I described above (again, it's a guess, I don't know what actually happened), it's possible that the engineers bypassed the PM for a reason they thought was good. For example:

* they didn't even think about involving a PM because that's something they had never done,

* the people who wrote/reviewed the change assumed the conversation already happened,

* they were pressured to move fast to mitigate an imminent or existing problem (performance, scalability, ...),

* the PM who should have made this call "left" the company and didn't (get a chance to) hand off their responsibilities.

I guess what I mean with these comments is that sometimes there are misses like this. Maybe it's a sign of a systemic failure and that internal processes should be improved, but I don't think it means that the company is fundamentally unable to handle these changes correctly or can't make the right technical or product decisions.