by tekstar 2396 days ago
Shopify always runs CI on branches before merging to master. Everything this article describes is in addition to that, in order to deal with the problems the article talks about at "merge to master" time, like 2 merged PRs failing or a stale PR that passed on branch but fails on master due to changes.

At this scale you need to be deploying constantly, otherwise deploys are hundreds of commits large and it's impossible to triage: which PR in the deploy broke something, is it even safe to roll back, etc. That is the primary reason to automate deploys and manage the deploy queue.


It smells like a capacity planning error.

What's the minimum residency time to reliably detect problems with my PR? Add deployment time, double to account for jitter caused by humans being humans (forgetful, lunch, meetings, etc), and there probably are not enough hours in the day for 1000 people to be deploying the same monolith.
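To put rough numbers on that, here is a back-of-the-envelope sketch. Every figure in it (10 minutes of residency, 15 minutes per deploy, an 8-hour window, one change per developer per day) is a made-up assumption for illustration, not anything Shopify has published, and the function names are hypothetical.

```python
import math


def deploy_slots_per_day(residency_min, deploy_min, window_min, jitter_factor=2.0):
    """Serial deploys that fit in the window.

    One slot = (residency + deploy time), multiplied by a jitter factor
    to account for humans being humans.
    """
    slot = jitter_factor * (residency_min + deploy_min)
    return int(window_min // slot)


def changes_per_deploy(changes_per_day, slots):
    """Average batch size needed to ship every change within the day."""
    return math.ceil(changes_per_day / slots)


# Hypothetical inputs: 10 min residency, 15 min deploy, 8 hour working window.
slots = deploy_slots_per_day(residency_min=10, deploy_min=15, window_min=8 * 60)

# Hypothetical load: 1000 developers landing one change each per day.
batch = changes_per_deploy(changes_per_day=1000, slots=slots)

print(slots, batch)
```

Under those assumptions you get 9 serial deploy slots a day, so each deploy would have to carry about 112 changes. Shorter residency or parallel deployment units change the answer, which is the point of the next paragraph.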

To increase residency time you can deploy separate units (you can have multiple deployment units even in a monorepo), and those also reduce the surface area of merges.

Honestly, what are they doing with 1000 developers? Duplicated effort goes up considerably with a team and codebase of that size. If you forced me to hire that many people, I'd have a lot of them working on open source, trying to steward feature enhancements that help our process. Because otherwise they'd be running around writing proprietary versions of a bunch of shit that already exists, and in a better, more documented form.

And I'm not even a little surprised:

https://engineering.shopify.com/blogs/engineering/introducin....

Folks, when you hire enough devs, they feel empowered to rewrite the world. I have lived all sides of this phenomenon and rarely is it pretty.

Scaling is a concern that goes in both directions. Shopify has 1000 developers today. How screwed would they be if they suddenly had to drop to 600? Or even if there's a hiring freeze? What happens when the people who wrote these tools go work somewhere else?

When I do tool smithing work these days, it's always with an effort to provide the thinnest of shims around open source or commercial tools with healthy user communities, so that at the end of the day they have a larger pool of resources than what is in house. People move on. Money dries up. Mandates change.

"Being important" in a company is about how much you support new work, not how locked in people are to your old work. If you can't give your old work away then you're shackling yourself, both to your current responsibilities and to the company. I can't believe that I'm the only one who has ever stayed at a company out of guilt for how screwed they'd be if I left. But that quickly turns into resentment which is worse.

If you are important for new work, then you always get new challenges. You stay sharp and your resume looks good. If the company stops doing new work altogether, do you really want to stay there anyway? Plus you could always go back to one of your old projects.

Sorry if my comment was unclear. I consider the queue to be a “branch” as well. Many people use a “develop” branch instead of a queue in this instance. The queue appears designed to allow arbitrary selection rather than merging in order (though the new solution with CD seems generally in order).

Totally agree that CD is required with this many commits. It’s commonplace on teams with many fewer developers. Was surprised to see you folks roll your own workflows rather than using other systems.

Would also be interesting to see if you tag commits that go to master in instrumentation systems so you have visibility into production metrics and can correlate them with what code was running at the time.

Generally our metrics and exception reports are tagged with the sha and the deploy stage.
Good to hear, that’ll make change management less of a chore.
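For anyone wondering what tagging metrics with the sha and deploy stage can look like, here is a minimal sketch. It is not Shopify's implementation: the `DEPLOY_SHA` and `DEPLOY_STAGE` environment variable names, the function names, and the event shape are all assumptions made up for illustration.

```python
import os
import time


def deploy_tags(env=None):
    """Read the running commit sha and deploy stage from the environment.

    DEPLOY_SHA and DEPLOY_STAGE are hypothetical variable names that a
    deploy tool would have to set when it starts the process.
    """
    env = os.environ if env is None else env
    return {
        "sha": env.get("DEPLOY_SHA", "unknown"),
        "stage": env.get("DEPLOY_STAGE", "unknown"),
    }


def tagged_metric(name, value, env=None):
    """Build a metric event that carries the deploy tags alongside the value."""
    return {
        "name": name,
        "value": value,
        "ts": time.time(),
        "tags": deploy_tags(env),
    }


# Example with an explicit environment so the sketch does not depend on the host.
event = tagged_metric(
    "checkout.latency_ms",
    182,
    env={"DEPLOY_SHA": "abc1234", "DEPLOY_STAGE": "canary"},
)
print(event["tags"])
```

With the sha on every event, a regression in a dashboard can be grouped by commit and stage instead of guessed at from deploy timestamps.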

I think the main thing that was missing for me is the rationale behind building this system rather than building a workflow in one of the existing CI/CD tools. Was there a throughput bottleneck in existing tools? Was there something custom about your workflow that wasn’t supported elsewhere? I may be wrong, but the workflow you landed upon seems pretty common, so I’m curious why you needed to build and maintain a tool in house for this.