by chidog12 1283 days ago
"I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem."

If you are a dev on the team that owns that service, then it's your and your team's responsibility to answer all of these questions... Even the org's SOP would end up reaching back to the team who owns the service if problems arise...


I hate separate infrastructure teams with a passion for this reason.

Far too frequently you end up in a situation where someone makes an environment change and blows everything up because they have no understanding of the services they're stewarding.

If you want me to take responsibility, my team should be managing the service end to end.

I feel really strongly against this division of responsibility in software teams. It too often leads to holding up progress and hostile interactions due to each team pursuing their own priorities.

> If you want me to take responsibility, my team should be managing the service end to end.

This. I really do not enjoy being called up in the middle of the night to walk a group of people that know absolutely nothing about the system through the steps they need to resolve the issue, because nobody wants to give the “dev” team access to the production environment.

I think the solution that best aligns incentives is the one where the people introducing issues are also the ones called up (and able) to fix them.

Ah, developers empowered to do operations. We should have a catchy name for it... "opsdevs"? :P

Seriously, this is the original idea of the DevOps principles. But they run straight into the CIS requirement that "developers do not have access to production code" and the ISO 27001:2013 requirement of separation of responsibilities. So it'd be great if it happened; it just can't happen in the big B2B spaces.

We allow devs to do things in prod. We are a public company. SOX, HIPAA, ISO 27001, GDPR, and all that. Every dev on my team has access to their prod servers and databases (but usually no access to other teams' stuff). We deploy multiple times a day. We handle our own oncall. We process billions of individual requests daily for millions of users. We have several thousand employees.

Our compliance requires that all code be reviewed and pass quality assurance before merging and that all prod changes be documented.

That means Dev1 writes the code and the unit and integration tests, sets the right configs in each environment, updates the dashboards for any updated metrics, sets up alerts, and updates runbooks. Dev2 reviews the work, pushes back when any of the above needs more work, and then documents on the Jira ticket how they verified stuff. Dev1 or Dev2 merges the code, observes the build, and ensures the code rolls out to prod.

When something goes wrong, the oncall dev on the team is paged and can access all prod systems, and can log in, start and kill things, move files, etc.

All counsel these days from the 'DevOps' or 'SRE' bodies of knowledge is: development and operations are two sides of the same system, and they should be integrated better. Companies: got it, create a new title/team in charge of this integration. Seriously?

Agreed. In a previous job people shipped garbage code frequently, and when there was a problem they didn't want to hear about it, because "everyone owned the code".

How else would you get promoted? Get with the program.

When you give responsibility to the teams themselves, the result is O(1)-sized problems becoming O(teams)-sized problems.

  > How can I check the health of the service?
  In the definition of service, you define a field for
  health check script.

  > How can I safely and gracefully restart the service?
  This will exist within the script used to push new code.

  > Does it has any external dependencies?
  This could be defined in the service configuration and 
  used for setting up integration tests and automatically 
  generating a dependency dashboard.

  > Do you have a playbook, or sequence of steps, to bring
    the service back up?
  You could add a field to the service definition to
  automatically generate a dashboard and include the
  playbook link at the top of the page.

  > Do you use appropriate logging levels depending on the
    environments?
  Production could be extremely opinionated about what
  acceptable logging looks like, forced via code review. Log
  level could be defined in service config.

  > Are you logging to stdout?
  Why would any production service get to choose?
  Service owners shouldn't be able to log into machines.
  
  > Are you measuring the RED signals?
  Required fields in service config that could be used to
  generate a service dashboard.

  > Is there any documentation/design specification for the
    service?
  Required config field.

  >  Are you using gRPC or REST?
  Trivial grep.

  > How does the data flow through the service?
  This is complicated, but can probably be easily replaced
  by asking what state your service keeps and how it's
  stored. This is the only question I think the author
  should/needs to ask.

  > Do you have any PII/Sensitive data flowing through the
    service?
  While this question is important, this is one of the
  problems that has to be a particular person's
  responsibility. Any dev that answers anything but
  "probably not, but I don't know" shouldn't be trusted.

  > What is the testing coverage for this service?
  Some form of this would exist in a service config.
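
As a minimal sketch of the "required fields in service config" idea above: tooling could refuse any service whose config lacks the fields production insists on. The field names (`health_check_script`, `playbook_url`, `red_metrics`, and so on) and the allowed log levels are assumptions made up for illustration; the thread does not name them.

```python
from __future__ import annotations

# Hypothetical required fields and their expected types (not from the thread).
REQUIRED_FIELDS = {
    "health_check_script": str,
    "dependencies": list,
    "playbook_url": str,
    "log_level": str,
    "red_metrics": list,
    "design_doc_url": str,
}

# Hypothetical production policy on log levels.
ALLOWED_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def validate_service_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in config:
            problems.append(f"missing required field: {field}")
        elif not isinstance(config[field], expected_type):
            problems.append(f"{field} should be of type {expected_type.__name__}")
    level = config.get("log_level")
    if isinstance(level, str) and level not in ALLOWED_LOG_LEVELS:
        problems.append(f"log_level {level!r} is not allowed")
    return problems
```

A check like this could run in code review or CI, so most of the questions above are answered by the config rather than by paging the owning team.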

I don't think the question of responsibility is as simple as "it's the team's problem."

Hey, I see “service config” referenced a lot in this thread, but your answer has the most occurrences.

I’m not sure I follow what it is.

A technical construct, like a code template or an API that services implement?

Or a process construct, like an SOP to follow with checkboxes?

Thanks

Succinctly: A service config is the authoritative source of truth for what a service is, in a format that can be (is) consumed by tooling.

A lot of software development is about generating abstractions.

"Service" is a possible abstraction someone might want to generate and develop.

I think a service abstraction can be defined by:

  A blob of code
  A set of machines to run it on
  A way to stop and start it
  A method to load balance to it

So it would make sense to create a YAML config file committed to a repo containing something like:

  services:
    [
    { 
      name: "CoolAppServerName.prod",
      build_script: "./bin/buildCoolAppServerName.py",
      start_script: "./bin/startCoolAppServerName.py",
      stop_script:  "./bin/stopCoolAppServerName.py",
      hosts: [
        "host_1",
        "host_2",
      ],
      slb_name: "CoolAppServerName.prod",
    },
    {...},
    ]
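
A rough sketch of how tooling might consume that file, assuming it has already been parsed into a plain Python dict with the same keys. `ServiceDefinition`, `parse_services`, and `rolling_restart_plan` are hypothetical names, and the plan only lists steps rather than executing anything on hosts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDefinition:
    name: str
    build_script: str
    start_script: str
    stop_script: str
    hosts: tuple[str, ...]
    slb_name: str


def parse_services(raw: dict) -> dict[str, ServiceDefinition]:
    """Index service definitions by name, rejecting duplicate identifiers."""
    services: dict[str, ServiceDefinition] = {}
    for entry in raw["services"]:
        service = ServiceDefinition(
            name=entry["name"],
            build_script=entry["build_script"],
            start_script=entry["start_script"],
            stop_script=entry["stop_script"],
            hosts=tuple(entry["hosts"]),
            slb_name=entry["slb_name"],
        )
        if service.name in services:
            raise ValueError(f"duplicate service name: {service.name}")
        services[service.name] = service
    return services


def rolling_restart_plan(service: ServiceDefinition) -> list[tuple[str, str]]:
    """List (host, script) steps that stop and then start one host at a time."""
    plan = []
    for host in service.hosts:
        plan.append((host, service.stop_script))
        plan.append((host, service.start_script))
    return plan
```

Rejecting duplicate names is what keeps the name usable as the single identifier that the rest of the tooling keys on.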

Once you have a definition, it can be extended to meet growing needs. You might choose to do something like:

    { 
      name: "CoolAppServerName.prod",
      key_metrics: [
        "CoolAppServerName.prod.5xx",
        "CoolAppServerName.prod.latency_percentiles",
      ],
      owner: "CoolTeam",
      ...,
    }

And then you could generate a webpage with a dropdown where "CoolAppServerName.prod" is an option, and a dashboard including graphs for the time series metrics "CoolAppServerName.prod.5xx" and "CoolAppServerName.prod.latency_percentiles" automatically shows up. Maybe instead of having service names in the dropdown you have owner names in the dropdown.
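
A minimal sketch of the dropdown idea, assuming the service entries are available as Python dicts with the `name`, `owner`, and `key_metrics` keys shown above. `build_dropdown` is a hypothetical helper; it only computes which metrics each dropdown option should graph and leaves rendering to whatever dashboard tool is in use.

```python
from __future__ import annotations

from collections import defaultdict


def build_dropdown(services: list[dict], group_by: str = "name") -> dict[str, list[str]]:
    """Map each dropdown option to the metric names graphed for it."""
    dropdown: dict[str, list[str]] = defaultdict(list)
    for service in services:
        # Create the option even when the service lists no metrics yet.
        metrics = dropdown[service[group_by]]
        for metric in service.get("key_metrics", []):
            if metric not in metrics:
                metrics.append(metric)
    return dict(dropdown)
```

Calling it with `group_by="owner"` gives the per-team view instead of the per-service one, with no change to the config itself.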

You could potentially write some code that attempts to validate no significant changes in those metrics and use it to automatically verify that newly pushed code didn't take down the website.
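
One very simple way such a check could look, as a sketch only: compare the mean of each key metric before and after the push and flag anything that moved by more than a relative tolerance. The function name, the 20% default, and treating each metric as a flat list of samples are all assumptions; a real check would likely need something more robust than a mean.

```python
from __future__ import annotations

from statistics import mean


def significant_changes(
    before: dict[str, list[float]],
    after: dict[str, list[float]],
    tolerance: float = 0.2,
) -> list[str]:
    """Return metrics whose mean moved by more than `tolerance` (relative).

    Assumes every metric in `before` has at least one sample. A metric with
    no samples after the push is flagged as well.
    """
    flagged = []
    for metric, baseline_samples in before.items():
        current_samples = after.get(metric)
        if not current_samples:
            flagged.append(metric)
            continue
        baseline = mean(baseline_samples)
        current = mean(current_samples)
        if baseline == 0:
            changed = current != 0
        else:
            changed = abs(current - baseline) / abs(baseline) > tolerance
        if changed:
            flagged.append(metric)
    return flagged
```

A push pipeline could roll back automatically whenever this returns a non-empty list for the service's `key_metrics`.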

Service config means creating an authoritative service identifier (authoritative because it's the only identifier used in tooling) and then attaching a configuration to it.

Facebook and Google have (or at least at some point had) Tupperware and Borg, respectively, which are basically custom versions of the above extended for their infrastructures.

I see, thanks for the detailed answer.

That strongly reminds me of solutions à la Kubernetes.

Where you define the entry point, health check, etc.

A tad more abstract, and larger (AFAIK, k8s doesn't care how your code is built, for instance).

Never heard of Tupperware. Loosely aware of Borg.

Again, I appreciate the time.

When Kubernetes was released, it was thought it would be a successor to Borg, if not the key components of Borg itself, IIRC. From https://en.wikipedia.org/wiki/Kubernetes:

  The design and development of Kubernetes was influenced by 
  Google's Borg cluster manager. Many of its top contributors
  had previously worked on Borg;[15][16] they codenamed Kubernetes
  "Project 7" after the Star Trek ex-Borg character Seven of Nine[17]
  and gave its logo a seven-spoked wheel.

There was a lot of early skepticism about it because it was not Borg. My understanding is that Borg is so integrated into Google tooling that it would have been impossible to generalize.

I haven't used it myself yet because a few of the senior engineers (from Google/FB) I respect said "absolutely not in our infra."

What are you using instead, and what are the main criticisms of Kubernetes from your seniors?

> Do you have any PII/Sensitive data flowing through the service?

> While this question is important, this is one of the problems that has to be a particular person's responsibility. Any dev that answers anything but "probably not, but I don't know" shouldn't be trusted.

GDPR makes it the responsibility of the organisation to know. You can't safely say "I don't know" about PII.

And if an organization wants to know, then they must make a single individual responsible. "Organizational responsibility" means that no one is responsible.

It is important to have one person know the answer, rather than making your devs "guess" the answer. "The devs we asked said there wasn't misuse of PII" is not at all a good guarantee that PII is not abused or lost.

The organization cannot know unless there is an individual who knows.