by nerdponx 1689 days ago
The problem I see is that someone will inevitably update the procedure (or make a change that unknowingly requires a change in the procedure) and not update the script. Either because they are pressed for time or because they forgot. Same as any other documentation.

The solution ultimately is for PMs to get it into their heads that software and infrastructure require maintenance like anything else, and consistently refusing to schedule time for software/dev-tool maintenance (such as updating documentation) has the same effect as refusing to schedule time for physical equipment maintenance. Then and only then do engineers have the freedom to set up mandatory procedures and checklists for their work, the way all engineers should be allowed and encouraged to do.


> The problem I see is that someone will inevitably update the procedure (or make a change that unknowingly requires a change in the procedure) and not update the script

Why would your procedure be to do anything _other_ than "run script foo and do what it says"? If your procedure is not that, then your procedure doesn't reflect reality, and thus is outdated documentation that needs to be updated.

If the steps of the procedure only exist within the script, then there's only one place to update it. And yes, this suggests the script should be very readable.

> If your procedure is not that, then your procedure doesn't reflect reality, and thus is outdated documentation that needs to be updated.

Configurations change all the time. There is no technological safeguard against someone forgetting to write down the change in the playbook script; it has to be organizational.

Declarative configuration management systems solve this by unchanging your configuration after someone messes with it manually. :) Hard to forget to change the automation when it persistently undoes all your hard labour.
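To make the "undo" behaviour concrete, here is a minimal sketch of a converge step. It is not how Puppet or Ansible are implemented; `enforce_file` is a made-up name, and a real tool would also handle ownership, permissions and reporting.

```python
from pathlib import Path


def enforce_file(path: Path, desired: str) -> bool:
    """Make `path` contain exactly `desired`.

    Returns True if the file had to be (re)written, False if it already
    matched. Hypothetical helper, for illustration only.
    """
    try:
        current = path.read_text()
    except FileNotFoundError:
        current = None  # a missing file counts as drift too
    if current == desired:
        return False  # already in the declared state, nothing to do
    path.write_text(desired)  # overwrite whatever was changed by hand
    return True
```

Run on a schedule, a second call right after the first does nothing, while any manual edit in between gets overwritten on the next run.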

You can help solve the problem with technology, you just have to make the solution easier than working around it.

> Declarative configuration management systems solve this by unchanging your configuration after someone messes with it manually

Not always. There are frequently ways to do an "end-run" around tools like Puppet and Ansible. Take for example the following list of /etc/*.d directories on a Red Hat distribution:

/etc/bash_completion.d

/etc/binfmt.d

/etc/chkconfig.d

/etc/cron.d

/etc/depmod.d

/etc/dracut.conf.d

/etc/gdbinit.d

/etc/grub.d

/etc/init.d

/etc/krb5.conf.d

/etc/ld.so.conf.d

/etc/logrotate.d

/etc/lsb-release.d

/etc/modprobe.d

/etc/modules-load.d

/etc/my.cnf.d

/etc/pam.d

/etc/popt.d

/etc/prelink.conf.d

/etc/profile.d

/etc/rc0.d

... <snip> ...

/etc/rc6.d

/etc/rc.d

/etc/rsyslog.d

/etc/rwtab.d

/etc/statetab.d

/etc/sudoers.d

/etc/sysctl.d

/etc/tmpfiles.d

/etc/xinetd.d

/etc/yum.repos.d

Someone can manually log on to the environment and drop additional configuration files into those directories that vastly affect what is run on the system (and, in the case of cron.d for example, when it is run).

"Idempotency" tools like Puppet and Ansible are very good at saying, "this file should exist in this directory with this MD5 hash", but not as good at saying "this directory shouldn't contain anything except these files".

Of course you can list out all the files that you consider to be valid in the above directories, along with their signatures, but that's going to break the next time Red Hat pushes an update that installs or removes files in those directories.
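For what it's worth, the "nothing except these files" check is easy to sketch; the hard part is keeping the list current. A minimal version, assuming a hand-maintained allow-list (`unmanaged_files` and the file names in the usage note are hypothetical):

```python
from __future__ import annotations

from pathlib import Path


def unmanaged_files(directory: Path, managed: set[str]) -> list[str]:
    """Return the names in `directory` that are not on the allow-list.

    `managed` is the set of file names we expect to find there; anything
    else is reported, whether it came from a person or from a package.
    """
    return sorted(
        entry.name for entry in directory.iterdir() if entry.name not in managed
    )
```

Something like `unmanaged_files(Path("/etc/cron.d"), {"backup", "rotate"})` would flag every other file in that directory, including ones a package update legitimately added, which is exactly the breakage described above.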

I guess you could set up an audit script that checks that all the files in those directories match the expected RPM signatures, and then account for any local customisations (additions, removals, changes etc.). But you are starting to get into a lot of extra work there.
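A rough sketch of the "additions" half of such an audit, assuming `rpm` is on the PATH. It relies on `rpm -qf` exiting non-zero for a file that no installed package owns; `local_additions` and the `allowed` set are hypothetical names. Detecting changed or removed package files would need something like `rpm -V` on top of this.

```python
from __future__ import annotations

import subprocess
from pathlib import Path
from typing import Callable


def owned_by_rpm(path: Path) -> bool:
    """True if some installed package claims `path` (`rpm -qf` exits 0)."""
    result = subprocess.run(
        ["rpm", "-qf", str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def local_additions(
    directory: Path,
    allowed: set[str],
    is_owned: Callable[[Path], bool] = owned_by_rpm,
) -> list[Path]:
    """Files that are neither package-owned nor an accepted local customisation."""
    return sorted(
        p
        for p in directory.iterdir()
        if p.name not in allowed and not is_owned(p)
    )
```

The `is_owned` parameter is only there so the ownership check can be swapped out, for example on a machine without rpm.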

The point I am making is that these tools are not as forcibly idempotent as a lot of people assume.

Of course; no tool is perfect. But in the general case, they're good enough, and they do help.

For example, I manage nodes with Puppet, and Puppet can and will "clean" things like sudoers.d, yum.repos.d, nginx.conf.d etc. of files that it does not manage.

I don't do this for every possible directory because, so far, configuration drift in those has not been a problem, and generally whatever comes from the packages by default functions fine. Crucially, the system can be rebuilt from scratch using the configuration that is managed, so the important bits are there.

I will simply start managing more directories as needed.

A script that compares the md5 or timestamp of each configuration file against the md5 or timestamp of the log entry responsible for that file can do that.

I mean, if /etc/hosts is more recent than /log-directory/03-static-hosts-in-etc, or its md5 differs from the one you have recorded for this file, a daemon can easily create a ticket or send an email to whoever was logged in at the time of the change.
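Something along these lines, as a sketch only. The function names are made up, and the ticket/email part is left out since it depends entirely on what you run:

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hex md5 digest of the file's current contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def has_drifted(config: Path, recorded_md5: str) -> bool:
    """True if the file no longer matches the digest written down for it."""
    return md5_of(config) != recorded_md5


def is_newer(config: Path, log_entry: Path) -> bool:
    """True if the config was modified after its log entry was last touched."""
    return config.stat().st_mtime > log_entry.stat().st_mtime
```

A daemon would call `has_drifted` or `is_newer` for each tracked file and raise the alert when either returns True. The timestamp check is the weaker of the two, since mtimes can be reset by hand.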