When I was working as a sysadmin, I kept a spreadsheet. I was told later of a repository of information that supposedly did what my spreadsheet did, but it didn't add anything new and was much harder to keep up.
I built it up using nmap and then shelling into each individual machine and poking around to see what it did. This was back in the days before everything became virtualized, so each machine on the network was likely physical.
I added information by walking the aisles and copying down the rack location of every machine into another page on the spreadsheet. I eventually hooked up a terminal to them all and matched network addresses to physical machines.
Only took a few weeks and when I was done, I knew things about the network that guys who worked at the business for years didn't know.
There's no substitute for the good old-fashioned way.
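For anyone seeding a spreadsheet like that today, here is a minimal sketch of turning scan results into spreadsheet rows. It assumes nmap was run with its grepable output option (`-oG`); the function names and the sample host are made up for illustration, and this is not the workflow described above, just one way to approximate it.

```python
import csv
import io

def parse_grepable(text):
    """Turn nmap grepable (-oG) output into one dict per host with open ports."""
    hosts = []
    for line in text.splitlines():
        if not line.startswith("Host:") or "Ports:" not in line:
            continue  # skip comment lines and "Status: Up" lines
        # Fields are tab-separated "Name: value" pairs.
        fields = dict(
            part.split(": ", 1) for part in line.split("\t") if ": " in part
        )
        address, _, name = fields["Host"].partition(" ")
        open_ports = []
        for entry in fields["Ports"].split(", "):
            # Entry layout: port/state/protocol/owner/service/rpcinfo/version/
            port, state, proto, _owner, service = entry.split("/")[:5]
            if state == "open":
                open_ports.append(f"{port}/{proto} {service}".strip())
        hosts.append({
            "address": address,
            "hostname": name.strip("()"),
            "open_ports": "; ".join(open_ports),
        })
    return hosts

def to_csv(hosts):
    """Render the host dicts as CSV text that a spreadsheet can import."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["address", "hostname", "open_ports"])
    writer.writeheader()
    writer.writerows(hosts)
    return out.getvalue()

# Hypothetical sample line in the grepable format.
SAMPLE = (
    "# Nmap scan (sample)\n"
    "Host: 192.168.1.10 (web01.example.com)\tStatus: Up\n"
    "Host: 192.168.1.10 (web01.example.com)\t"
    "Ports: 22/open/tcp//ssh///, 80/open/tcp//http///, 443/closed/tcp//https///\n"
)

if __name__ == "__main__":
    print(to_csv(parse_grepable(SAMPLE)))
```

The rack location and "what it actually does" columns still have to come from walking the aisles and logging in; the scan only gives you the starting list.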
If you're by yourself, using spreadsheets and nmap is usually fine. If you're working in a team of 5 or 10 or 50 sysadmins, spreadsheets turn into a huge mess. You either have to distribute them via mail etc. after every change, but then you will have concurrent edits that need to be merged manually. Or you put the spreadsheets on a network share with file locking, but then it will always be locked when you want to edit it because someone is working on an entirely unrelated part of the infrastructure.
So you have exactly those sorts of problems that RDBMS are designed to solve. Therefore it makes sense to move to a DCIM system using an RDBMS under the hood, that allows for concurrent edits, and also can be accessed by automation (cronjobs, CI, etc.) via some sort of API (or direct DB read access).
There is an even better alternative. You can put infrastructure information into the same version control repository where your infrastructure code lives, and you can even keep all the benefits of spreadsheets by using plain text format spreadsheets like Org-mode tables.
This means you do not have two sources of truth to maintain (what is in the RDBMS, and how that relates to what is in the infrastructure code repository), the RDBMS system does not have to reinvent versioning, you can see exactly how your infrastructure evolves, you can do atomic changes to both the infrastructure code and the infrastructure information that the code relies on (obviously you need a modern version control system for this), and the infrastructure code can access the infrastructure information in a much more straightforward (and much easier to test) way.
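To illustrate the "much more straightforward" access, here is a rough sketch of infrastructure code reading an Org-mode table straight out of the repository. The column names, hostnames and the `parse_org_table` helper are hypothetical, and it assumes every table row starts and ends with `|`.

```python
def parse_org_table(text):
    """Parse a plain-text Org-mode table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|") or line.startswith("|-"):
            continue  # not a table row, or a rule line such as |---+---|
        # Drop the empty strings produced by the leading and trailing "|".
        rows.append([cell.strip() for cell in line.split("|")[1:-1]])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical inventory table as it might sit next to the playbooks.
INVENTORY = """
| hostname | rack | role     |
|----------+------+----------|
| web01    | A3   | web      |
| web02    | A4   | web      |
| db01     | B1   | postgres |
"""

def hosts_with_role(table, role):
    """Return the hostnames that carry the given role."""
    return [row["hostname"] for row in table if row["role"] == role]

if __name__ == "__main__":
    print(hosts_with_role(parse_org_table(INVENTORY), "web"))
```

Because the table and the code live in the same commit, a test like "every host in the table has a known role" can run in CI before anything is deployed.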
This would become very exhausting if working with very large infrastructures. 80 000 virtual and physical servers? Have fun keeping that data consistent, up to date and available with Org-mode and version control.
I'm not saying your example is wrong, but "there is an even better alternative" doesn't always apply. For smaller scales, sure.
VMs need to be kept track of in whatever system you use for provisioning (AWS, OpenStack), otherwise you now have three sources of truth: what the configuration says should be running, what the DCIM thinks is running, and what is actually running.
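A small sketch of what reconciling those three views could look like. How you obtain each list (config repo, DCIM export, provisioning API) is left out, and the names below are invented for the example.

```python
def reconcile(declared, recorded, running):
    """Compare what config declares, what the DCIM records, and what runs.

    Each argument is an iterable of instance names. Returns the
    disagreements between the three views, sorted for stable output.
    """
    declared, recorded, running = set(declared), set(recorded), set(running)
    return {
        # Configuration says it should exist, but nothing is running.
        "missing": sorted(declared - running),
        # Running, but no configuration accounts for it.
        "unmanaged": sorted(running - declared),
        # The DCIM still lists it, but it is not running.
        "stale_in_dcim": sorted(recorded - running),
        # Running, but the DCIM has never heard of it.
        "unrecorded": sorted(running - recorded),
    }

if __name__ == "__main__":
    report = reconcile(
        declared={"web01", "web02", "db01"},
        recorded={"web01", "db01", "old01"},
        running={"web01", "db01", "tmp-test"},
    )
    for kind, names in report.items():
        print(kind, names)
```

If the provisioning system is the only place VMs are tracked, `recorded` and `running` collapse into one set and half of these categories disappear, which is the point.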
As a given, we yank test the entire world. If it doesn't pass a yank, it straight up doesn't exist.
Whether it's bare-metal, virtualized, para-virtualized, dockerized, mixed-mode, or cloud - we 100% do this all the time. There is not a single change across any environment, that isn't fully tracked, fully reproducible, fully auditable, and fully automated.
what do you mean by "passing a yank test"? i assume "yank test" refers to unplugging the network cable abruptly from the server under test, but what exactly are you looking for when you do that?
A yank test on process and infrastructure is more than a 'did it come up'. It's a "if we totally nuke the thing" - say, were we to rip the hard drives out of a server, fry it, and recreate it - does it come up identical(ish)?
That way we know our CMDB is accurate, our workflows are accurate, credentials, ansible, terraform, images, etc. Right down to tickets.
- configure everything as code (we use Ansible for the infrastructure up to OS level, Kubernetes w/ Helm for applications), have it read the values from the DCIM so that the DCIM remains the single source of truth (we need to still get better on this part....)
I cannot comment on the DCIM side, but I agree on the "everything as code" mantra.
For a relatively small setup I chose a combination of Ansible, Kubernetes and Dockerfiles, but probably any combination will do. All these files are stored in a git repo.
Even after months (or years) neglect, I can easily know what I configured (and why!) and update where needed with a minor effort.
I'm going to mostly disagree with everyone here, much to my karma's detriment ;P
I agree the end-goal should be infrastructure as code, and everyone here has covered those tools well. You also want monitoring across your infrastructure. Prometheus is the new poster-boy here, but the Nagios family, and many other decent OSS solutions exist as well.
But you still need documentation. Your documentation should exist wherever you spend most of your time. Some examples:
* If you spend most of your time on a Windows Desktop, doing windows admin type things, then OneNote or some other GUI note-taking/document program makes sense.
* If you spend most of your time in Unix land (Linux, BSD, etc.) then plain text files on some shared disk somewhere, for everyone to get to, make WAY more sense. Bonus if you put these files in a VCS and treat them like code, and super bonus if your documentation is just a part of your infra-as-code repositories.
* If you spend your time in a web browser, then use a Wiki, like MediaWiki, wikiwiki, etc.
In other words, put your documentation tools right alongside your normal workflow, so you have a decent chance of actually using it, keeping it up to date, and having others on your team(s) also use it.
We put our docs in the repos right alongside the code that manages the infrastructure, in plain text. It's versioned. We don't publish it anywhere, it's just in the repo, but then we spend most of our time in editors messing in that repo.
I totally agree, but having "infrastructure as code" means less documentation.
Instead of documenting all the commands involved in configuring a machine as service X (ssh, run apt-get, paste this, etc.), I have documentation on how to work with the configuration management system (roles in the roles/ directory, each node gets one role, commit to git, open PR, etc.). That documentation is in .md files in the config management source repo.
Instead of documenting how to rack a server (print and attach label to front and back, plug power into separate PDUs, enter PDU ports into management database, etc.), I document Terraform conventions (use module foo, name it xxx-yyy, tag with zzz, etc.).
It ends up being less documentation, as the "code" serves to document the steps taken, so the documentation can be higher level. Or if it isn't less documentation, it is documentation that needs to be updated less often, so hopefully there will be less drift between docs and what actually exists.
Yes, I didn't cover what goes into the documentation, as that is mostly site-specific, but I mostly agree with you... mostly. Instead of documenting run apt-get, ssh, etc to start up service X, now you have to document how your tools are setup, Ansible, Terraform, etc. Plus your code needs documentation about why it's setup the way it is.
You still need high-level stuff, policies, etc. Security guides, none of this has changed.
You also have to document your snowflakes, how you handle the wacky snowflakes, why they exist, etc.
Ideally your documentation should be such that it would pass the hit-by-a-bus test. I.e. if you or your entire team got hit by a bus, someone with a clue could come in, read your documentation and continue.
My docs are not at that stage, but every time I mess about with something I try to read through the docs attached, and verify and add to them, so that hopefully someday we will get there.
Sit down with another sysadmin and have them go through your Terraform repo; if they have to ask more than 3 times why something is done a certain way, your "infrastructure as code" as documentation is insufficient.
More like code usually required extra documentation explaining it in a higher level language, but nowadays we just write the program on that higher level language so this extra documentation has gone away.
Windows sys admin here. OneNote is fantastic for IT documentation. I like that you can drag and drop a screenshot (no uploading to a wiki), store spreadsheets, word docs, PDFs, etc. and easily search for information via the built-in functionality.
We have used it for years and it has worked great for us.
It might be helpful if you described your infrastructure. There is a pretty big difference between managing physical Windows servers in a data center and managing Linux servers all in AWS.
If you are all or mostly cloud, Terraform + config management with a CI pipeline takes care of a lot. Then a wiki that covers "Getting Started" and a few how-to articles.
For physical infra you need the setup for DHCP, updating DNS based on DHCP, PXE boot imaging, IPMI access and configuration, switch and router configuration, what servers are connected to which switch ports, PDU management and monitoring, and on and on and on.
I think it depends a lot on the size of your infrastructure. I've used Excel docs on a shared drive pretty successfully where there's not much to keep up on and changes are few.
In larger infrastructure setups (small service provider) we used a combination of netboot, SNMP for monitoring with Observium and Nagios for alerting. We were also a big VMware environment, so naturally we had a lot of inventory tracking available through vCenter as well. I found a lot of opposition to Configuration Management, given the lack of comfort with programming of some sysadmins (Windows admins), so that's something to keep in mind as well. I think mixed environments also can be challenging w/infrastructure as code, but I'd be interested to see how others get through that.
The past decade has been interesting and I'm still processing it.
My current thoughts are that an appropriate approach is for your systems to document themselves via the applications that they run - inside out.
Though I must abide I cannot fully subscribe to "infrastructure as code" anymore. It has proven just another shift, primarily in toolsets and who (or what) gets say and sway over the capacity, capabilities and efficiencies of the thing you actually care about - the app stack and all of its assembled functionality.
In other words most approaches are still "outside in" - one defines 'x' for deploy fitments and that typically over and over and over again and, typically, with a rigidity that can too easily override and overrule effectively caging your application in scale and scope. With my current tack I am trying to provide for 'y' to "self identify" (via some/any form of config mgmt) where from here you can begin to effectively "deploy to any" by hooking the "application config as code" that, in turn, defines its infrastructure and deploys "outward". The "infrastructure as code" then becomes the servant with its objects and platform definitions etc. and the "appconfig as code" becomes the master where the latter defines its own scope and scale.
Infrastructures have a funny way of mutating into inefficient "definitions" of something that once made sense, on the first day, and forevermore complicating progress with capacity, rules and opinions.
But, generically, snmp is still pretty cool for telling me what I need to know. Strapped that into any end engine and, boom, ask any question, request any inventory.
So.. I track apps, not systems. Systems are expendable, applications are not.
There are several classes of "infrastructure" as a sysadmin; legacy, new and critical.
Legacy stuff is done the old fashioned way - portscans and nmap. If it has an open port, it's presumed to be intentional. If not, it's a target. I've seen some success using tools like Pysa to "blueprint" existing systems into Puppet code. Tools like SystemImager help here, too - enabling P2V and the creation of "file-based images" compatible with version control and able to PXE boot new clones.
New stuff is from-scratch IaC all the way to the metal. Ansible and git submodules help me build "sandwiches".
Critical stuff blurs the lines. The machines, IP addresses, ports and living connectivity can be documented, and "captured" to a limited extent with the manual mapping and Rsync stuff in the Legacy category. Some of this critical stuff is also "new", and is deployed in that fashion.
What about switchgear and Cisco configs? License strings, key management, site-specific patching - all can complicate things.
More important than any of these is the ability for you and those around you to see and manage the systems as they are launched and terminated.
In the old days, I used to use a shell script on a newly-provisioned host to dump all its details - dmidecode, environment stuff and so on. Those details were pushed back to a common source and were a real benefit in the days before real config management came on the scene. CFEngine was way too complicated and nebulous at the time.
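A rough modern equivalent of that dump script, as a sketch only. The original was a shell script; this version is Python and sticks to the standard library, so it skips dmidecode (which needs root) and leaves the "push back to a common source" step out. The function names are made up.

```python
import json
import os
import platform

def collect_facts():
    """Gather basic host details using only the standard library."""
    return {
        "hostname": platform.node(),
        "os": platform.system(),
        "kernel": platform.release(),
        "arch": platform.machine(),
        "cpu_count": os.cpu_count(),
        # Names only: environment values often contain secrets.
        "env_vars": sorted(os.environ),
    }

def facts_as_json(facts):
    """Serialize the facts so they can be shipped to a central store."""
    return json.dumps(facts, indent=2, sort_keys=True)

if __name__ == "__main__":
    print(facts_as_json(collect_facts()))
```

Run it once at provisioning time and store the output somewhere shared; even without config management, you then have a record of what each box looked like on day one.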
For me/us, it's a combination of infrastructure-as-code and metrics reporting/logs. Most of our boxes are swapped out on a weekly or more frequent basis, so the only accurate picture of what's running right this moment is the graphs built by the metrics collection tools. The only accurate picture of what's running on those boxes is the code which built the infrastructure.
There are a couple of exceptions, but those are actively being brought under the above model (mostly because they are effectively invisible, and the existing documentation for them is... incomplete).
Any documentation outside of that is stale in a few hours, and obsolete in a week.
Back when I was put in charge of IT Lifecycle management for my Army unit (not by choice - "Hey, you've got a CS degree, so anything tech related goes to you"), I kept it all in an Access Database, and ran off a report occasionally to update my smartbook (3-ring binder full of stuff that my boss would frequently ask about during meetings). Granted this was back in the early 00's.
As a professional sysadmin, my go to reference on this is "Documentation Writing for System Administrators", from the Short Topics in System Administration series.
I've used https://www.racktables.org with pretty good luck. It's PHP, which wouldn't be my first choice, but I've largely been able to make it do what I want.
If you want something more clever, say keeping track of asset values etc., you'll want a CMDB. Google around and you should find something that fits your needs. We used ServiceNow in a previous life.
We put everything in code. We have several layers, but if you're new you can start with the lowest level and make your way up to find out how things are provisioned and configured.
We're on AWS so we use cloudformation for provisioning and saltstack (https://saltstack.com/) for configuration management. Cloudformation templates are written using stacker (http://stacker.readthedocs.io/en/stable/). All AWS resources are built by running "stacker build" so nothing is done by hand. We have legacy resources that we're slowly moving over to Cloudformation, but more than 90% of our infrastructure is in code.
On top of cloudformation and salt we built jenkins (CI and docker image creations), spinnaker (deployment pipeline), and kubernetes (deployment target). The jenkins and spinnaker pipelines are also codified in their own respective git repos.
All the repos here have sphinx setup for documentation purposes and the repos tend to crosslink for references.
So, one problem I’ve seen with most infrastructure as code solutions and CMDBs is that while they do a good job at the tactical level (more or less), and help you answer “how”, “where”, “what”, and maybe “when” questions (depending on how well they support orchestration), they typically do a bad job at the higher level strategic “why” questions.
So, why do you structure your lambda jobs accessing CloudWatch Logs that way as opposed to the other way? If you didn’t know that one way works and the other doesn’t, you wouldn’t be able to understand that question. And that might have domino effects on other parts of your system.
I haven’t found a good solution to documenting the high level strategic “why” questions, other than to just write down the questions and the answer, with reasoning, in some form of associated documentation — maybe in a wiki or something. But, of course, the underlying issues may change in the near future and invalidate the reasons for your decision. And the high level documentation doesn’t have any way to be compiled directly into the lower level implementation, so of course there is always the risk of drift.
I’m still looking for good solutions in this space.
VMware's tagging support is a lighter, more realistic option vs a "CMDB".
Come up with a key/value strategy that covers your need to track things like app name, app category, environment (test, dev, load testing, prod, prod/dmz, etc), and it becomes actually usable and up to date versus an always out-of-date CMDB. And it's compatible with cloud resource tagging.
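A tagging strategy only stays usable if something checks it. Here is a minimal sketch of such a check; the required keys and allowed environment values are my guesses based on the list above, not a VMware or cloud API, and `tag_problems` is a hypothetical helper.

```python
# Assumed convention: every resource carries these three tags.
REQUIRED_KEYS = {"app", "category", "environment"}
# Assumed spelling of the environments named above.
ALLOWED_ENVIRONMENTS = {"test", "dev", "load-testing", "prod", "prod-dmz"}

def tag_problems(tags):
    """Return a list of human-readable problems with one resource's tags."""
    problems = [
        f"missing tag: {key}" for key in sorted(REQUIRED_KEYS - set(tags))
    ]
    env = tags.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems

if __name__ == "__main__":
    print(tag_problems({"app": "billing", "environment": "staging"}))
```

The same function can be pointed at tags pulled from vCenter or from a cloud provider, as long as both are exported as plain key/value dicts.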
Spinning up new infra: Jenkins crafts Terraform tfvars based on user input, runs plan, asks for confirmation, applies. Terraform state and vars saved to S3. Chef and Ansible for provisioning.
"Documentation", in terms of where stuff is deployed and what is deployed is not really necessary. We save this data to a DynamoDB table, query-able by AWS Lambda functions, so other automation can pick it up and devops can query data.
Documentation on how things work comes from dev teams, on how things are deployed indeed comes from us, just simple wiki pages.
Services running in Kubernetes, K8s worker instances in auto-scaling groups. If one node dies it is killed and brought up, K8s will reschedule the pods. Same for the pods themselves.
Monitoring through Nagios(getting phased out finally), NewRelic and Prometheus. Basic ELK stack for centralized logs.
Thinking about rolling out Vault for credential management. Chatops on the pipeline (getting pieces in place first, like the db mentioned earlier)
I'm trying to get the company on board on immutable infrastructure, but it is proving difficult.
Like many here, I keep it described in ansible and documentation inside a git repository.
But I feel like it's lacking. After a while you have so many ansible playbooks and roles that they cannot give you a birds-eye view anymore.
I think I would MUCH prefer to have some sort of HTML representation, where adding an instance/service starts by adding to that representation, and you could click on every link or node to show its golden image setup, ansible configuration, etc.
We are using Rundeck to get that. I spent a year or more basically being the only one in the company who could do Ansible runs, for various reasons, and Rundeck is the web UI we ended up setting up for the other part-time ops people and devs to use. I also looked at StackStorm, but its free version didn't have user accounts, which made it a non-starter for me. Plus, you really only interacted with it via chat, not a web page.
Ansible Tower lets you execute a playbook via a web GUI, and keeps a log of who executed what.
I'm not sure if it also shows some infrastructure graphs, but I'm talking about knowing if links are up, how they are firewalled, where the config for each thing is, etc.
When you host tens of services on hundreds of machines, this information is hard to get a grasp on, no matter what you do or how well you documented everything, because it takes a while to read through it.
By having your infrastructure defined in version control using some sort of domain specific language. For example, by using Terraform and only ever making changes to your infrastructure via Terraform (manual adding/editing of stuff in AWS/GCP console should be disabled so people can't do that). Then all changes to the infrastructure are clearly documented in version control with pull requests.
Aligning a VMware tagging strategy with a cloud tagging strategy is one of my current goals. Things like a full blown CMDB seem to always end in pain, lag, and orphaned records. I'm happy enough with something basic that spans on-prem + cloud.
Can I piggyback and ask how people keep track of deployed software? Like if I have 50 products deployed some of which haven't been touched in 10 years and I want to be able to ramp up a developer to fix a bug on any of them?
In the medical device industry you keep what's called a device history file which tracks the configuration of each device you've sold by serial number. This DHF is meticulously updated whenever something is changed. If someone reports an issue this is information you can use to scope your initial reproduction.
anyone have any pointers for simple, API-driven management of DNS/DHCP?
(like, I don't want to have to configure 1000 moving parts)
typically this seems to fall into the 'roll your own' or 'giant lumbering enterprise behemoth' category that does 10 other things. I'm looking for the sweet spot.
At any reasonable scale you typically wouldn’t use plain DNS if you have to do that kind of configuration. It would be done with a service discovery service which handles SRV records.
That being said route53 has a reasonable management API.
thanks - should have mentioned specifically not looking at cloud services
(e.g. self hosted, but without needing 5 different polyglot microservices and a service management layer and 32GB of ram just to keep the whole mess running)
I see 0% need for this complexity in many cases on the presentation side - and if faster response is required internally, the same API IF can be used for service discovery or side-chain announcements, etc can be bolted on on a per-application basis if desired.
I also see 0% need for this to be a cloud exclusive domain - e.g. hybrid scope/location deployments, etc.
"Simple" is in the eye of the beholder, but if you're looking for dedicated software for DNS and DHCP that is manageable via API, look into PowerDNS and Kea.
however this doesn't solve the dhcp side of things.
specifically not looking at SaaS since I want consistent deployment flexibility and potential for mixed scale/scope/environment deployments (devbox, lan, wan, mixture, yadda)
I'm not sure I understand your question fully? You write documentation, like you do anything. And configure everything with code, so you can go read it (Terraform, Chef/Puppet/Ansible, etc).
I use SCC (System Configuration Collector) to document our servers. Everything else is just a collection of grep-able text files on our management server.
https://sourceforge.net/projects/sysconfcollect/
I liked that job, it was fun.