A Systems Policy
Recently I talked to a couple of friends, who all complained quite a bit about their operations or internal IT departments.
Most of these teams had to struggle with very basic things: they lacked decent monitoring, or any monitoring at all; they didn’t deploy systems, they installed them by hand; systems were not documented; and so on.
So here are some guidelines I try to follow with my team. This is by far not a complete list of what it takes to run successful operations, but it should give you a fair idea.
Also please note that you might want to adapt this policy to fit your own needs. I’m coming from the web industry, but we still run our own hardware, so this might not fit a typical cloud-based infrastructure.
A system is considered the lowest-level part of our infrastructure and services. All rules defined here should be considered in all other policies.

A system…
- is documented at a central location.
- is monitored and being graphed.
- is backed up.
- is updated regularly.
- has a defined production level. (spare, pre-production, production)
- has a defined owner and maintainer.
- has a predefined maintenance level.
- has a predefined availability.
- has a physical location.
- has a unique name, which is resolvable by DNS.
- has only required software installed.
- was installed with all currently available updates.
- was inspected and approved by a second person before being released to production.
- All parts are functional at all times. All faults are documented right away and repaired as soon as possible.
- There are always 2+ people informed about it.
- Network access vectors are defined.
- Configurations are not only available locally (including scripts).
- Sensitive data is protected.
A piece of hardware can be anything from a big server to a small temperature sensor in your server room.
A piece of hardware…
- has a maintenance contract or spare hardware available.
- has got an inventory number.
- is labeled (hostname + inventory).
- is physically secure (environmental! and mechanical access control).
- has an invoice, which is documented at a central location.
- should have redundant power supplies.
- should have some kind of out-of-band management solution (OOB).
- has at least one power circuit connected to an electrical circuit protected by an uninterruptible power supply (UPS).
- All tools needed to open and repair any part of the system are available.
- has at least two disks configured with RAID >= 1.
- has at least two separate network interface cards (NICs).
- has all RAID controllers backed with battery backed write caches (BBWC).
- was dimensioned with adequate future-proof hardware.
- has a lifetime of 2+ years.
A network device…
- is manageable or configurable.
- is supported by the configuration backup software in use (e.g. RANCID).
- provides the following protocols: STP, SNMP, IPv6 support (management + multicast), and RADIUS for AAA.
- does not forward the default VLAN (1) on its uplink/trunk ports.
- has a description for every port in use (including hostname and interface, e.g. server01#eth0, server01#oob, switch03#24).
- does not have any enabled but unused ports: set them to disabled and remove any other configuration from them.
- blocks or does not forward any discovery protocols on its user ports.
- uses AAA for authenticating users.
- logs to a central syslog server.
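A few of the port rules above lend themselves to automated auditing. The sketch below parses a switch configuration, e.g. one exported by RANCID, and flags enabled ports without a description as well as trunks that still carry VLAN 1. The IOS-style syntax is an assumption; adapt the parsing to your vendor.

```python
def audit_switch_config(config_text):
    """Return a list of policy violations found in an IOS-style config."""
    violations = []
    interface = None
    described = False
    shutdown = False

    def flush():
        # An enabled (not shut down) port must carry a description.
        if interface and not shutdown and not described:
            violations.append(f"{interface}: enabled port without description")

    for raw in config_text.splitlines():
        line = raw.strip()
        if raw.startswith("interface "):
            flush()
            interface = raw.split(None, 1)[1]
            described = False
            shutdown = False
        elif line.startswith("description "):
            described = True
        elif line == "shutdown":
            shutdown = True
        elif line.startswith("switchport trunk allowed vlan "):
            # The default VLAN (1) must not be forwarded on trunk ports.
            vlans = line.rsplit(" ", 1)[1].split(",")
            if "1" in vlans:
                violations.append(f"{interface}: default VLAN 1 on trunk port")
    flush()
    return violations
```

Run this against every config snapshot in your backup repository, and the check becomes part of your monitoring rather than a one-off review.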
An operating system (OS) is considered everything running on a server or instance to support a service or an application.
An Operating System…
- uses OS-CHOICE-HERE/stable as default distribution on servers.
- uses OS-CHOICE-HERE as default on clients.
- reboots without any manual intervention.
- provides access by SSH.
- does not permit root login via SSH.
- has a root password set.
- keeps the current time synchronized with a time server and uses TIMEZONE-CHOICE-HERE as its time zone.
- can resolve internal and internet names via DNS.
- installs software by packages.
- installs packages from a central internal repository and the official distribution repositories.
- software installed by packages should conform to the FHS.
- software not installed by packages should be installed by a reproducible deployment process.
- has sane defaults set for user and process environments (locales, shells, screen, some handy tools, etc.).
- should not provide typical compiler tools (gcc, build-essential).
- provides a manageable AAA concept (e.g. automated provisioning and de-provisioning of staff users).
- sends mail destined for root to a central location.
- provides a local mailer.
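The root-mail rule above is easy to verify automatically. The sketch below checks a sendmail/postfix-style `/etc/aliases` file for a root alias; the file format (`name: target`, `#` comments) is an assumption that fits most Unix mailers.

```python
def root_alias(aliases_text):
    """Return the forwarding target for root, or None if mail for
    root is not redirected anywhere (a policy violation)."""
    for line in aliases_text.splitlines():
        # Strip comments and surrounding whitespace first.
        line = line.split("#", 1)[0].strip()
        if line.startswith("root:"):
            return line.split(":", 1)[1].strip() or None
    return None
```

Feed it the contents of `/etc/aliases` from a new host and fail the review if it returns `None`.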
Hostnames exist to identify every part of your infrastructure uniquely. They are used to refer to systems in your configurations and in discussions. You should think about a naming convention, but here are some rough guidelines.
- have to be unique.
- have to end with a number, which should never be reused and always be incremented.
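The two rules above can be enforced with a tiny allocator. This is a sketch under the assumption that your inventory can hand you the set of every name ever issued, including retired ones; the function and prefix names are illustrative.

```python
import re

def next_hostname(prefix, issued):
    """Return the next free name for `prefix`, e.g. 'web03' after 'web02'.

    `issued` is the set of every name ever handed out, including retired
    ones, so trailing numbers are always incremented and never reused.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbers = [int(m.group(1)) for name in issued
               if (m := pattern.match(name))]
    return f"{prefix}{max(numbers, default=0) + 1:02d}"
```

Keeping retired names in the issued set is the whole point: a decommissioned `web02` must never come back as a different machine.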
A service is considered everything running on a server’s operating system to provide continuous functionality (e.g. a script or an application).

A service…
- logs only errors and auditing information. Application services may log more (e.g. the Apache access log).
- has defined log retention times.
- logs to syslog wherever possible.
- authenticates only over secure connections.
- has an adequately sized and future-proof datastore.
- was deployed in a reproducible way.
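For a Python-based service, the syslog rule above needs nothing beyond the standard library. This is a sketch; the default address `/dev/log` is a Linux assumption, and you can pass a `(host, port)` tuple instead to log via UDP to a central syslog server.

```python
import logging
import logging.handlers

def service_logger(name, address="/dev/log"):
    """Return a logger for `name` that sends records to syslog."""
    logger = logging.getLogger(name)
    handler = logging.handlers.SysLogHandler(address=address)
    handler.setFormatter(logging.Formatter(f"{name}: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Errors and auditing information only, per the policy above.
    logger.setLevel(logging.WARNING)
    return logger
```

Services written in other languages get the same effect through their syslog bindings or by logging to stdout under a supervisor that forwards to syslog.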
A network is considered any part of the infrastructure which is used to interconnect servers or systems (layers 1, 2, 3, 4, …).

A network…
- has clear entry and routing points.
- has a diagram which describes access vectors, the logical and physical setup.
- is deployed in adequate and future-proof dimensions (vlans, ip addresses, bandwidth).
- uses structured cabling.
- has no cross-cabling, except in very rare situations (e.g. HA cabling).
- should not be used for multiple purposes, or at least should not mix the following classifications:
| Classification | Purpose |
|----------------|---------|
| mgmt | Management network (monitoring, remote access) |
| traffic | Site-local traffic network |
| backup | Traffic network for backups |
| voip | VoIP telephony network |
| clients | A network with client workstations |
| devel | A network with development machines |
| staging | A network with staging equipment |
- OOBs are easy to reach, even in case of an outage.
- VLAN-IDs are considered global, create a list.
- All VLAN-IDs below 99 are switch-local.
- VLANs have a name and a location.
- All address space is considered global (vlans, ip- and mac addresses, including RFC1918)
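The VLAN rules above amount to a small registry contract: IDs are global, every VLAN has a name and a location, and the low range is reserved for switch-local use. Here is a minimal sketch of such a registry; the class and storage are assumptions, and in practice the list would live in your documentation system.

```python
SWITCH_LOCAL_BELOW = 99  # VLAN-IDs below 99 are switch-local by policy

class VlanRegistry:
    """Global list of VLAN-IDs, each with a name and a location."""

    def __init__(self):
        self._vlans = {}  # id -> (name, location)

    def register(self, vlan_id, name, location):
        if vlan_id < SWITCH_LOCAL_BELOW:
            raise ValueError(f"VLAN {vlan_id} is reserved for switch-local use")
        if vlan_id in self._vlans:
            raise ValueError(f"VLAN {vlan_id} already registered as {self._vlans[vlan_id]}")
        self._vlans[vlan_id] = (name, location)

    def lookup(self, vlan_id):
        return self._vlans.get(vlan_id)
```

Treating registration as an operation that can fail is what makes "VLAN-IDs are considered global" enforceable rather than aspirational.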
To round up my article, here is an example checklist we use to peer-review new systems:
Example Review Checklist
Every newly deployed host or instance should undergo a peer-review process. The checklist below will provide you with a couple of base acceptance criteria and is going to ensure a certain level of quality. Give it to any other sysadmin and ask him or her to check the system, before it’s put into production.
- DNS works (including reverse dns) :
- SSH login works :
- Host+services monitored :
- Host+services graphed :
- All filesystems backed up :
- Database dumps :
- All Updates installed :
- Host in HostDoc :
- Puppet works :
- Time is accurate :
- Root mails are being delivered :
- Firewall is active :
- No unneeded services are reachable (nmap) :
- Network configuration works (+ipv6) :
- Syslog/dmesg/oob logs are clean of errors :
-- Physical Host --
- Root password documented :
- Root login works :
- OOB password documented :
- OOB login works :
- OOB monitored :
- Switch ports are labeled (+ documented) :
- Hardware is labeled (+ documented in rack docu) :
- Firmware up to date :
- RAID level is >= 1 and all disks OK :
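Several of the checklist items can be scripted. As an example, here is a sketch of the first one, "DNS works (including reverse DNS)": the resolver functions are injectable so the check can be exercised offline, and by default it would use the standard `socket` module.

```python
import socket

def dns_roundtrip_ok(hostname,
                     resolve=socket.gethostbyname,
                     reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """True if `hostname` resolves and its PTR record points back at it."""
    try:
        ip = resolve(hostname)
        # Forward and reverse lookup must agree for the check to pass.
        return reverse(ip).rstrip(".") == hostname.rstrip(".")
    except OSError:  # covers gaierror/herror: unresolvable either way
        return False
```

A small battery of such checks, run by the reviewing sysadmin, turns the checklist from a piece of paper into a repeatable acceptance test.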