A Systems Policy
Recently I talked to a couple of friends, who all complained quite a bit about their operations or internal IT departments.
Most of these teams had to struggle with very basic things: they lacked decent monitoring, or any monitoring at all; they didn’t deploy systems, they installed them by hand; systems were not documented; and so on.
So here are some guidelines I try to follow with my team. This is by far not a complete list of what it takes to run successful operations, but it should give you a fair idea.
Also please note that you might want to adapt this policy to fit your own needs. I’m coming from the web industry, but we still run our own hardware, so this might not fit a typical cloud-based infrastructure.
A system is considered the lowest-level part of our infrastructure and services. All rules defined here should be considered in all other policies.

A system…
- is documented at a central location.
- is monitored and being graphed.
- is backed up.
- is updated regularly.
- has a defined production level. (spare, pre-production, production)
- has a defined owner and maintainer.
- has a predefined maintenance level.
- has a predefined availability.
- has a physical location.
- has a unique name, which is resolvable by DNS.
- has only required software installed.
- was installed with all currently available updates.
- was inspected and approved by a second person before being released to production.
- All parts are functional at all times. All faults are documented right away and repaired as soon as possible.
- There are always 2+ people informed about it.
- Network access vectors are defined.
- Configurations are not only available locally (including scripts).
- Sensitive data is protected.
A piece of hardware can be anything from a big server to a small temperature sensor in your server room.
A piece of hardware…
- has a maintenance contract or spare hardware available.
- has got an inventory number.
- is labeled (hostname + inventory).
- is physically secure (environmental! and mechanical access control).
- has an invoice, which is documented at a central location.
- should have redundant power supplies.
- should have some kind of out-of-band management solution (OOB).
- has at least one power circuit connected to an electrical circuit protected by an uninterruptible power supply (UPS).
- All tools needed to open and repair any part of the system are available.
- has at least two disks configured with RAID >= 1.
- has at least two separate network interface cards (NICs).
- has all RAID controllers backed with battery backed write caches (BBWC).
- was dimensioned with adequate future-proof hardware.
- has a lifetime of 2+ years.
A network device…
- is manageable or configurable.
- is supported by the configuration backup software in use (e.g. RANCID).
- provides the following protocols: STP, SNMP, IPv6 support (management + multicast), and RADIUS for AAA.
- does not forward the default VLAN (1) on its uplink/trunk ports.
- has a description for every port in use (including hostname and interface, e.g. server01#eth0, server01#oob, switch03#24).
- does not have any enabled but unused ports: set them to disabled and remove any other configuration from them.
- blocks or does not forward any discovery protocols on its user ports.
- uses AAA for authenticating users.
- logs to a central syslog server.
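A few of the port rules above lend themselves to automated auditing. The sketch below parses a switch configuration, e.g. one exported by RANCID, and flags enabled ports without a description as well as trunks that still carry VLAN 1. The IOS-style syntax is an assumption; adapt the parsing to your vendor.

```python
def audit_switch_config(config_text):
    """Return a list of policy violations found in an IOS-style config."""
    violations = []
    interface = None
    described = False
    shutdown = False

    def flush():
        # An enabled (not shut down) port must carry a description.
        if interface and not shutdown and not described:
            violations.append(f"{interface}: enabled port without description")

    for raw in config_text.splitlines():
        line = raw.strip()
        if raw.startswith("interface "):
            flush()
            interface = raw.split(None, 1)[1]
            described = False
            shutdown = False
        elif line.startswith("description "):
            described = True
        elif line == "shutdown":
            shutdown = True
        elif line.startswith("switchport trunk allowed vlan "):
            # The default VLAN (1) must not be forwarded on trunk ports.
            vlans = line.rsplit(" ", 1)[1].split(",")
            if "1" in vlans:
                violations.append(f"{interface}: default VLAN 1 on trunk port")
    flush()
    return violations
```

Run this against every config snapshot in your backup repository, and the check becomes part of your monitoring rather than a one-off review.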
An operating system (OS) is considered everything running on a server or instance to support a service or an application.
An Operating System…
- uses OS-CHOICE-HERE/stable as default distribution on servers.
- uses OS-CHOICE-HERE as default on clients.
- reboots without any manual intervention.
- provides access by SSH.
- does not permit root login via SSH.
- has a root password set.
- keeps the current time synchronized with a time server and uses TIMEZONE-CHOICE-HERE as its time zone.
- can resolve internal and internet names via DNS.
- installs software by packages.
- installs packages from a central internal repository and the official distribution repositories.
- software installed by packages should conform to the FHS.
- software not installed by packages should be installed by a reproducible deployment process.
- has sane defaults set for user and process environments (locales, shells, screen, some handy tools, etc.).
- should not provide typical compiler tools (gcc, build-essential).
- provides a manageable AAA concept (e.g. automated provisioning and de-provisioning of staff users).
- sends mail destined for root to a central location.
- provides a local mailer.
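The root-mail rule above is easy to verify automatically. The sketch below checks a sendmail/postfix-style `/etc/aliases` file for a root alias; the file format (`name: target`, `#` comments) is an assumption that fits most Unix mailers.

```python
def root_alias(aliases_text):
    """Return the forwarding target for root, or None if mail for
    root is not redirected anywhere (a policy violation)."""
    for line in aliases_text.splitlines():
        # Strip comments and surrounding whitespace first.
        line = line.split("#", 1)[0].strip()
        if line.startswith("root:"):
            return line.split(":", 1)[1].strip() or None
    return None
```

Feed it the contents of `/etc/aliases` from a new host and fail the review if it returns `None`.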
Hostnames exist to identify every part of your infrastructure uniquely. They are used to refer to systems in your configurations and in discussions. You should think about a naming convention, but here are some rough guidelines.
- have to be unique.
- have to end with a number, which should never be reused and always be incremented.
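The two rules above can be enforced with a tiny allocator. This is a sketch under the assumption that your inventory can hand you the set of every name ever issued, including retired ones; the function and prefix names are illustrative.

```python
import re

def next_hostname(prefix, issued):
    """Return the next free name for `prefix`, e.g. 'web03' after 'web02'.

    `issued` is the set of every name ever handed out, including retired
    ones, so trailing numbers are always incremented and never reused.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    numbers = [int(m.group(1)) for name in issued
               if (m := pattern.match(name))]
    return f"{prefix}{max(numbers, default=0) + 1:02d}"
```

Keeping retired names in the issued set is the whole point: a decommissioned `web02` must never come back as a different machine.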
A service is considered everything running on a server’s operating system to provide continuous functionality (e.g. a script or an application).

A service…
- logs only errors and auditing information. Application services may log more (e.g. the Apache access log).
- has defined log retention times.
- logs to syslog wherever possible.
- authenticates only over secure connections.
- has an adequately sized and future-proof datastore.
- was deployed in a reproducible way.
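For a Python-based service, the syslog rule above needs nothing beyond the standard library. This is a sketch; the default address `/dev/log` is a Linux assumption, and you can pass a `(host, port)` tuple instead to log via UDP to a central syslog server.

```python
import logging
import logging.handlers

def service_logger(name, address="/dev/log"):
    """Return a logger for `name` that sends records to syslog."""
    logger = logging.getLogger(name)
    handler = logging.handlers.SysLogHandler(address=address)
    handler.setFormatter(logging.Formatter(f"{name}: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Errors and auditing information only, per the policy above.
    logger.setLevel(logging.WARNING)
    return logger
```

Services written in other languages get the same effect through their syslog bindings or by logging to stdout under a supervisor that forwards to syslog.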
A network is considered any part of the infrastructure which is used to interconnect servers or systems (layers 1, 2, 3, 4, …).

A network…
- has clear entry and routing points.
- has a diagram which describes access vectors, the logical and physical setup.
- is deployed in adequate and future-proof dimensions (vlans, ip addresses, bandwidth).
- uses structured cabling.
- has no cross-cabling, except in very rare situations (e.g. HA cabling).
- should not be used for multiple purposes, or at least should not mix the following classifications:
| Classification | Purpose |
|----------------|---------|
| mgmt | Management network (monitoring, remote access) |
| traffic | Site-local traffic network |
| backup | Traffic network for backups |
| voip | VoIP telephony network |
| clients | A network with client workstations |
| devel | A network with development machines |
| staging | A network with staging equipment |
- OOBs are easy to reach, even in case of an outage.
- VLAN-IDs are considered global, create a list.
- All VLAN-IDs below 99 are switch-local.
- VLANs have a name and a location.
- All address space is considered global (vlans, ip- and mac addresses, including RFC1918)
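The VLAN rules above amount to a small registry contract: IDs are global, every VLAN has a name and a location, and the low range is reserved for switch-local use. Here is a minimal sketch of such a registry; the class and storage are assumptions, and in practice the list would live in your documentation system.

```python
SWITCH_LOCAL_BELOW = 99  # VLAN-IDs below 99 are switch-local by policy

class VlanRegistry:
    """Global list of VLAN-IDs, each with a name and a location."""

    def __init__(self):
        self._vlans = {}  # id -> (name, location)

    def register(self, vlan_id, name, location):
        if vlan_id < SWITCH_LOCAL_BELOW:
            raise ValueError(f"VLAN {vlan_id} is reserved for switch-local use")
        if vlan_id in self._vlans:
            raise ValueError(f"VLAN {vlan_id} already registered as {self._vlans[vlan_id]}")
        self._vlans[vlan_id] = (name, location)

    def lookup(self, vlan_id):
        return self._vlans.get(vlan_id)
```

Treating registration as an operation that can fail is what makes "VLAN-IDs are considered global" enforceable rather than aspirational.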
To round up my article, here is an example checklist we use to peer-review new systems:
Example Review Checklist
Every newly deployed host or instance should undergo a peer-review process. The checklist below will provide you with a couple of base acceptance criteria and is going to ensure a certain level of quality. Give it to any other sysadmin and ask him or her to check the system, before it’s put into production.
- DNS works (including reverse dns) :
- SSH login works :
- Host+services monitored :
- Host+services graphed :
- All filesystems backed up :
- Database dumps :
- All Updates installed :
- Host in HostDoc :
- Puppet works :
- Time is accurate :
- Root mails are being delivered :
- Firewall is active :
- No unneeded services are reachable (nmap) :
- Network configuration works (+ipv6) :
- Syslog/dmesg/oob logs are clean of errors :
-- Physical Host --
- Root password documented :
- Root login works :
- OOB password documented :
- OOB login works :
- OOB monitored :
- Switch ports are labeled (+ documented) :
- Hardware is labeled (+ documented in rack docu) :
- Firmware up to date :
- RAID level is >= 1 and all disks OK :
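Several of the checklist items can be scripted. As an example, here is a sketch of the first one, "DNS works (including reverse DNS)": the resolver functions are injectable so the check can be exercised offline, and by default it would use the standard `socket` module.

```python
import socket

def dns_roundtrip_ok(hostname,
                     resolve=socket.gethostbyname,
                     reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """True if `hostname` resolves and its PTR record points back at it."""
    try:
        ip = resolve(hostname)
        # Forward and reverse lookup must agree for the check to pass.
        return reverse(ip).rstrip(".") == hostname.rstrip(".")
    except OSError:  # covers gaierror/herror: unresolvable either way
        return False
```

A small battery of such checks, run by the reviewing sysadmin, turns the checklist from a piece of paper into a repeatable acceptance test.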