A Systems Policy

Recently I talked to a couple of friends, all of whom complained quite a bit about their operations or internal IT departments.

Most of these teams had to fight with some very basic things. They lacked a decent monitoring system, or any monitoring at all. They didn’t deploy systems automatically; they installed them by hand. Systems were not documented, and so on.

So here are some guidelines I try to live up to with my team. This is by far not a complete list of things you need to run successful operations, but it should give you a fair hint of what it takes.

Also, please note that you might want to adapt this policy a bit to fit your needs. I come from the web industry, but we still run our own hardware, so it may especially not fit a typical cloud-based infrastructure.

Systems

A system is considered the lowest-level part of our infrastructure and services. All rules defined here should be considered in all other policies.

A system…

Hardware

A piece of hardware can be anything from a big server to a small temperature sensor in your server room.

A piece of hardware…

All tools needed to open and repair any part of the system are available.

Servers

A server…

Switches

A switch…

Operating Systems

An operating system (OS) is considered everything running on a server or instance to support a service or an application.

An Operating System…

Hostnames

Hostnames exist to identify every part of your infrastructure uniquely. They are used to refer to systems in your configurations and in discussions. You should think about a naming convention, but here are some rough guidelines.

Hostnames …
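To make the idea concrete, here is a hedged sketch of one possible convention (site, role, index, network class, domain). Every name in it is illustrative, loosely modelled on hostnames like phx-vnode03.oob.ono.at that appear later in this article:

```ruby
# A sketch of one possible naming convention:
#   <site>-<role><index>.<network class>.<domain>
# All components below are illustrative, not a prescription.
def hostname(site:, role:, index:, net_class:, domain:)
  format('%s-%s%02d.%s.%s', site, role, index, net_class, domain)
end

puts hostname(site: 'phx', role: 'vnode', index: 3,
              net_class: 'oob', domain: 'ono.at')
# => phx-vnode03.oob.ono.at
```

Encoding the convention in a tiny helper like this keeps generated configurations, monitoring and DNS consistent, since there is exactly one place where the scheme is defined.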

Services

A service is considered everything running on a server’s operating system to provide continuous functionality (e.g. a script or an application).

A service…

Networks

A network is considered any part of the infrastructure used to interconnect servers or systems (layers 1, 2, 3, 4, …).

A Network…

Class    Description
net      Internet/upstream network
mgmt     Management network (monitoring, remote access)
traffic  Site-local traffic network
backup   Traffic network for backups
voip     VoIP telephony network
clients  A network with client workstations
devel    A network with development machines
staging  A network with staging equipment

To round up my article, here is an example checklist we use to peer-review new systems:

Example Review Checklist

Every newly deployed host or instance should undergo a peer-review process. The checklist below provides a couple of basic acceptance criteria and ensures a certain level of quality. Give it to another sysadmin and ask him or her to check the system before it’s put into production.

-- Physical Host --

OpenVZ API

I love OpenVZ. I think it’s one of the easiest-to-use virtualisation technologies on the market, and it adds almost no overhead compared to other technologies.

I’ve been using it for a couple of years now, and I always wanted a nicer way to automate container creation, configuration and actions than writing shell scripts. There are already a couple of web interfaces around, but I didn’t like any of them.

Another possibility would be to use libvirt, but libvirt always felt a bit too complex, since it’s a general API implementation for several hypervisors.

So I started to implement my own API, which is meant to be a simple and minimalistic approach. The project is hosted on GitHub, but you can also install it via RubyGems.

gem install openvz

Restart a container

A small example to restart a container:

require 'rubygems'
require 'openvz'

container = OpenVZ::Container.new('109')
container.restart

Provisioning Example

Here is an example of what provisioning a whole new container could look like.

The script creates the container configuration and runs debootstrap, then sets the nameserver, IP address and hostname. Afterwards it runs a couple of commands to update the system and install Puppet.

require 'rubygems'
require 'openvz'

container = OpenVZ::Container.new('110')

container.create(:ostemplate => 'debian-6.0-bootstrap',
                 :config     => 'vps.basic')

container.debootstrap(:dist   => 'squeeze',
                      :mirror => 'http://cdn.debian.net/debian')

container.set(:nameserver => '8.8.8.8',
              :ipadd      => '10.0.0.2',
              :hostname   => 'foo.ono.at')

container.start

# Update the system
container.command('aptitude update ; aptitude -y upgrade ; apt-key update')

# Install puppet
container.command('aptitude -o Aptitude::Cmdline::ignore-trust-violations=true -y install puppet')

# Run puppet
container.command('puppetd -t --server=puppet.ono.at')

Upgrading HP Firmware

Lately we bought a new HP blade chassis to replace a customer’s old database server. All its services run on ~15 blades, split across two HP C7000 chassis.

The ProLiant BL460c G6 we bought came with much newer firmware revisions than all the existing G1 blades; parts of the infrastructure hadn’t received much sysadmin love for quite some time. :-)

Blades, ILO, chassis and controllers were all running badly outdated firmware, and upgrading was highly recommended. The resulting firmware combinations hadn’t been tested, and the new blade wouldn’t even be detected, according to HP. They offered us an upgrade for about $2000 and six hours of downtime per chassis.

Here are some handy findings for doing the upgrade on your own:

HP Firmware Compatibility Matrix

HP has tested certain sets of firmware for compatibility. Take a look at their compatibility matrix and try to stay within the tested boundaries. This could mean upgrading in more than one step if you are running an older release.

(http://h18004.www1.hp.com/products/blades/components/c-class.html)
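The multi-step idea can be sketched in a few lines. The release sets below are purely hypothetical placeholders; the real tested combinations come from HP’s matrix:

```ruby
# Hypothetical tested release sets, ordered oldest to newest. HP's real
# matrix lists which component revisions were tested together; the point
# is only that you may have to step through intermediate sets.
TESTED_SETS = %w[8.60 9.00 9.20].freeze

# Return the intermediate release sets needed to reach the latest one.
def upgrade_path(current)
  idx = TESTED_SETS.index(current)
  raise ArgumentError, "unknown release set: #{current}" unless idx
  TESTED_SETS[(idx + 1)..-1]
end

upgrade_path('8.60')  # two upgrade runs: first to 9.00, then to 9.20
```

In other words: never jump straight from an old release set to the newest one unless the matrix says that combination was tested along the way.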

hp-firmware-catalog

There is Christian Hofstedtler’s great firmware upgrade script, which automatically downloads the latest and greatest HP firmware installation packages. It even creates symlinks, mapping cryptic firmware package names to their corresponding hardware components.

(https://github.com/zeha/hp-firmware-catalog)

You can run the packages from your OS as an online upgrade. Certain components might still require a reboot to finish the “delayed upgrade”.

I would love to see HP maintain this, since the approach is a good example of providing customers with a modern and automated way to upgrade and monitor firmware for more recent releases.

ILO Shell

When upgrading many machines, it will save you a lot of time to just use the SSH shell for configuring a boot device and rebooting the server.

Connect to ILO using SSH

Make sure you send the right username; AFAIK it’s case-sensitive on the ILO:

    ssh phx-vnode03.oob.ono.at -l Administrator

Set an ILO Advanced Licence key

    cd /map1
    set license=YOUR-LICENCE-KEY

The advanced licence key is required to enable virtual device firmware features, e.g. to make use of the remote console or a virtual disk boot drive.

Mount and configure a network-hosted ISO image as the boot device

    cd /map1
    vm cdrom insert http://10.0.10.21/FW920B.2010_1129.2.iso
    vm cdrom set boot_always

…be it a firmware upgrade or an OS installation disk. Make sure you run the following command to “eject” it again:

     cd /map1
     vm cdrom eject
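When a whole rack needs the same sequence, a small script can generate (or run) the SSH invocations. Everything here is a sketch: the hostnames are made up, and you should verify that your ILO firmware accepts a one-shot remote command string; if not, pipe the commands in via stdin instead.

```ruby
# A sketch for driving many ILOs with the same command sequence.
# Hostnames are hypothetical; swap `puts` for `system` to actually run it.
HOSTS = %w[phx-vnode01 phx-vnode02 phx-vnode03].freeze

ILO_COMMANDS = [
  'cd /map1',
  'vm cdrom insert http://10.0.10.21/FW920B.2010_1129.2.iso',
  'vm cdrom set boot_always',
].join('; ')

HOSTS.each do |host|
  cmd = %(ssh #{host}.oob.ono.at -l Administrator "#{ILO_COMMANDS}")
  puts cmd  # replace with: system(cmd)
end
```

Printing the commands first gives you a cheap dry run; once the output looks right, let the script execute them for real.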

Monitoring

To please your monitoring system as well, check out Check_MK. They wrote a couple of good SNMP checks for your HP or IBM BladeCenter.

In the end, I can highly recommend keeping your hardware firmware up to date. At least HP, my vendor of choice, adds a lot of useful bug fixes.

HP currently informs customers about updates via an e-mail newsletter; I would love to see this in my monitoring system too, like all the other security upgrades.

Try to plan the upgrade a bit or use existing downtimes to boot the HP Firmware Maintenance image.
