Dec 17 2018
 

The OpenShift community produces a lot of interesting tutorials about how to try new solutions and configurations, but unfortunately they are mostly based on a minimal setup such as MiniShift, which is definitely a cool gimmick but bears little resemblance to a real cluster setup. Often those posts only cover the known good path of how something is supposed to work in the best case. They rarely mention how it could be debugged or fixed if it doesn’t work as expected. As we all know, the more complex a system is, the more can go wrong, and this technology is no exception, especially when run in a real distributed setup. To give you some insight into how such procedures can go wrong, I’d like to share the experience I had when I tried to update my multi-master/multi-node OKD cluster. As an experienced Linux engineer or developer you might think that version updates are nothing special or exciting, but this experience will disabuse you of that notion. I hit many issues, and here is how I worked through them.

IMPORTANT: This is not a guide on how to upgrade OpenShift. It’s only a field report which omits a lot of technical details needed for a successful upgrade. Please always consult the official documentation.

Starting Position

At home I run a small OKD cluster consisting of three masters, each also hosting an etcd member, and four nodes: two infrastructure nodes, hosting the routers and the registry, and two compute nodes, hosting the applications. I feel that’s the minimal setup required to resemble a production-like cluster. To make the setup a bit more interesting, the persistent storage is served by a container-native storage (CNS) configuration where three GlusterFS pods are distributed across the masters. This definitely deviates from how a production cluster should be set up, but as I’m running this locally at home, I don’t have enough resources available for separate storage nodes.

My masters and nodes run atop CentOS Atomic Host, which I updated to the latest 7.1811 release just a few days ago. As identity provider for OpenShift I’m using a FreeIPA server on a separate CentOS box. Since I installed this cluster with OpenShift Origin 3.9 five months ago, it has been running stably and I’ve had a lot of fun with it. After the recently published security advisories, the time had finally come to upgrade OpenShift.

What will change?

The first thing you should always do before starting an OpenShift update is to carefully read the release notes. I explicitly linked the Red Hat OpenShift Container Platform (OCP) release notes here, because OKD unfortunately doesn’t maintain theirs as nicely. For the initial release update they are mostly congruent. Make sure to study them carefully, as they might be the primary source of information once something starts going down. For the 3.10 update, an important piece of information is the new handling of the containerized master controller and API services. With that we at least have a basic idea of what to expect.

Updating the Ansible Inventory

It would be nice if there were a fool-proof command to run the update, and it seems that OpenShift 4 with its Cluster Version Operator is heading there. But until then we need to carefully study and follow the official OKD 3.10 Upgrade Guide. It’s important to get the documentation for the correct release because the required adjustments to the Ansible inventory differ from release to release. For those unfamiliar with how OpenShift 3.x release upgrades work: they are done via Ansible playbooks which use the same inventory (the definition of how everything is configured) as the initial OpenShift cluster installation.

In my inventory file, I first added the Node Group assignments. E.g. the infrastructure nodes are no longer defined via the openshift_node_labels variable, but via a dedicated openshift_node_group_name variable which references a node group definition from the openshift_node_groups configuration. The same change also has to be made for the master and the compute nodes:

  • OpenShift 3.9:
    [infra-nodes:vars]
    # Set region to be dedicated for infrastructure pods
    openshift_node_labels={'region': 'infra', 'zone': 'default'}
    
  • OpenShift 3.10:
    [infra-nodes:vars]
    # Set infra node group
    openshift_node_group_name='node-config-infra'
    

Note that although the openshift_node_labels variable is no longer effective, no labels will be removed during the upgrade. So if you don’t get the label definition right at the beginning, you don’t have to worry that some workload is suddenly no longer scheduled after an in-place upgrade.

I had some custom openshift_node_kubelet_args defined in my OpenShift 3.9 inventory, but this variable is no longer respected either. With 3.10 the correct way to customize the node configuration is to define an edits argument in the corresponding node group definition, which is then applied to a ConfigMap resource by the custom yedit Ansible module. Writing such a definition is already not super intuitive in itself, and it can only be done by re-defining the entire openshift_node_groups variable, possibly also blowing up every other node group definition if done wrong. I therefore chose to drop my custom node configuration entirely for now to make the inventory less error-prone.
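
For illustration, such a redefinition might look roughly like the snippet below. The group name and the max-pods value are only examples; keep in mind that redefining openshift_node_groups replaces all default groups, so every group used in the cluster would have to be listed again.

openshift_node_groups=[{'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true'], 'edits': [{'key': 'kubeletArguments.max-pods', 'value': ['250']}]}]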

Before running the upgrade playbooks it’s also important that every manual configuration change made in the past (e.g. in master-config.yml) is reflected somewhere in the Ansible inventory. Otherwise the change might be lost after the upgrade. In my setup I still had to add the LDAP authenticator to the openshift_master_identity_providers variable because I had added it manually after the initial cluster installation.
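
For reference, a minimal LDAP entry in that variable might look something like this. The provider name, server URL, bind settings and attribute mapping are placeholders and need to match your own FreeIPA setup:

openshift_master_identity_providers=[{'name': 'freeipa', 'challenge': 'true', 'login': 'true', 'kind': 'LDAPPasswordIdentityProvider', 'attributes': {'id': ['dn'], 'email': ['mail'], 'name': ['cn'], 'preferredUsername': ['uid']}, 'bindDN': '', 'bindPassword': '', 'insecure': 'false', 'url': 'ldaps://ipa.example.com:636/cn=users,cn=accounts,dc=example,dc=com?uid'}]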

The section about Special Considerations When Using Containerized GlusterFS gave me a bit of a bad feeling, as my GlusterFS pods are running on the control-plane hosts. But it’s not an easy task to change that now, so I chose to go on with the upgrade and hope for a workaround in case something should break.

Fixing a failed CNS Brick Process

Once I felt confident that my inventory was in good shape, I started the control-plane upgrade playbook located at playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml. After only a few minutes it failed for the first time. The error message said that my hosted registry persistent volume was not healthy. That was actually good news, because the playbook properly detected my unconventional CNS setup and was even able to check its health. Fortunately, this issue was already familiar to me and it was easily fixed. Here is how you do this:

  1. Change to the glusterfs project (or any custom project where you are running the GlusterFS pods) as a project or cluster administrator and query the names of the GlusterFS pods:
    $ oc project glusterfs
    Now using project "glusterfs" on server "https://openshift.example.com:8443".
    $ oc get pods -n glusterfs -o wide
    NAME                      READY     STATUS    RESTARTS   AGE       IP            NODE
    glusterfs-storage-dz8qj   1/1       Running   3          2d        10.0.0.10     master01.example.com
    glusterfs-storage-jncsl   1/1       Running   3          1d        10.0.0.11     master02.example.com
    glusterfs-storage-r24wg   1/1       Running   2          11h       10.0.0.12     master03.example.com
    heketi-storage-1-w8c42    1/1       Running   1          6d        10.129.0.21   node02.example.com
    
  2. Connect to one of the GlusterFS pods and list the volume status. E.g.:
    $ oc rsh glusterfs-storage-dz8qj
    sh-4.2# gluster volume status
    Status of volume: glusterfs-registry-volume                                 
    Gluster process                             TCP Port  RDMA Port  Online  Pid  
    ------------------------------------------------------------------------------
    Brick 10.0.0.11:/var/lib/heketi/mounts/vg_61                                 
    bc5a26248cc6ea9fb7ffaae4edbe93/brick_128dfe                                 
    b5436dad25702e689d3d6f4b8a/brick            49152     0          Y       204
    Brick 10.0.0.12:/var/lib/heketi/mounts/vg_5e                                   
    cf67ed85f71cf28090d7db1acc6433/brick_d36469                                 
    4c7034276c22e264eb2576413b/brick            49152     0          N
    Brick 10.0.0.10:/var/lib/heketi/mounts/vg_49                                   
    e81b0f4c91942c7657e9b7ffff7834/brick_8717e0                                 
    992390a7c04431890ba56b7656/brick            49152     0          Y       224
    Self-heal Daemon on localhost               N/A       N/A        Y       215
    Self-heal Daemon on 10.0.0.11               N/A       N/A        Y       163  
    Self-heal Daemon on 10.0.0.12               N/A       N/A        Y       172  
                                                                                
    Task Status of Volume glusterfs-registry-volume                             
    ------------------------------------------------------------------------------
    There are no active volume tasks
    [...]
    

    In my experience it can happen that a brick sometimes displays an N in the Online column, which means that the corresponding brick process wasn’t started successfully. If multiple bricks of the same volume are down, your entire volume is down and must be properly recovered. In such a case don’t continue with the steps below!

  3. Via the IP address of the brick you can figure out which host is affected, and then you can simply delete the corresponding pod:
    $ oc delete pod glusterfs-storage-r24wg
    

    The pod will be automatically restarted and the brick processes should come up this time.
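
To double-check that the brick really came back, the same commands from step 2 can be run again; the previously failed brick should now show a Y in the Online column:

$ oc rsh glusterfs-storage-r24wg
sh-4.2# gluster volume status glusterfs-registry-volume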

Fixing the Hosted Registry Storage Definition

The second run of the control-plane playbook attested that all GlusterFS volumes were healthy, but it failed again only two tasks later with a rather cryptic error message, something like:

TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] **********************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
FAILED - RETRYING: Check for GlusterFS cluster health (120 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "cluster_name": "registry", 
            "exclude_node": "master01.example.com", 
            "oc_bin": "/usr/local/bin/oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig", 
            "oc_namespace": "default"
        }

Didn’t it just say that all the GlusterFS volumes are healthy? What the heck is "cluster_name": "registry" and what is it doing with GlusterFS in the ‘default’ namespace anyway?

I found the solution for this after digging deep into the openshift_storage_glusterfs Ansible role and reading the CNS installation instructions again and again. I had become a victim of my “simplified” CNS setup. The reference installation is meant to have two separate CNS GlusterFS clusters: one exclusively for the hosted registry volume (hinted at by the [glusterfs_registry] Ansible host group) and a second cluster for any other persistent volumes (hinted at by the [glusterfs] Ansible host group). As mentioned before, I’m limited in available hosts, so I added the master hosts to both host groups and set the glusterfs_devices variable to the same device when installing the CNS. That was already everything needed to create the hosted registry volume in the “regular” CNS cluster with OpenShift 3.9. The 3.10 playbook, however, expects the registry volume to be in a different project with a different naming. Fortunately, all that was needed to fix this were some additional inventory variables in the [OSEv3:vars] section:

# Adjust variables for registry storage to match default converged glusterfs storage setup
openshift_storage_glusterfs_registry_name=storage
openshift_storage_glusterfs_registry_namespace=glusterfs

With the updated inventory I started the control-plane upgrade playbook once more. This time it ran for quite a while and even started to do some real stuff. It replaced the docker run command in the ‘origin-node’ systemd service with a runc command using the 3.10 image. Finally some progress. But eventually another error aborted the playbook and again it was a totally unexpected one.

etcd Backup Failure

Before updating the etcd cluster, there is a task which backs up the etcd database, and this failed miserably. It couldn’t run docker exec etcd_container etcdctl backup [...]. When executing the command manually on a master host, I received the same error message:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

My first suspicion was the Ansible role. Maybe the backup command was wrong? But I couldn’t find any radical changes in the commit history regarding the etcd backup, and a blocking issue like this is unlikely to go unnoticed for such a long time. Maybe something was wrong with the image? The OpenShift Origin 3.9 setup was using a rather atypical image at that time, the only one from the Fedora image registry (registry.fedoraproject.org/latest/etcd:latest). When I see the ‘latest’ tag being used with containers I’m instantly suspicious that bugs might sneak in unnoticed, as different users may get different images depending on when they pull them. Maybe they mistakenly pushed an image without a shell or without the etcdctl binary? I tried to ask in the #openshift IRC channel on Freenode if someone had experienced the same issue before, but didn’t get any reply. Suddenly I had an idea: only a few hours earlier I had used the etcdctl tool from the Atomic Host to do my own etcd backup. I just needed to find a way to make Ansible use the etcdctl from the host and everything would be fine. I dug a bit in the Ansible etcd role, and a few minutes later I set r_etcd_common_etcdctl_command to "etcdctl" in my inventory, confident that this would fix my issue. It wouldn’t, but I wouldn’t find that out anytime soon…
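
For reference, this is the kind of line I added to the inventory (spoiler: it didn’t fix the problem):

# Use the etcdctl binary installed on the Atomic Host instead of the one in the etcd container
r_etcd_common_etcdctl_command=etcdctl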

The Master API cannot find the LDAP CA Certificate

In the next attempt, the playbook happily ran the etcd backup, upgraded the etcd images, converted etcd to a static pod on all masters and did the same for the other two control-plane services, the API service and the controller service, starting on the first master. Eventually the ‘origin-master-api’ and ‘origin-master-controller’ services were shut down and the corresponding pods should have been started, so the playbook waited for the API service to come up, and waited and waited… The pod didn’t come up. Hmpf. The time had come to use the new debugging command I had read about in the release notes to see what was going on:

# /usr/local/bin/master-logs api api

That is an alternative to the corresponding oc command, which I’m also able to run from my client machine:

$ oc logs master-api-node01.example.com -n kube-system

But the latter was behaving weirdly. Sometimes it hung although the API services of the two other nodes were still up. Something was definitely wrong.

When checking the logs locally, I saw an error that my FreeIPA CA certificate, which should be used to validate the LDAPS connections, could not be found. That was strange. I had explicitly configured the ca key in the openshift_master_identity_providers variable, pointing it to the correct CA certificate. I had done this in other OpenShift cluster inventories before and there it was working… But those were not running OpenShift 3.10 or later. With 3.10 the playbook developers removed the possibility to custom-name the CA certificate, so the ca key from the inventory was silently ignored. Only after checking the installation instructions regarding Configuring identity providers with Ansible did I find an inconspicuous comment that the CA certificate destination path now follows a given naming convention. When I added the identity provider configuration to the inventory before the update, I didn’t specify the openshift_master_openid_ca or openshift_master_openid_ca_file variables, which would have ensured that the CA certificate is copied to the correct place. After all, the certificates were already on the master hosts and the identity provider was working, so I didn’t want the upgrade playbook to touch the certificate. That was the result: my mistake. Still, I like issues that are this clear and can be fixed so easily. A quick rename of the certificate on the master hosts got the API service to start successfully again.
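
The rename itself was a one-liner per master. Both file names below are purely illustrative: the source is wherever the CA was originally copied to, and the target has to match the naming convention described in the identity provider documentation for your provider name:

# mv /etc/origin/master/ipa-ca.crt /etc/origin/master/freeipa_ldap_ca.crt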

How a Docker Bug broke the etcd Cluster

All API servers were running again, although only the first one in its final configuration, but oc invocations still felt sluggish. Sometimes they even hung completely. When checking the process list, I noticed that the etcd processes were only a few minutes old, and sometimes they weren’t running at all. So I checked the etcd cluster health, and there it was: two cluster members were down and one was in the state unhealthy. That is bad… Immediately, I started manually triggering etcd restarts, but only a few minutes later they would shut down again. I checked the log files and there were errors, but I couldn’t figure out a single reason for this mess. Then I found that /etc/etcd/etcd.conf had been updated during the playbook run, so I restored the backup, but again it wouldn’t fix the issue. Eventually I started to accept the thought that I might need to completely restore the etcd database from a backup, because the database might already be so corrupt that it wasn’t able to reach a stable state anymore.
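
For reference, the checks that revealed this were run directly on an etcd host, using the same etcdctl2 wrapper that also appears in the recovery steps further below:

# etcdctl2 cluster-health
# etcdctl2 member list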

The OKD documentation for Restoring etcd quorum would have been the correct guide to follow in this situation, but for some reason I landed at Restoring etcd. That confronted me with yet another issue: this guide was not yet properly updated for OpenShift 3.10. Some parts of the documentation still reference etcd as a systemd service, but in my setup it’s a pod. Trying to pass the --force-new-cluster parameter to the etcd process via a systemd override obviously doesn’t have any effect. Eventually I found out about the /etc/origin/node/pod/etcd.yaml file which contains the pod definition, and there the parameter can be passed correctly so that it is picked up by the pod startup command. But again, even with an empty database, my pod would die a few minutes later. Something was badly broken here. In the YAML definition I also found the liveness probe. So once the pod was started once more, I tried to execute the liveness probe to see what it returned, and the result looked familiar, in a bad way:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

Ouch! Now I understood why the pods kept restarting. It’s the same error that had already caused the etcd backup failure before. But now I was using the new quay.io/coreos/etcd:v3.2.22 image, which disproved my theory that a buggy image might be the reason. For the moment, I ran out of ideas… until I remembered that I had recently read a post about a docker bug (#1655214) that affected CentOS 7. Thanks for that! After checking the docker version on Atomic Host 7.1811 (docker-1.13.1-84.git07f3374.el7.centos) it was confirmed: that was the root cause of so much trouble so far.

Updating Docker on Atomic Host

I hadn’t needed to dig too deep into Atomic Host so far, as most things simply worked or were easily fixed with an update in the past. But this time it didn’t look like an update was imminent: release 7.1811 was only a few days old. I could roll back, but the previous version is 7.1808. That’s three months back and somewhat defeats the purpose of my update, getting the latest security fixes. Fortunately CentOS had already released new docker packages where this bug is fixed. Now I just needed a way to update the docker packages independently of the ostree image. This time the documentation gods were on my side: I quickly found Dusty Mabe’s Atomic Host 101 Lab Part 4: Package Layering, Experimental Features.
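
Before layering packages it’s worth checking the current deployments and whether a newer tree happens to be available; a minimal check could be:

# rpm-ostree status           # shows the booted deployment and the rollback candidate
# rpm-ostree upgrade --check  # reports whether a newer tree has been published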

Here is my guide for quickly working around Bug #1655214 by updating the docker packages to release 1.13.1-88.git07f3374.el7.centos on CentOS Atomic Host:

  1. Create a temporary directory and download the corresponding RPM packages from a mirror of your choice:
    # mkdir /tmp/docker-1.13.1-88
    # cd /tmp/docker-1.13.1-88
    # for pkg in docker docker-client docker-common docker-lvm-plugin docker-novolume-plugin ; do \
        curl -O https://mirror.init7.net/centos/7/extras/x86_64/Packages/$pkg-1.13.1-88.git07f3374.el7.centos.x86_64.rpm ; \
      done
    
  2. From within the directory run rpm-ostree override replace to replace the docker packages from the ostree layer with the new RPMs:
    # rpm-ostree override replace docker*
    Checking out tree ee5a6f2... done
    Inactive requests:
      docker (already provided by docker-2:1.13.1-84.git07f3374.el7.centos.x86_64)
    Enabled rpm-md repositories: base updates extras
    Updating metadata for 'base': [=============] 100%
    rpm-md repo 'base'; generated: 2018-11-25 16:00:34
    Updating metadata for 'updates': [=============] 100%
    rpm-md repo 'updates'; generated: 2018-12-10 15:34:27
    Updating metadata for 'extras': [=============] 100%
    rpm-md repo 'extras'; generated: 2018-12-10 16:00:03
    Importing metadata [=============] 100%
    Resolving dependencies... done
    Relabeling (5/5) [=============] 100%
    Applying 5 overrides
    Processing packages (10/10) [=============] 100%
    Running pre scripts... 1 done
    Running post scripts... 5 done
    Writing rpmdb... done
    Writing OSTree commit... done
    Copying /etc changes: 42 modified, 5 removed, 613 added
    Transaction complete; bootconfig swap: no; deployment count change: 0
    Freed: 580.5 kB (pkgcache branches: 0)
    Upgraded:
      docker 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-client 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-common 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-lvm-plugin 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-novolume-plugin 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
    Run "systemctl reboot" to start a reboot
    
  3. Reboot the host.
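
After the reboot, it’s worth double-checking that the layered packages are actually active; rpm -q should now report the new release, something like:

# rpm -q docker
docker-1.13.1-88.git07f3374.el7.centos.x86_64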

I carefully did this on one master server after the other, and surprisingly all the services (except etcd) started normally. Even my GlusterFS pods came up again as if nothing had happened. But still, the etcd cluster was offline and with it the master API was inaccessible. No oc commands were possible.

Fixing the etcd Cluster

With the docker issue fixed, I now had to bring up the etcd cluster again. The database was likely in a confused state because of all the failed attempts before, so I decided to restore a known good state. As briefly mentioned before, to do so you actually create a new cluster with the database from a backup. Because the OpenShift documentation on how to do this cannot be followed easily, below are the exact steps with which I managed to do it:

  1. Make sure that all the etcd processes are down and not coming up again automatically. On an OpenShift 3.10 cluster, you prevent the automatic startup by moving the /etc/origin/node/pod/etcd.yaml definition to a backup location, e.g. /etc/origin/node/pod/disabled/, on every etcd host.
  2. First create a new one-node etcd cluster on the first etcd host. To do so, we need some preparation:
    • The /etc/etcd/etcd.conf configuration must not contain any previous configurations regarding the INITIAL_CLUSTER or INITIAL_CLUSTER_STATE. I was able to simply use the etcd.conf generated by the upgrade playbook which already set those two variables to the correct value:
      ETCD_INITIAL_CLUSTER=
      ETCD_INITIAL_CLUSTER_STATE=new
      

      Also make sure that the ETCD_INITIAL_ADVERTISE_PEER_URLS only contains the URL of the first host itself and no other peers:

      ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.10:2380
      
    • Restore the etcd database from a backup. Fortunately the upgrade playbook automatically created a backup after the etcd upgrade, so I’m going to restore to that state:
      # mv /var/lib/etcd/member /var/lib/etcd/member.orig
      # cp -rP /var/lib/etcd/openshift-backup-post-3.0-20181214022846/member /var/lib/etcd/
      
    • When starting the first etcd member for the first time, we need to pass the --force-new-cluster argument to the process. This will override the cluster definition from the database files. To do so, the etcd.yaml file has to be adjusted. Here is the important snippet (everything else should be kept as it is):
      spec:
        containers:
        - args:
          - '#!/bin/sh
      
            set -o allexport
      
            source /etc/etcd/etcd.conf
      
            exec etcd --force-new-cluster
      
            '
      
    • Once everything is ready to start the etcd process, move the altered etcd.yaml file back to the /etc/origin/node/pod directory. Within a few moments, the pod should start up and create a new cluster.
  3. Check the initial cluster state via:
    # etcdctl2 cluster-health
    member 67aa8b8cc701 is healthy: got healthy result from https://10.0.0.10:2379
    

    If something went wrong, you might want to check the logs via:

    # /usr/local/bin/master-logs etcd etcd
    
  4. Initially the first member still advertises a PeerURL pointing to ‘localhost’:
    # etcdctl2 member list
    67aa8b8cc701: name=master01.example.com peerURLs=http://localhost:2380 clientURLs=https://10.0.0.10:2379 isLeader=true
    

    This must be updated with the correct host URL pointing to itself:

    # etcdctl2 member update 67aa8b8cc701 https://10.0.0.10:2380
    Updated member with ID 67aa8b8cc701 in cluster
    

    Then it correctly shows:

    # etcdctl2 member list
    67aa8b8cc701: name=master01.example.com peerURLs=https://10.0.0.10:2380 clientURLs=https://10.0.0.10:2379 isLeader=true
    
  5. This configuration was automatically saved in the database. So the --force-new-cluster argument can be removed again. Edit the etcd.yaml in-place to restore the original configuration. After doing so, restart the etcd process with:
    # /usr/local/bin/master-restart etcd
    

    If it comes up again and shows healthy, we can continue and add the other two cluster members.

  6. The following steps to add another cluster member obviously have to be done for both of the other etcd hosts:
    1. Add the new host to the cluster by executing the following command on the first etcd host:
      # etcdctl2 member add master02.example.com https://10.0.0.11:2380
      Added member name master02.example.com with ID a6b2e8d0d392083b to cluster
      
      ETCD_NAME="master02.example.com"
      ETCD_INITIAL_CLUSTER="master01.example.com=https://10.0.0.10:2380,master02.example.com=https://10.0.0.11:2380"
      ETCD_INITIAL_CLUSTER_STATE="existing"
      

      The new member will then be displayed as ‘unstarted’ in the member list.

    2. Prepare the /etc/etcd/etcd.conf file on the new etcd host by defining the variables as shown in the output of the etcdctl2 member add command above. The ETCD_INITIAL_CLUSTER value will automatically be extended with each new member added to the cluster.
    3. Delete the old database on the new etcd host. It will automatically be synced from the other cluster members once the new node has joined:
      # mv /var/lib/etcd/member /var/lib/etcd/member.orig
      
    4. Enable the etcd pod by moving the etcd.yaml back to /etc/origin/node/pod. Within a few minutes the etcd process should be started and eventually join the etcd cluster.
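
Once all three members have joined, the cluster health check on the first host should eventually report every member as healthy, along the lines of:

# etcdctl2 cluster-health
member 67aa8b8cc701 is healthy: got healthy result from https://10.0.0.10:2379
[...]
cluster is healthy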

Once the etcd cluster was restored, the oc command was finally working again and I could also check the state of the etcd pods via the OpenShift client:

$ oc get pods -n kube-system | grep etcd
master-etcd-master01.example.com          1/1       Running   5          1h
master-etcd-master02.example.com          1/1       Running   0          47m
master-etcd-master03.example.com          1/1       Running   0          2m

During the entire time the etcd cluster was down, the OpenShift cluster continued running. The registry, routers and applications such as my Gitea setup were online all the time, and even the CNS cluster running on the master hosts handled the debugging and restart session with bravery. Fortunately I had a very static setup during that time, so no deployments or replica-count enforcement needed to be executed, which would have been impossible anyway. Still, I feel it’s a positive sign of the resiliency the platform has gained over time.

Finishing the Control Plane Upgrade

After a longer detour, I was finally back at the point where I could start another run of the control-plane upgrade playbook. Remember, when the playbook aborted before, it did so after upgrading the control-plane services on the first master node; there were still two to go. So I started the playbook once again.

By now I have a really good feeling about the state of the playbook in this release. As you can see above, it failed on me many times in all different stages of the update, but it always had a good reason and it was always able to pick up where it left off. My experience with initial upgrade attempts of earlier OpenShift releases was unfortunately not always that good. For example, I once had to restore a master host from a snapshot because the playbook failed to correctly detect the upgrade state in the second run, after it had aborted the first run due to a syntax error in a post-upgrade task.

This time the playbook finished successfully and my control plane was finally at release 3.10:

# /usr/local/bin/oc version
oc v3.10.0+c99b16a-90
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://openshift.example.com:8443
openshift v3.10.0+c99b16a-90
kubernetes v1.10.0+b81c8f8

Running the Node Upgrade Playbook

After the control plane was done, I had to upgrade the infrastructure and compute nodes. A separate playbook is available at playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml. Initially I only wanted to run it on a single node to make sure everything works as expected. This can be done by passing the -e openshift_upgrade_nodes_label=kubernetes.io/hostname=node03.example.com argument to the playbook execution command, where the given host name is obviously the node that should be upgraded. The playbook completed without error on the first attempt, so I continued with the other nodes.
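
Put together, the invocation for a single node looked roughly like this (the inventory path is specific to your environment):

# ansible-playbook -i /etc/ansible/hosts \
    playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_label=kubernetes.io/hostname=node03.example.com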

One fact is super important when upgrading the nodes to OpenShift 3.10. The /etc/origin/node/node-config.yaml is completely regenerated based on the settings in the corresponding node group (and/or the defaults) and so any prior adjustment not reflected in the inventory is lost. Therefore make sure that you perfectly understand the Node Group concept and how it affects your node layout and configuration.

To give you an example of how to customize the upgrade behavior on the infrastructure nodes, I added the following arguments to the playbook execution: -e openshift_upgrade_nodes_label=region=infra -e openshift_upgrade_nodes_serial=50%.

Fixing the Infrastructure Node Selector

It confused me that the NodeSelectors of the infrastructure components such as the registry and the routers were not updated to the new defaults. In the inventory I had explicitly defined the new node selector:

openshift_hosted_registry_selector='node-role.kubernetes.io/infra=true'

But when checking the DeploymentConfig of the registry, I could still find the old NodeSelector:

$ oc get dc docker-registry -n default -o json | jq .spec.template.spec.nodeSelector
{
  "region": "infra"
}

So I manually triggered an update of the NodeSelector property in all the DeploymentConfigs using it. E.g.:

$ oc patch dc docker-registry -n default --type json --patch '[{"op":"replace","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/infra":"true"}}]'
deploymentconfig "docker-registry" patched

NodeSelectors can also be set in DaemonSets, as annotations in projects or even globally via master-config.yaml. Therefore make sure to update them all, when required, before removing any labels from the nodes.
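
A quick way to spot a project-level default node selector is to grep the namespace definition; the value shown here is just an example of an old-style selector:

$ oc get namespace default -o yaml | grep node-selector
    openshift.io/node-selector: region=infra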

After checking that all the pods are up and running again, I was finally able to remove the old infrastructure labels from the nodes:

$ oc label node node01.example.com region- zone-

Summarizing

This was not my first OpenShift update ever, but it was my first update from 3.9 to 3.10. That obviously means I made some mistakes and had wrong assumptions, from which I learned a lot. I hope I could share some insights and useful hints for those of you who haven’t done this before. If nothing else, it will at least help me run this update much more smoothly on another cluster in the future.

At the end some advice for those of you who also need to do such an upgrade:

  • You need a test cluster where you can practice such updates. It doesn’t need to be big, but the Ansible inventory variables should be structurally as similar as possible to those of the production cluster. As you saw above, a lot of errors happened simply due to wrong inventory variables. Ideally the test cluster should carry some workload so that you experience how the applications behave during the update and so that you can test whether everything still works after an upgrade.
  • Take your Ansible inventory seriously. Everything in your configuration that can fit into the Ansible inventory must be defined there and must be maintained there. It can cost you a lot of debugging time, or even result in application downtime during an upgrade, if you manually updated the cluster configuration without adjusting the inventory accordingly. Even when it sometimes feels like more work than benefit, it’s always worth it.
  • Preparation is key. Carefully read through the available upstream documentation. Most likely you also have some internal documentation where your infrastructure specifics are written down. Run the upgrade on a test cluster before you do it in production. If it doesn’t work on the first attempt, update your notes and try again. Try to gain as much experience as possible on the test infrastructure so that you already know what to do if something goes wrong in production.
  • Plan a lot of time. Doing such an upgrade is a lot of work! Give yourself enough time for proper preparation, and the actual upgrade window itself should also leave you enough time to fix issues when they arise. Plan in the scale of hours or, better, days. Ansible is slow; if you have to restart the playbook because of an error after 15 minutes, this will eat up your time fast.

Thanks for reading. As always, I’d welcome feedback or criticism in the comments.

Dec 12 2018
 

The first thing someone needs when operating or playing around with OKD (better known as OpenShift) is a git version control service. Personally I’m a fan of Gitea, and that’s why I’d like to show a way to run Gitea in an OpenShift environment. Gitea upstream already provides a great container image which I’m going to use. But as some of you may have already experienced, running an image on docker and running it in OpenShift are two different pairs of shoes. The fact that the Gitea image runs an integrated SSH server means that it doesn’t simply match the widely discussed Web application pattern. Therefore I’ll try to explain some of the difficulties that one might encounter when moving such an application to OpenShift.

My environment consists of a multi-node OpenShift cluster. Obviously Gitea should be highly available, so that if a node goes down one would still be able to access the git repositories. One pod is no pod, so Gitea must be deployed with a replica count of at least two. Accessing the pods over HTTP is already solved by the OpenShift default infrastructure via redundant HAProxy routers. I’ll probably explain how to achieve a redundant router setup in one of my next blog posts, but this time I’d like to focus on the Gitea SSH access via the NodePort service feature. The following graphic shows a communication overview of such a setup:

NodePort Service

In OpenShift the Kubernetes Service resource is responsible for directing the traffic (TCP, UDP or SCTP) to the individual application pods. It maps the service name (e.g. ‘gitea’) via SkyDNS to a so-called ClusterIP. This is a virtual IP address that is not assigned to any host or container network interface but is still used as packet destination within the cluster SDN (software-defined network). After receiving a packet addressed to this ClusterIP, the Linux kernel of an OpenShift node rewrites the packet destination to the IP address of an actual application pod and thereby acts as a virtual network load-balancer.

In our example there is a ‘gitea’ service managing the HTTP traffic to port 3000 of the Gitea pod and a ‘gitea-ssh’ service managing the SSH traffic to port 22 of the Gitea pod. Because we can’t use the OpenShift Router as ingress for SSH, the ‘gitea-ssh’ service defines a special type called NodePort. This means that a packet sent to this port (e.g. 30022) on any OpenShift node will be received by the corresponding service and therefore forwarded to a Gitea pod. This is the simplest way to direct non-HTTP traffic from outside of OpenShift to an application pod and can also be used for e.g. database protocols or Java RMI. Here is the corresponding resource definition for the Gitea SSH service:

apiVersion: v1
kind: Service
metadata:
  name: gitea-ssh
spec:
  ports:
    - name: ssh
      nodePort: 30022
      port: 22
      protocol: TCP
      targetPort: 22
  selector:
    app: gitea
    deploymentconfig: gitea
  sessionAffinity: ClientIP
  type: NodePort

The sessionAffinity: ClientIP setting defines “sticky sessions” to avoid distributing multiple requests of the same client to different pods. I haven’t tested yet how SSH would behave without it, but I think it generally makes sense. In a running setup the service additionally shows the discussed ClusterIP, which is statically assigned, and the endpoints (pod IPs), which may change when pods are started and stopped:

$ oc describe service gitea-ssh
Name:                     gitea-ssh
Namespace:                vcs
Labels:                   app=gitea
                          template=gitea-persistent-template
Annotations:              
Selector:                 app=gitea,deploymentconfig=gitea
Type:                     NodePort
IP:                       172.30.8.9
Port:                     ssh  22/TCP
TargetPort:               22/TCP
NodePort:                 ssh  30022/TCP
Endpoints:                10.129.2.44:22,10.130.3.71:22
Session Affinity:         ClientIP
External Traffic Policy:  Cluster
Events:                   

From within the cluster, the Gitea SSH service can be reached via service name (extended with OpenShift project name, here ‘vcs’) DNS entry or directly via ClusterIP:

$ host gitea-ssh.vcs.svc
gitea-ssh.vcs.svc has address 172.30.8.9

$ ssh git@gitea-ssh.vcs.svc
PTY allocation request failed on channel 0
Hi there, You've successfully authenticated, but Gitea does not provide shell access.
If this is unexpected, please log in with password and setup Gitea under another user.
Connection to gitea-ssh.vcs.svc closed.

From outside the cluster, the Gitea SSH service can be reached via the NodePort on any OpenShift node. To avoid a dependency on a single node in the git repository URL, you can define multiple DNS entries with the same name (e.g. services.example.com) pointing to all OpenShift node addresses:

$ host services.example.com
services.example.com has address 10.0.0.2
services.example.com has address 10.0.0.3

$ ssh -p 30022 git@services.example.com
PTY allocation request failed on channel 0
Hi there, You've successfully authenticated, but Gitea does not provide shell access.
If this is unexpected, please log in with password and setup Gitea under another user.
Connection to services.example.com closed.
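
In a BIND-style zone file for example.com, such round-robin entries might look roughly like this (illustrative):

services    IN    A    10.0.0.2
services    IN    A    10.0.0.3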

Issues with NodePort

Port Assignment
The NodePort mechanism allocates the corresponding port on each OpenShift node. To avoid a clash with node services such as the DNS resolver or the OpenShift node service, the port range is restricted. It can be configured in /etc/origin/master/master-config.yml with the option servicesNodePortRange and defaults to 30000-32767. Obviously multiple applications in the same cluster cannot use the same port, and traffic to the chosen port must be allowed by the host firewall on the OpenShift nodes.
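
If I remember correctly, the relevant snippet in master-config.yml sits below the kubernetesMasterConfig key and looks roughly like this (default value shown):

kubernetesMasterConfig:
  servicesNodePortRange: "30000-32767"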

Node Groups
NodePorts are always allocated on every OpenShift cluster host running the node service, which also includes the OpenShift master servers. OpenShift doesn’t provide a way to restrict the involved hosts to a subset. In my example I chose to restrict the hosts receiving traffic by only adding a limited number of nodes to the service DNS entry and blocking access on the others via iptables. If you don’t use an application load-balancer in front of the OpenShift routers, you could also re-use the wildcard DNS entry defined for the HTTP traffic. The NodePort traffic would then follow the same path as the normal Web traffic.
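
On the nodes that should not receive any SSH traffic, a simple (non-persistent, illustrative) iptables rule could look like this:

# iptables -I INPUT -p tcp --dport 30022 -j REJECT --reject-with tcp-reset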

Node Failure
If one OpenShift node goes down, a client trying to access the Gitea SSH service might still try to connect to the unreachable host. Fortunately, the default SSH implementation used by the git command-line client is quite tolerant and simply retries with another IP address. When testing this case I therefore didn’t experience a major issue apart from a slight connection delay. The failure behavior might be different for other git clients or other application protocols altogether; it’s definitely not ideal, but it is simple.

One way to improve this failure scenario would be to add a real TCP load-balancer in front of the NodePort but then there would be another piece of infrastructure that must be managed synchronously with the OpenShift infrastructure and which might be a new single point of failure.

Container with root Permissions

When starting the upstream Gitea container image in OpenShift, you will likely encounter a startup failure with the following error message in the log:

s6-svscan: fatal: unable to mkfifo .s6-svscan/control: Permission denied

The Gitea image, like many other docker images not optimized for the pod concept introduced by Kubernetes, doesn’t start a single application process but a supervisor process (in this case s6) which then spawns multiple application processes defined in /etc/s6. To do so it wants to create a FIFO in the /etc/s6/.s6-svscan directory, which is only writable by the root user. This fails because by default processes are started with a random unprivileged account.

Security Context Constraints

Unlike docker, OpenShift controls the actions a pod can perform and what it can access with a tight set of rules called Security Context Constraints (SCC). By default the ‘default’ ServiceAccount used to run the application pods is a member of the ‘restricted’ SCC, which among other things enforces the previously mentioned randomized UID. As Gitea won’t work like this, a less restrictive SCC must be used. After reading the documentation we find that there is already a predefined SCC which grants us just enough permissions to start our container process as the root user without weakening too many other restrictions. The SCC we are heading for is ‘anyuid’. Below I’ll present different approaches for how this SCC can be assigned to the Gitea deployment:

  • The OpenShift cluster administrator can add the ‘default’ ServiceAccount of a project to the list of users in the SCC definition. This doesn’t need any special configuration in the DeploymentConfig of the application but also grants every deployment in the corresponding project ‘anyuid’ privileges. In our setup this would be done with the following command, when assuming Gitea should be deployed in the ‘vcs’ project:
    $ oc adm policy add-scc-to-user anyuid system:serviceaccount:vcs:default
    

    I’m not in favor of this approach as it “hides” the additional permissions in the default ServiceAccount and is prone to breaking the principle of least privilege by assigning the SCC to potentially more applications than necessary.

  • Another approach is using a dedicated ServiceAccount for the Gitea deployment and only adding that to the ‘anyuid’ SCC. The project owner can create a ServiceAccount with:
    $ oc create serviceaccount gitea
    

    The cluster administrator then has to add it to the SCC as before:

    $ oc adm policy add-scc-to-user anyuid system:serviceaccount:vcs:gitea
    

    In the DeploymentConfig the ServiceAccount must be referenced with an entry under the spec.template.spec key:

    $ oc patch dc/gitea --patch '{"spec":{"template":{"spec":{"serviceAccountName": "gitea"}}}}'
    

    The dedicated ServiceAccount used in this approach already hints that there might be special privileges connected to it and is in my opinion easier to audit. The disadvantage however is the more complex configuration.

  • Instead of adding every user account individually to the SCC, a dedicated user group could be created with the SCC assigned to that group. Individual ServiceAccounts would then be added to the group and thereby inherit the SCC. This would follow the common practice in identity management of assigning permissions to users via privilege groups. Additionally, a group management role could be created which would permit dedicated users without the ‘cluster-admin’ privilege to manage the group membership.
  • Unfortunately I couldn’t figure out a true self-service model where a responsible project admin could expand the necessary permissions without the possibility to interfere with other projects. In the documentation of OpenShift (<=3.7) I found a hint that it is/was(?) possible to extend the default ServiceAccounts available after creating a new project by adding the account name (e.g. ‘anyuid-service-account’) to the serviceAccountConfig.managedNames list in /etc/origin/master/master-config.yml. While this configuration is still present in newer master-config.yml files, the documentation is gone and I also didn’t find a way to automatically add a user created like this to the ‘anyuid’ SCC. Maybe it’s possible by somehow modifying the project template. If you have done this before or at least have an idea how this could be done, please drop me a line.

In the end, the way the ‘anyuid’ SCC is assigned to the Gitea application doesn’t matter much, as long as the application pod is allowed to start the s6 supervisor process with root permissions.

Gitea Application Template

The way OpenShift administrators can provide an application setup ready for instantiation by OpenShift project owners is through Templates. Inspired by the My journey through Openshift blog post, I wanted to create my own Gitea template, fixing some issues found in the original template and extending it with the opinionated configuration presented above. You can download it from here.

The template is able to automatically set up Gitea with the exception of the ‘anyuid’ SCC configuration. It requires a persistent volume (PV) for storing the git repositories and some static configuration such as the SSH authorized_keys file. By default it will use a SQLite database backend which is also stored in the PV. Optionally you can also provide the connection string and credentials of a PostgreSQL or MariaDB backend, which can run on OpenShift or externally.

If you want the template to be available in the Service Catalog the YAML file has to be applied to the ‘openshift’ project by a cluster administrator:

$ oc create -f gitea-persistent-template.yaml -n openshift

Afterwards it can be instantiated by any project admin via Service Catalog Web-UI or from the command line via:

$ oc new-project vcs
$ oc new-app --template=gitea-persistent -p HTTP_DOMAIN=git.example.com -p SSH_DOMAIN=services.example.com

Alternatively, if no Service Catalog is available, or the template shouldn’t be loaded to OpenShift, the application can also be created directly from the YAML file via:

$ oc new-app -f gitea-persistent-template.yaml -p HTTP_DOMAIN=git.example.com -p SSH_DOMAIN=services.example.com

IMPORTANT: The template will configure Gitea to use a ServiceAccount named according to the parameter APPLICATION_NAME (defaults to ‘gitea’). It must be added to the ‘anyuid’ SCC as described above. E.g.:

$ oc adm policy add-scc-to-user anyuid system:serviceaccount:vcs:gitea

If you have some feedback regarding the template or troubles using it, please open a Github issue. Comments, corrections or general feedback to my article can be posted below. Thanks for reading.

Feb 15 2018
 

The recently disclosed Spectre and Meltdown CPU vulnerabilities are some of the most dramatic security issues in recent computer history. Fortunately, even six weeks after public disclosure, sophisticated attacks exploiting these vulnerabilities are not yet commonly observed. Fortunately, because the hardware and software vendors are still struggling to provide appropriate fixes.

If you happen to run a Linux system, an excellent tool for tracking your vulnerability as well as the already active mitigation strategies is the spectre-meltdown-checker script originally written and maintained by Stéphane Lesimple.

Within the last month I set myself the goal of bringing this script to Fedora and EPEL so it can be easily consumed by Fedora, CentOS and RHEL users. Today the spectre-meltdown-checker package was finally added to the EPEL repositories, after having already been available in the Fedora stable repositories for a week.

On Fedora, all you need to do is:

dnf install spectre-meltdown-checker

After enabling the EPEL repository on CentOS this would be:

yum install spectre-meltdown-checker

The script, which should be run by the root user, will report:

    • If your processor is affected by the different variants of the Spectre and Meltdown vulnerabilities.
    • If your processor microcode tries to mitigate the Spectre vulnerability or if you run a microcode which
      is known to cause stability issues.
    • If your kernel implements the currently known mitigation strategies and if it was compiled with a compiler which is hardening it even more.
    • And eventually if you’re (still) affected by some of the vulnerability variants.
On my laptop this currently looks like this (note that I’m not running the latest stable Fedora kernel yet):

    # spectre-meltdown-checker                                                                                                                                
    Spectre and Meltdown mitigation detection tool v0.33                                                                                                                      
                                                                                                                                                                              
    Checking for vulnerabilities on current system                                       
    Kernel is Linux 4.14.14-200.fc26.x86_64 #1 SMP Fri Jan 19 13:27:06 UTC 2018 x86_64   
    CPU is Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz                                      
                                                                                                                                                                              
    Hardware check                            
    * Hardware support (CPU microcode) for mitigation techniques                         
      * Indirect Branch Restricted Speculation (IBRS)                                    
        * SPEC_CTRL MSR is available:  YES    
        * CPU indicates IBRS capability:  YES  (SPEC_CTRL feature bit)                   
      * Indirect Branch Prediction Barrier (IBPB)                                        
        * PRED_CMD MSR is available:  YES     
        * CPU indicates IBPB capability:  YES  (SPEC_CTRL feature bit)                   
      * Single Thread Indirect Branch Predictors (STIBP)                                                                                                                      
        * SPEC_CTRL MSR is available:  YES    
        * CPU indicates STIBP capability:  YES                                           
      * Enhanced IBRS (IBRS_ALL)              
        * CPU indicates ARCH_CAPABILITIES MSR availability:  NO                          
        * ARCH_CAPABILITIES MSR advertises IBRS_ALL capability:  NO                                                                                                           
      * CPU explicitly indicates not being vulnerable to Meltdown (RDCL_NO):  UNKNOWN    
      * CPU microcode is known to cause stability problems:  YES  (Intel CPU Family 6 Model 61 Stepping 4 with microcode 0x28)                                                
                                              
    The microcode your CPU is running on is known to cause instability problems,         
    such as intempestive reboots or random crashes.                                      
    You are advised to either revert to a previous microcode version (that might not have
    the mitigations for Spectre), or upgrade to a newer one if available.                
    
    * CPU vulnerability to the three speculative execution attacks variants
      * Vulnerable to Variant 1:  YES 
      * Vulnerable to Variant 2:  YES 
      * Vulnerable to Variant 3:  YES 
    
    CVE-2017-5753 [bounds check bypass] aka 'Spectre Variant 1'
    * Mitigated according to the /sys interface:  NO  (kernel confirms your system is vulnerable)
    > STATUS:  VULNERABLE  (Vulnerable)
    
    CVE-2017-5715 [branch target injection] aka 'Spectre Variant 2'
    * Mitigated according to the /sys interface:  YES  (kernel confirms that the mitigation is active)
    * Mitigation 1
      * Kernel is compiled with IBRS/IBPB support:  NO 
      * Currently enabled features
        * IBRS enabled for Kernel space:  NO 
        * IBRS enabled for User space:  NO 
        * IBPB enabled:  NO 
    * Mitigation 2
      * Kernel compiled with retpoline option:  YES 
      * Kernel compiled with a retpoline-aware compiler:  YES  (kernel reports full retpoline compilation)
      * Retpoline enabled:  YES 
    > STATUS:  NOT VULNERABLE  (Mitigation: Full generic retpoline)
    
    CVE-2017-5754 [rogue data cache load] aka 'Meltdown' aka 'Variant 3'
    * Mitigated according to the /sys interface:  YES  (kernel confirms that the mitigation is active)
    * Kernel supports Page Table Isolation (PTI):  YES 
    * PTI enabled and active:  YES 
    * Running as a Xen PV DomU:  NO 
    > STATUS:  NOT VULNERABLE  (Mitigation: PTI)
    
    A false sense of security is worse than no security at all, see --disclaimer
    

The script also supports a mode which outputs the result as JSON, so that it can easily be parsed by any compliance or monitoring tool:

    # spectre-meltdown-checker --batch json 2>/dev/null | jq
    [
      {
        "NAME": "SPECTRE VARIANT 1",
        "CVE": "CVE-2017-5753",
        "VULNERABLE": true,
        "INFOS": "Vulnerable"
      },
      {
        "NAME": "SPECTRE VARIANT 2",
        "CVE": "CVE-2017-5715",
        "VULNERABLE": false,
        "INFOS": "Mitigation: Full generic retpoline"
      },
      {
        "NAME": "MELTDOWN",
        "CVE": "CVE-2017-5754",
        "VULNERABLE": false,
        "INFOS": "Mitigation: PTI"
      }
    ]
    

For those who are (still) using a Nagios-compatible monitoring system, spectre-meltdown-checker can also be run as an NRPE check:

    # spectre-meltdown-checker --batch nrpe 2>/dev/null ; echo $?
    Vulnerable: CVE-2017-5753
    2
    

I just mailed Stéphane and he will soon release version 0.35 with many new features and fixes. As soon as it is released I’ll submit a package update, so that you’re always up to date with the latest developments.

    Dec 202016
     

    I have been using and following the development of the LXC (Linux Container) project for a long time. I feel that it unfortunately never had the success it deserved, and in recent years new technologies such as Docker and rkt pretty much redefined the common understanding of a container according to their own terms. Nonetheless LXC still claims its niche as a full Linux operating system container solution, especially suited for persistent pet containers, an area where the new players on the market are still figuring out how to implement this properly according to their concept. LXC development hasn’t stalled, quite the contrary: the API was extended with an HTTP REST interface (served via the Linux Container Daemon, LXD), support for container live-migration was implemented, container image management was added and much more. This means that there are a lot of reasons why someone, including me, would want to use Linux containers and LXD.

    Enable LXD COPR repository
    LXD is not officially packaged for Fedora. Therefore I spent the last few weeks creating some community packages via their COPR build system and repository service. Similar to the better known Ubuntu PPA (Personal Package Archive) system, COPR provides an RPM package repository which can easily be consumed by Fedora users. To use the LXD repository, all you need to do is enable it via dnf:

    # dnf copr enable ganto/lxd
    

    Please note that COPR packages are not reviewed by the Fedora package maintainers, therefore you should only install packages from authors you trust. For this reason I also provide a Github repository with the RPM spec files, so that everyone can build the RPMs on their own if they feel uncomfortable using the pre-built RPMs from the repository.

    Install and start LXD
    LXD is split into multiple packages. The important ones are lxd, the Linux Container Daemon, and lxd-client, which provides the LXD client binary called lxc. Install them with:

    # dnf install lxd lxd-client
    

    Unfortunately I didn’t have time to figure out the correct SELinux labels for LXD yet, so you need to disable SELinux before starting the daemon. LXD supports user namespaces to map the root user in a container to an unprivileged user ID on the container host. For this you need to assign a UID and GID range to root on the host:

    # echo "root:1000000:65536" >> /etc/subuid
    # echo "root:1000000:65536" >> /etc/subgid
    

    If you don’t do this, user namespaces won’t be used, which is indicated by a message such as:

    lvl=warn msg="Error reading idmap" err="User \"root\" has no subuids."
    lvl=warn msg="Only privileged containers will be able to run"
    

    Eventually start LXD with:

    # systemctl start lxd.service
    

    LXD configuration
    LXD doesn’t have a configuration file. Configuration properties must be set and retrieved via client commands. Here you can find a list of all supported configuration properties. Most tutorials suggest initially running lxd init, which generates a basic configuration. However, only a limited set of configuration options is available via this command, therefore I prefer to set the properties via the LXD client. A normal user account can be used to manage LXD via the client when it’s a member of the lxd POSIX group:

    # usermod --append --groups lxd myuser
    

    By default LXD will store its images and containers in directories under /var/lib/lxd. Alternative storage back-ends such as LVM, Btrfs or ZFS are available. Here I will show an example of how to use LVM. Similar to the recommended Docker setup on Fedora it will use LVM thin volumes to store images and containers. First create an LVM thin pool. For this we still need some space available on the default volume group. Alternatively you can use a second disk with a dedicated volume group. Replace vg00 with the volume group name you want to use:

    # lvcreate --size 20G --type thin-pool --name lxd-pool vg00
    

    Now we set this thin pool as storage back-end in LXD:

    $ lxc config set storage.lvm_vg_name vg00
    $ lxc config set storage.lvm_thinpool_name lxd-pool
    

    For each downloaded image LXD will create a thin volume storing the image. When a new container is instantiated, a writable snapshot is created from it, from which you can again create an image or make further snapshots for fast roll-back. By default the container file system will be ext4. If you prefer XFS, it can be set with the following command:

    $ lxc config set storage.lvm_fstype xfs
    

    Various options are also available for networking. If you ran lxd init, you may have already created a lxdbr0 network bridge. Otherwise I will show you how to manually create one, in case you want a dedicated container bridge, or how to attach LXD to an already existing bridge which is configured through an external DHCP server.

    To create a dedicated network bridge where the traffic will be NAT‘ed to the outside, run:

    $ lxc network create lxdbr0
    

    This will create a bridge device with the given name and also start-up a dedicated instance of dnsmasq which will act as DNS and DHCP server for the container network.

    A big advantage of LXD in comparison to plain LXC is a feature called container profiles. There you can define settings which should be applied to a new container instance. In our case, we want containers to use the network bridge created before, or any other network bridge which was created independently. For this, the bridge is added to the “default” profile which is applied by default when creating a new container:

    $ lxc network attach-profile lxdbr0 default eth0
    

    Here eth0 is the network device name that will be used inside the container. We could also add multiple network bridges or create multiple profiles (lxc profile create newprofile) with different network settings, as sketched below.
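
    As a small sketch of how such an additional profile could be used (the container name is just an example):

    $ lxc profile create newprofile
    $ lxc network attach-profile lxdbr0 newprofile eth0
    $ lxc launch images:fedora/24 another-container --profile newprofile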

    Create a container
    Finally we have the most important pieces together to launch a container. A container is always instantiated from an image. The LXC project provides an image repository with a large number of prebuilt container images, pre-configured under the remote name images:. The images are regular LXC containers created via the upstream lxc-create script using the various distribution templates. To list the available images run:

    $ lxc image list images:
    

    If you found an image you want to run, it can be started as follows. Of course, in my example I will use a Fedora 24 container (unfortunately there are no Fedora 25 containers available yet, but I’m also working on that):

    $ lxc launch images:fedora/24 my-fedora-container
    

    With the following command you can open an interactive shell inside the container:

    $ lxc exec my-fedora-container /bin/bash
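
    Other day-to-day operations follow the same pattern, for example snapshotting, stopping and deleting the container again (the snapshot name is arbitrary):

    $ lxc snapshot my-fedora-container clean-install
    $ lxc restore my-fedora-container clean-install
    $ lxc stop my-fedora-container
    $ lxc delete my-fedora-container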
    

    I hope this short guide made you curious to try LXD on Fedora. I’d be glad to hear some feedback via comments or email if you find this guide or my COPR repository useful, or if you have some corrections or found some issues.

    Further reading
    If you want to know more about how to use the individual features of LXD, I can recommend the how-to series of Stéphane Graber, one of the core developers of LXC/LXD.

    Sep 232016
     

    Currently, I’m working on automating the setup of an authoritative DNS server, namely gdnsd. There are many nice features in gdnsd, but what might be interesting for you is that it requires the zone data to be in the regular RFC 1035 compliant format. This is also true for bind, probably the most widely used DNS server, so the approach explained here could also be used for bind. Again I wanted to use Ansible as the automation framework, not only to set up and configure the service, but also to generate the DNS zone file. One reason for this is that gdnsd doesn’t support zone transfers, so Ansible should be used as the synchronization mechanism; another is that in my opinion the JSON-based inventory format is a simple, generic but very powerful data interface. Especially when considering the dynamic inventory feature of Ansible, one is completely free where and how to actually store the configuration data.

    There are already a number of Ansible bind roles available, however they mostly use a very simple approach when it comes to generating the zone file and its serial. When generating zone files with an automation tool, the trickiest part is the handling of the serial number, which has to be increased on every zone update. I’d like to explain the solution I implemented to address this challenge.

    Zone data generation must be idempotent
    One strength of Ansible is that it can be run over and over again and only changes something in the system if the current state is not as desired. In my context this means that the zone file only needs to be updated if the zone data from the inventory has changed, and consequently the serial number only has to be updated in that case. But how do we know if the data has changed?

    Using the powerful Jinja2 templating engine, I define a dictionary and assign to it every value which will later go into the zone file. Then I create a checksum over the dictionary content and save it as a comment in the zone file. If the checksum changed, the serial has to be updated. Otherwise the zone file keeps the old serial, as nothing changed. In practice this looks like this:

    1. Read the hash and serial which are saved as comment in the existing zone file and register a temporary variable. It will be empty if the zone file doesn’t exist yet:
      - name: Read zone hash and serial
        shell: 'grep "^; Hash:" /etc/gdnsd/zones/example.com || true'
        register: gdnsd__register_hash_and_serial
        [...]
      
    2. Define a task which will update the zone file:
      - name: Generate forward zones
        template:
          src: 'etc/gdnsd/zones/forward_zone.j2'
          dest: '/etc/gdnsd/zones/example.com'
          [...]
      
    3. In the template, create a dictionary holding the zone data:
      {% set _zone_data = {} %}
      {% set _ = _zone_data.update({'ttl': item.ttl}) %}
      {% set _ = _zone_data.update({'domain': 'example.com'}) %}
      [...]
      
    4. Create an intermediate variable _hash_and_serial holding the hash and serial read from the zone file before:
      {% set _hash_and_serial = gdnsd__register_hash_and_serial.stdout.split(' ')[2:] %}
      
    5. Create a hash from the final _zone_data dictionary, compare it with the hash (first element) in _hash_and_serial. If the hashes are equal set the serial as read before (second element) in _hash_and_serial. Otherwise set a new serial which was previously saved in gdnsd__fact_zone_serial (see following section):
      {% set _zone = {'hash': _zone_data | string | hash('md5')} %}
      {% if _hash_and_serial and _hash_and_serial[0] == _zone['hash'] %}
      {%   set _ = _zone.update({'serial': _hash_and_serial[1]}) %}
      {% else %}
      {%   set _ = _zone.update({'serial': gdnsd__fact_zone_serial}) %}
      {% endif %}
      
    6. Save the final hash and serial as a comment in the zone file:
      ; Hash: {{ _zone['hash'] }} {{ _zone['serial'] }}
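
    To sanity-check a rendered zone file after a template run, a standard zone checker can be used, for example named-checkzone from BIND (assuming it is installed; the zone name and path match the example above):

      # named-checkzone example.com /etc/gdnsd/zones/example.com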
      

    Identical zone serial on distributed servers
    I didn’t explain yet how gdnsd__fact_zone_serial is defined. Initially, I simply assigned ansible_date_time.epoch, which corresponds to the Unix time, to the serial. This is the simplest way to make sure the serial is numerical and each zone update results in an increased value. However, in the introduction I also mentioned the issue of distributing the zone files between a set of DNS servers. Obviously, if they have the same zone data, they must also have the same serial.

    To make sure multiple servers are using the same serial for a zone update, the serial is not computed individually in each template task execution, but once per playbook run. In Ansible, one can specify that a task must only run once, even if the playbook is executed on multiple servers. Therefore I defined such a task to store the Unix time in the temporary fact gdnsd__fact_zone_serial which is used in the zone template on all servers:

    - name: Generate serial
      set_fact:
        gdnsd__fact_zone_serial: '{{ ansible_date_time.epoch }}'
      run_once: True
    

    This approach is still not perfect. It won’t compare the two generated zone files between a set of servers, so you have to make sure that the zone data in the inventory is the same for all servers. Also, if you update the servers individually, the serial is generated twice and the serials therefore differ, even when the zone data is identical. At the moment I can’t see any elegant approach to solve those issues. If you have some ideas, please let me know…

    The example code listed above is a simplified version of my real code. If you are interested in the entire role, have a look at github.com: ganto/ansible-gdnsd. I hope this could give you some useful examples for using some of the more advanced Ansible features in a real-world scenario.

    Sep 052016
     

    Most of my readers must have heard about the “Let’s encrypt” public certificate authority (CA) by now. For those who haven’t: About two years ago, the Internet Security Research Group (ISRG), a public benefit group supported by the Electronic Frontier Foundation (EFF), the Mozilla Foundation, Cisco, Akamai, the Linux Foundation and many more, started the challenge to create a fully trusted public key infrastructure which can be used for free by everyone. Until then, the big commercial certificate authorities such as Comodo, Symantec, GlobalSign or GoDaddy dominated the market for SSL certificates, which prevented a wide use of trusted encryption. The major goal of the ISRG is to increase the use of HTTPS for Web sites from less than 40 percent two years ago to 100 percent. One step to achieve this is to provide certificates to everyone for free, the other is to do this in a fully automated way. For this reason a new protocol called Automatic Certificate Management Environment (ACME) was designed and implemented. Fast forward to today: the “Let’s encrypt” CA has already issued more than five million certificates and the use of HTTPS increased to around 45 percent as of June 2016.

    acme-tiny is a small Python script which can be used to submit a certificate request to the “Let’s encrypt” CA. If you’re eligible to request a certificate for the domain, you instantly get the certificate back. As such a certificate is only valid for 90 days and the renewal process doesn’t need any user interaction, it’s a perfect candidate for a fully automated setup.

    For a while now I have preferred Ansible for all kinds of automation tasks. “Let’s encrypt” finally allows me to secure new services which I spontaneously decide to host on my server via sub-domains. To ease the initial setup and fully automate the renewal process, I wrote the Ansible role ganto.acme_tiny. It will run the following tasks:

    • Generate a new RSA key if none is found for this domain
    • Create a certificate signing request
    • Submit the certificate signing request with help of acme-tiny to the “Let’s encrypt” CA
    • Merge the received certificate with the issuing CA certificate to a certificate chain which then can be configured for various services
    • Restart the affected service to load the new certificate

    In practice, this would look like this:

    • Create a role variable file /etc/ansible/vars/mail.linuxmonk.ch.yml:
      acme_tiny__domain: [ 'mail.linuxmonk.ch', 'smtp.linuxmonk.ch' ]
      acme_tiny__cert_type: [ 'postfix', 'dovecot' ]
    • Make sure the involved service configurations load the certificate and key from the correct location (see ganto.acme_tiny: Service Configuration).
    • Run the playbook with the root user to do the initial setup:

      $ sudo ansible-playbook \
      -e @/etc/ansible/vars/mail.linuxmonk.ch.yml \
      /etc/ansible/playbooks/acme_tiny.yml

    That’s it. Both SMTP and IMAP are now secured with the help of a “Let’s encrypt” certificate. To set up automated certificate renewal, I only have to add the above command to a task scheduler such as cron, from where it will be executed as the unprivileged user acmetiny which was created during the initial playbook run. E.g. in /etc/cron.d/acme_tiny:

    PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
    
    @monthly acmetiny /usr/bin/ansible-playbook -e @/etc/ansible/vars/mail.linuxmonk.ch.yml /etc/ansible/playbooks/acme_tiny.yml >/dev/null
    

    If you became curious and want to have a setup like this yourself, checkout the extensive documentation about the Ansible role at Read the Docs: ganto.acme_tiny.

    This small project was also a good opportunity for me to integrate all the nice free software-as-a-service offerings the Internet provides for an (Ansible role) developer nowadays:

    • The code “project” is hosted and managed on Github.
    • Every release and pull request is tested via the Travis-CI continuous integration platform. It makes use of the rolespec Ansible role testing framework for which a small test suite has been written.
    • Ansible Galaxy is used as a repository for software distribution.
    • The documentation is written in a pimped version of Markdown, rendered via Sphinx and hosted on Read the Docs from where it can be accessed and downloaded in various formats.

    That’s convenient!

    Apr 212016
     

    I recently had the task of setting up and testing a new Linux Internet gateway host as a replacement for an existing router. The setup is classic, with some individual ports such as HTTP being forwarded via DNAT to some backend systems.

    Redundant router setup requiring policy routing

    The new router, which I will call router #2 from now on, should be tested and put into operation without downtime. Obviously this means that I had to run the two routers in parallel for a while. The backend systems, however, only know one default gateway. Accessing a service through a forwarded port on router #2 resulted in a timeout, as the backend system sent the replies to the wrong gateway, where they were dropped.

    Fortunately iptables and iproute2 came to my rescue. They enable you to implement policy routing on Linux. This means that the routing decision is not (only) made based on the destination address of a packet as in regular routing, but additional rules are evaluated. In my case: every connection opened through router #2 has to be answered via router #2.

    Using iptables/iproute2 for this task means: incoming packets with the source MAC address of router #2 are marked with the help of the iptables ‘mark’ extension. The iptables ‘connmark’ extension will then help to associate outgoing packets with the previously marked connection. Based on the mark of the outgoing packet a custom routing policy will set the default gateway to router #2. Easy, eh?

    Configuration
    Now I’ll show some commands demonstrating how this can be accomplished. They assume that the iptables rule set is still empty and are for demonstration purposes only. They will likely have to be adjusted slightly in a real configuration.

    First the routing policy will be setup:

    1. Define a custom routing table. There exist some default tables, so the custom entry shouldn’t overlap with those. For better understanding I will call it ‘router2’:

      # echo "200 router2" >> /etc/iproute2/rt_tables

    2. Add a rule to define which packets should look up the routing in the previously created table ‘router2’:
      # ip rule add fwmark 0x2 lookup router2

      This means that IP packets with the mark ‘2’ will be routed according to the table ‘router2’.

    3. Set the default gateway in the ‘router2’ table to the IP address of router #2 (e.g. 10.0.0.2):

      # ip route add default via 10.0.0.2 table router2

    4. To make sure the routing cache is rebuilt, it needs to be flushed after changes:
      # ip route flush cache

    Afterwards I had to make sure that the involved connections coming from router #2 are marked appropriately (above, the mark ‘2’ was used). The ‘mangle’ table is part of the Linux iptables packet filter and is meant for modifying network packets. This is the place where the packet markings will be set.

    1. The first iptables rule matches all packets belonging to a new connection coming from router #2 and sets the previously defined mark ‘2’:

      # iptables --table mangle --append INPUT \
      --protocol tcp --dport 80 \
      --match state --state NEW \
      --match mac --mac-source 52:54:00:c2:a5:43 \
      ! --source 10.0.0.0/24 \
      --jump MARK --set-mark 0x2

      The packets being marked are restricted to meet the following requirements:

      • being sent by the network adapter of router #2 (--mac-source)
      • don’t originate in the local network (! --source)
      • target destination port 80 (--dport: example for the HTTP port being forwarded by router #2)
      • belong to a new connection (--state NEW)

      Of course additional (or less) extensions can be used to filter the packets according to individual requirements.

    2. Next, the incoming packets are given to the ‘connmark’ extension which will do the connection tracking of the marked connections:

      # iptables --table mangle --append INPUT \
      --jump CONNMARK --save-mark

    3. The packets which can be associated with an existing connection are also marked accordingly:

      # iptables --table mangle --append INPUT \
      --match state --state ESTABLISHED,RELATED \
      --jump CONNMARK --restore-mark

    4. All the previous rules were required so that the outgoing packets can finally be marked too:

      # iptables --table mangle --append OUTPUT \
      --jump CONNMARK --restore-mark

    Debugging
    The following commands and iptables rules should help when setting up and/or debugging policy routing with marked packets:

    • List the policy of routing table ‘router2’:

      # ip route show table router2

    • List defined routing tables:

      # cat /etc/iproute2/rt_tables

    • Log marked incoming/outgoing packets to syslog:

      # iptables -A INPUT -m mark --mark 0x2 -j LOG
      # iptables -A OUTPUT -m mark --mark 0x2 -j LOG
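
    • Check whether connections are tracked with the mark ‘2’ and whether the mangle rules actually match packets (the first command assumes the conntrack-tools package is installed):

      # conntrack -L --mark 2
      # iptables -t mangle -L -v -n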

    Dec 032015
     

    SuperMicro server mainboards often include a dedicated Baseboard Management Controller (BMC) which offers out-of-band management of the server system. This is a dedicated embedded chip (Specifications) which allows you to power cycle the machine, monitor hardware variables, update firmware, access the operating system console and much more. If you are used to big brand server systems, then you might be more familiar with the term iLO, which is the HP Integrated Lights-Out, or IMM, which is the IBM Integrated Management Module. The basic functionality is always comparable.

    FreeIPMI Management Utility

    A nice fact is that most of these out-of-band management solutions support the Intelligent Platform Management Interface (IPMI) specification, which defines an interface for accessing the various functionalities from an operating system (OS) or the network (e.g. for monitoring or OS recovery). For Linux there is GNU FreeIPMI, a collection of powerful tools for accessing IPMI-compatible BMCs. It is available via the default package manager on all major Linux distributions.

    Linux Kernel Configuration
    As mentioned before, on Linux there are two different ways to access the BMC via IPMI. The first way is directly from the host operating system running on the board that contains the BMC. The connection is done via kernel modules. Make sure you have at least the ipmi_devintf and ipmi_si modules loaded. If your kernel doesn’t include these modules, you might need to add them to your kernel configuration. The corresponding settings can be found under:

    Device Drivers --->
       Character Devices --->
          <M> IPMI top-level message handler --->
              [ ]   Generate a panic event to all BMCs on a panic (NEW)
              <M>   Device interface for IPMI (NEW)
              <M>   IPMI System Interface handler (NEW)
              <M>   IPMI SMBus handler (SSIF) (NEW)
              <M>   IPMI Watchdog Timer (NEW)
              <M>   IPMI Poweroff (NEW)
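
    If the modules are built for your kernel but not loaded yet, loading them manually should be enough:

    # modprobe ipmi_devintf
    # modprobe ipmi_si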
    

    You can check if it’s working by invoking ipmi-locate, which should output something like this:

    # ipmi-locate
    Probing KCS device using DMIDECODE... done
    IPMI Version: 2.0
    IPMI locate driver: DMIDECODE
    IPMI interface: KCS
    BMC driver device: 
    BMC I/O base address: 0xCA2
    Register spacing: 1
    
    Probing SMIC device using DMIDECODE... FAILED
    
    Probing BT device using DMIDECODE... FAILED
    
    Probing SSIF device using DMIDECODE... FAILED
    
    Probing KCS device using SMBIOS... done
    IPMI Version: 2.0
    IPMI locate driver: SMBIOS
    IPMI interface: KCS
    BMC driver device: 
    BMC I/O base address: 0xCA2
    Register spacing: 1
    
    Probing SMIC device using SMBIOS... FAILED
    
    Probing BT device using SMBIOS... FAILED
    
    Probing SSIF device using SMBIOS... FAILED
    
    [...]
    

    There was an IPMI 2.0 compatible chip found, so we are good to go.

    Network Configuration
    The second way to connect via IPMI to the BMC is over the network. For this, there often is a dedicated network port on the mainboard that obviously needs to be connected to a network switch over which you can access it. By default the SuperMicro BMC is configured to get an IP address via DHCP. If the local IPMI connection is working, you can get the IP address by executing:

    # ipmi-config --checkout --section Lan_Conf
    
    #
    # Section Lan_Conf Comments 
    #
    # In the Lan_Conf section, typical networking configuration is setup. Most users 
    # will choose to set "Static" for the "IP_Address_Source" and set the 
    # appropriate "IP_Address", "MAC_Address", "Subnet_Mask", etc. for the machine. 
    #
    Section Lan_Conf
            ## Possible values: Unspecified/Static/Use_DHCP/Use_BIOS/Use_Others
            IP_Address_Source                             Use_DHCP
            ## Give valid IP address
            IP_Address                                    192.168.1.15
            ## Give valid MAC address
            MAC_Address                                   0C:C4:7A:73:18:4F
            ## Give valid Subnet Mask
            Subnet_Mask                                   255.255.255.0
            ## Give valid IP address
            Default_Gateway_IP_Address                    192.168.1.1
            ## Give valid MAC address
            Default_Gateway_MAC_Address                   00:0D:B9:22:15:A2
            ## Give valid IP address
            Backup_Gateway_IP_Address                     0.0.0.0
            ## Give valid MAC address
            Backup_Gateway_MAC_Address                    00:00:00:00:00:00
    EndSection
    

    Otherwise you might need to search the DHCP server log to find out which IP address was given to the BMC. I recommend setting the BMC IP address to a fixed value to minimize the risk of losing connectivity in case of a major data center issue. This can be done via the Web interface at e.g. https://192.168.1.15 or via ipmi-config:

    # ipmi-config --commit -e Lan_Conf:IP_Address_Source=Static
    # ipmi-config --commit -e Lan_Conf:IP_Address=192.168.1.42
    

    Alternatively you can write the configuration to a file and commit it with the --filename argument. E.g. create a static-network.conf:

    # Static network configuration for the BMC
    Section Lan_Conf
            IP_Address_Source                             Static
            IP_Address                                    192.168.1.42
    EndSection
    

    Unfortunately the IP_Address_Source key cannot be changed together with other Lan_Conf keys. That’s why we had to run the commit command twice when specifying the new values on the command line, and the same is true for the configuration file option. The first time it will set the IP_Address_Source:

    # ipmi-config --commit --filename=static-network.conf
    

    And the second time it will set the new IP_Address:

    # ipmi-config --commit --filename=static-network.conf
    

    However, this is an exception and most other key/value pairs can be changed concurrently. See the ipmi-config(8) man-page for more details.

    Eventually we can check if the IPMI interface is reachable over the network. This can be done with the ipmi-ping tool. If the --verbose argument is added, it will even output some information about the configured authentication settings. E.g.:

    # ipmi-ping --count 1 --verbose 192.168.1.42
    ipmi-ping 192.168.1.42 (192.168.1.42)
    response received from 192.168.1.42: rq_seq=23, auth: none=clear md2=set md5=set password=set oem=clear anon=clear null=clear non-null=set user=clear permsg=clear 
    --- ipmi-ping 192.168.1.42 statistics ---
    1 requests transmitted, 1 responses received in time, 0.0% packet loss
    
    Configure Serial-over-LAN (SoL) Console Access

    A neat feature of a BMC is the ability to remotely access the Linux console even when the regular network connection to the server is not working any more. However, this needs some additional configuration in the operating system.

    Check SoL Settings in BIOS
    First, we will check the BIOS settings to make sure the SoL feature is enabled. Note that accessing the BIOS usually already works in the default configuration, so if you are adventurous, you can already connect over remote IPMI (for demonstration purposes I’ll use the default user ADMIN):

    $ ipmi-console -h 192.168.1.42 -u ADMIN -P
    

    By default you won’t have access to the Linux system yet, so reboot the system via SSH so the BIOS can be set up accordingly. Unfortunately, the IPMI console won’t show the system’s POST output, so you have to keep pressing the DEL button in the terminal with the open IPMI console until you see the BIOS main screen. Then go to Advanced -> Serial Port Console Redirection and make sure the SOL Console Redirection is enabled. Also remember how many additional serial ports you have available for configuration in this menu. This will become important when selecting the correct console in the Linux configuration. E.g. my board only has one other serial port (COM1) available, and console redirection must not be enabled for this port for SoL to work correctly.

    SuperMicro BIOS settings accessed via ipmi-console

    If you want to see more verbose POST output, also disable Quiet Boot in the Advanced -> Boot Feature menu.

    Leave the BIOS by pressing Save Changes and Reset. After a short while you should be greeted by the Grub menu. However, there won’t be any more output, as Linux doesn’t know yet that it should output to the SoL console. To exit the IPMI console, you have to press &..

    Setup Serial Console in Linux
    If you have a systemd-based distribution, all you need to do is modify the kernel command line and add a serial console. Edit /etc/default/grub and extend the GRUB_CMDLINE_LINUX variable as follows:

    GRUB_CMDLINE_LINUX="rd.lvm.lv=vg00/swap rd.lvm.lv=vg00/slash rhgb quiet console=tty0 console=ttyS1,115200n8"
    

    Important: This has to be /dev/ttyS1 in case the BIOS only knows one additional serial port (which would correspond to /dev/ttyS0). If you have two additional serial ports in the BIOS (COM1/COM2), you must set /dev/ttyS2 here. Don’t configure Grub 2 to output itself to the serial console, as this is still handled by the default SoL configuration forwarding the initial system output.

    Regenerate the grub.cfg with:

    # grub2-mkconfig -o /boot/grub2/grub.cfg
    

    You might need to adjust the command and grub.cfg path according to your installation. The next time the system is booted, the Linux system output will be available on the serial terminal and there will also be a login prompt. If you don’t have a systemd-based system, you additionally need to enable the serial console login terminal in /etc/inittab. E.g.:

    T1:23:respawn:/sbin/getty -L ttyS1 115200 vt100
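
    On a systemd-based system the serial getty is normally spawned automatically for the console device given on the kernel command line. If it doesn’t show up, it can also be enabled explicitly; a sketch, using ttyS1 as configured above:

    # systemctl start serial-getty@ttyS1.service
    # systemctl enable serial-getty@ttyS1.service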
    
    Remotely Manage the System over IPMI

    Now everything we need is in place. Independently of the Linux operating system or the local system console, we can:

    1. Configure the BMC via ipmi-config (see above)
    2. Access the Linux console via ipmi-console
    3. Query system sensor values (temperatures, voltages, …) via ipmi-sensors
    4. Query system event log via ipmi-sel
    5. Initiate system startup and power reset via ipmi-power
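
    For example, the query commands from this list might be invoked like this against the BMC configured above (ipmi-power --stat only queries the current power state):

    $ ipmi-sensors -h 192.168.1.42 -u ADMIN -P
    $ ipmi-sel -h 192.168.1.42 -u ADMIN -P
    $ ipmi-power -h 192.168.1.42 -u ADMIN -P --stat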

    Even though I have the latest available IPMI firmware installed, it sometimes happened to me that the SoL wouldn’t properly connect to the Linux console any more after I disconnected while being in the BIOS, or that I could navigate through the BIOS but it wouldn’t refresh the screen. Unfortunately the only way I found to fix this was to reboot the BMC. Of course this is also possible over IPMI:

    $ bmc-device -h 192.168.1.42 -u ADMIN -P --cold-reset
    

    There are many other things that you can do via IPMI, and there are other tools which can access IPMI-enabled BMCs. Another famous one is ipmitool. You might want to check out Adam Sweet’s wiki on IPMI on Linux for more details on how to use ipmitool. Personally, I prefer FreeIPMI for being a GNU project and very intuitive to use, and so far it has worked perfectly well for everything I needed.

    If you have some hints or additions, please leave a comment below. Thanks for reading.

    Dec 012015
     

    I recently had the challenge of migrating a physical host running Debian, installed on an ancient 32bit machine with a software RAID 1, to a virtual machine (VM). The migration should be as simple as possible, so I decided to attach the original disks to the virtualization host and define a new VM which boots from these disks. This doesn’t sound too complicated, but it still included some steps I was not so familiar with, so I’m writing them down here for later reference. As an experiment I tried the migration on both a KVM and a Xen host.

    Prepare original host for virtualized environment

    Enable virtualized disk drivers
    The host I wanted to migrate was running a Debian Jessie i686 installation on bare metal hardware. For booting it uses an initramfs which loads the required disk controller drivers. By default the initramfs only includes the drivers for the current hardware, so the paravirtualization drivers for KVM (virtio-blk) and Xen (xen-blkfront) are missing. This can be changed by adjusting the /etc/initramfs-tools/conf.d/driver-policy file so that not only the modules for the currently present hardware (dep) are included, but most available ones (most):

    # Driver inclusion policy selected during installation
    # Note: this setting overrides the value set in the file
    # /etc/initramfs-tools/initramfs.conf
    MODULES=most
    

    Afterwards the initramfs must be rebuilt by running the following command (adjust the kernel version according to the kernel you are running):

    # dpkg-reconfigure linux-image-3.16.0-4-686-pae
    

    Now the initramfs contains all required modules to successfully boot the host from a KVM virtio or a Xen blockfront disk device. This can be checked with:

    # lsinitramfs /boot/initrd.img-3.16.0-4-686-pae | egrep -i "(xen|virtio).*blk.*\.ko"
    lib/modules/3.16.0-4-686-pae/kernel/drivers/block/virtio_blk.ko
    lib/modules/3.16.0-4-686-pae/kernel/drivers/block/xen-blkback/xen-blkback.ko
    lib/modules/3.16.0-4-686-pae/kernel/drivers/block/xen-blkfront.ko
    

    Enable serial console
    To easily access the server console from the virtualization host, it’s required to set up a serial console in the guest, which must be defined in /etc/default/grub:

    [...]
    GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
    GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1"
    [...]
    # Uncomment to disable graphical terminal (grub-pc only)
    GRUB_TERMINAL="console serial"
    

    After regenerating the Grub configuration, the boot loader and local console will then also be accessible via virsh console:

    # grub-mkconfig -o /boot/grub/grub.cfg
    
    Run the host on a KVM hypervisor

    KVM is nowadays the de-facto default virtualization solution for running Linux on Linux and is very well integrated in all major distributions and a wide range of management tools. Therefore it’s really simple to boot the disks into a KVM VM:

    • Make sure the disks are not already assembled as md-device on the KVM host. I did this by uninstalling the mdadm utility.
    • Run the following virt-install command to define and start the VM:
      # virt-install --connect qemu:///system --name myhost --memory 1024 --vcpus 2 --cpu host --os-variant debian7 --arch i686 --import --disk /dev/disk/by-id/ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J4527214 --disk /dev/disk/by-id/ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J4366412 --network bridge=virbr0 --graphics=none

    To avoid confusion with the disk enumeration of the KVM host kernel, the disk devices are given as unique device paths instead of e.g. /dev/sdb and /dev/sdc. This clearly identifies the two disk devices which will then be assembled to a RAID 1 by the guest kernel.
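
    Once the VM is running, the serial console prepared earlier can be reached from the KVM host with virsh console, and from inside the guest /proc/mdstat should show the RAID 1 assembled from the virtio disks; for example:

    # virsh console myhost
    # cat /proc/mdstat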

    Run the host on a Xen hypervisor

    Xen has been a topic in this blog a few times before, as I have been using it since its early days. After a long period of major code refactorings it has made huge steps forward within the last few years, so that nowadays it’s nearly as easy to set up as KVM and again well supported in the big distributions. Because its architecture is a bit different and Linux is not the only supported platform, some things don’t work quite the same as under KVM. Let’s find out…

    Prepare Xen Grub bootloader
    Xen virtual machines can be booted in various ways, also depending on whether the guest disk is running on an emulated disk controller or as a paravirtualized disk. In simple cases, this is done with the help of PyGrub loading a Grub Legacy grub.conf or a Grub 2 grub.cfg from a guest’s /boot partition. As my grub.cfg and kernel are “hidden” in a RAID 1, a fully featured Grub 2 is required. It is perfectly able to load its configuration from advanced disk setups such as RAID arrays, LVM volumes and more. With Xen such a setup is only possible when using a dedicated boot loader disk image which must be built for the target architecture of the virtual machine. As I want to run a 32bit virtual machine, an i386-xen Grub 2 flavor needs to be compiled as the basis for the boot loader image:

    • Clone the Grub 2 source code:
      $ git clone git://git.savannah.gnu.org/grub.git
    • Configure and build the source code. I was running the following commands on a Gentoo Xen server without issues. On other distributions you might need to install some development packages first:
      $ cd grub
      $ ./autogen.sh
      $ ./configure --target=i386 --with-platform=xen
      $ make -j3
    • Prepare a grub.cfg for the Grub image. The Grub 2 from the Debian guest host was installed on a mirrored boot disk called /dev/md0 which was mounted to /boot. Therefore the Grub image must be hinted to load the actual configuration from there:
      # grub.cfg for Xen Grub image grub-md0-i386-xen.img
      normal (md/0)/grub/grub.cfg
    • Create the Grub image using the previously compiled Grub modules:
      # grub-mkimage -O i386-xen -c grub.cfg -o /var/lib/libvirt/images/grub-md0-i386-xen.img -d ~user/grub/grub-core ~user/grub/grub-core/*.mod

    A more generic guide about generating Xen Grub images can be found at Using Grub 2 as a bootloader for Xen PV guests.

    Define the virtual machine
    virt-install can also be used to import disk (images) as Xen virtual machines. This works perfectly fine for e.g. importing Atomic Host raw images, but unfortunately the setup here is a bit more complicated: it requires specifying the custom Grub 2 image as the guest kernel, for which I couldn’t find a corresponding virt-install argument (using v1.3.0). Therefore I first created a plain Xen xl configuration file and then converted it to a libvirt XML file. The initial configuration /etc/xen/myhost.cfg looks as follows:

    name = "myhost"
    kernel = "/var/lib/libvirt/images/grub-md0-i386-xen.img"
    memory = 1024
    vcpus = 2
    vif = [ 'bridge=virbr0' ]
    disk = [ '/dev/disk/by-id/ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J4527214,raw,xvda,rw', '/dev/disk/by-id/ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J4366412,raw,xvdb,rw' ]
    

    Eventually this configuration can be converted to a libvirt domain XML and the domain defined with the following commands:

    # virsh domxml-from-native --format xen-xl --config /etc/xen/myhost.cfg > /etc/libvirt/libxl/myhost.xml
    # virsh --connect xen:/// define /etc/libvirt/libxl/myhost.xml
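
    After defining the domain, it can be started and its console reached via libvirt as well:

    # virsh --connect xen:/// start myhost
    # virsh --connect xen:/// console myhost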
    
    Summary

    In the end, I successfully converted the bare metal host to a KVM and/or Xen virtual machine. Once again Xen ended up being a bit more troublesome, but no nasty hacks were required and both configurations were quite straightforward. The boot loader setup for Xen could definitely be a bit simpler, but it’s still nice to see how well Grub 2 and Xen play together now. It’s also pleasing to see that no virtualization-specific configuration had to be made within the guest installation. The fact that the disk device names changed from /dev/sd[ab] (bare metal) to /dev/vd[ab] (KVM) and /dev/xvd[ab] (Xen) was completely absorbed by the mdadm layer, but even without software RAID the /etc/fstab entries are nowadays generated by the installers in a way that such migrations are easily possible.

    Thanks for reading. If you have some comments, don’t hesitate to write a note below.

    Aug 282014
     

    Today I found out how super easy it is to set up secure HTTP authentication via Kerberos with the help of FreeIPA. Having the experience of managing a manually engineered MIT Kerberos/OpenLDAP/EasyRSA infrastructure, I’m once again blown away by the simplicity and usability of FreeIPA. I’ll describe, with only a few commands which can be run within less than 10 minutes, how to set up a fully featured Kerberos-authenticated Web server configuration. The prerequisites are a FreeIPA server (a simple installation guide can be found for example here) and a RedHat-based Web server host (RHEL, CentOS, Fedora).

    Required Packages:
    First we are going to install the required RPM packages:

    # yum install httpd mod_auth_kerb mod_ssl ipa-client

    Register the Web server host at FreeIPA:
    Make sure the Web server host is managed by FreeIPA:

    # ipa-client-install --domain=example.com --server=ipaserver.example.com --realm=EXAMPLE.COM --mkhomedir --hostname=webserver.example.com --configure-ssh --configure-sshd

    Create a HTTP Kerberos Principal and install the Keytab:
    The Web server is identified in a Kerberos setup through a keytab, which has to be generated and installed on the Web server host. First make sure that you have a valid Kerberos ticket of a FreeIPA account with enough permissions (e.g. ‘admin’):

    # kinit admin
    # ipa-getkeytab -s ipaserver.example.com -p HTTP/webserver.example.com -k /etc/httpd/conf/httpd.keytab

    This will create a HTTP service principal in the KDC and install the corresponding keytab in the Apache httpd configuration directory. Just make sure that it can be read by the httpd server account:

    # chown apache /etc/httpd/conf/httpd.keytab
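
    Whether the keytab contains the expected principal can be verified with klist:

    # klist -kt /etc/httpd/conf/httpd.keytab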

    Create a SSL certificate
    No need to fiddle around with OpenSSL. Requesting, signing and installing an SSL certificate with FreeIPA is one simple command:

    # ipa-getcert request -k /etc/pki/tls/private/webserver.key -f /etc/pki/tls/certs/webserver.crt -K HTTP/webserver.example.com -g 3072

    This will create a 3072 bit server key, generate a certificate request, send it to the FreeIPA Dogtag CA, sign it and install the resulting PEM certificate on the Web server host.
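
    Since the certificate was requested through certmonger, it should also be renewed automatically before it expires. The tracking status can be checked with:

    # ipa-getcert list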

    Configure Apache HTTPS
    The httpd setup is the last and only configuration step which needs to be done manually. For HTTPS set the certificate paths in /etc/httpd/conf.d/ssl.conf:

    [...]
    SSLCertificateFile /etc/pki/tls/certs/webserver.crt
    SSLCertificateKeyFile /etc/pki/tls/private/webserver.key
    SSLCertificateChainFile /etc/ipa/ca.crt
    

    Additionally do some SSL stack hardening (you may also want to read this):

    [...]
    SSLCompression off
    SSLProtocol all -SSLv2 -SSLv3 -TLSv1.0
    SSLHonorCipherOrder on
    SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH EDH+aRSA !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4"
    

    Kerberos HTTP Authentication:
    The final httpd authentication settings for ‘mod_auth_kerb’ are done in /etc/httpd/conf.d/auth_kerb.conf or any vhost you want:

    <Location />
      SSLRequireSSL
      AuthType Kerberos
      AuthName "Kerberos Login"
      KrbMethodNegotiate On
      KrbMethodK5Passwd On
      KrbAuthRealms EXAMPLE.COM
      Krb5KeyTab /etc/httpd/conf/httpd.keytab
      require valid-user
    </Location>
    

    That’s it! After restarting the Web server you can log in on https://webserver.example.com with your IPA accounts. If you don’t already have a valid Kerberos ticket in the Web client, KrbMethodK5Passwd On enables interactive password authentication.
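
    A quick way to test the negotiate authentication from a client which already has a Kerberos ticket is curl (assuming it was built with GSS-API support; the user name is just an example):

    $ kinit jdoe
    $ curl --negotiate -u : https://webserver.example.com/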

    Troubleshooting
    In case you get the following error message in the httpd error log, make sure the keytab exists and is readable by the httpd account (e.g. ‘apache’):

    [Wed Aug 27 07:23:04 2014] [debug] src/mod_auth_kerb.c(646): [client 192.168.122.1] Trying to verify authenticity of KDC using principal HTTP/webserver.example.com@EXAMPLE.COM
    [Wed Aug 27 07:23:04 2014] [debug] src/mod_auth_kerb.c(689): [client 192.168.122.1] krb5_rd_req() failed when verifying KDC