Feb 24 2020
 

This is the second part of my field report about installing the oVirt 4.4-alpha release on CentOS 8 in a hyperconverged setup. In the first part I focused on setting up the GlusterFS storage cluster, and now I’m going to describe my experience with the self-hosted engine installation.

If you are thinking about repeating this installation on your hardware, please let me remind you: this software is currently in alpha status. This means there are likely still many bugs and rough edges, and even if you manage to install it successfully there is no guarantee that updates won’t break everything again. Please don’t try this anywhere close to production systems or data. I won’t be able to assist you in any way if things turn out badly.

Cockpit Hosted-Engine Wizard

Before we can start installing the self-hosted engine, we need to install a few more packages:

# dnf install ovirt-engine-appliance vdsm-gluster

As with the GlusterFS setup, the hosted engine setup can also be done from the Cockpit Web interface:

Here too, the wizard is pretty self-explanatory. A few options are missing in the Web UI compared to the command-line installer (hosted-engine --deploy), e.g. you cannot customize the name of the libvirt domain, which is called ‘HostedEngine’ by default. You must provide the common details such as hostname, VM resources, some network settings, and credentials for the VM and oVirt, and that’s pretty much it:

Before you start deploying the VM there is also a quick summary of the settings, and then an answer file is generated. While the GlusterFS setup created a regular Ansible inventory, the hosted engine setup has its own INI format. A useful feature is that even when the deployment aborts, it can always be restarted from the Web interface without filling in the form again. Indeed, I used this to my advantage a lot because it took me at least 20 attempts before the hosted engine VM was set up successfully.

Troubleshooting hosted-engine issues

Once the VM deployment was running I found that the status output in the Cockpit Web interface closely resembled Ansible output. It seems that a big part of the deployment code in the hosted-engine tool has been re-implemented using the ovirt-ansible-hosted-engine-setup Ansible roles in the background. If you’re familiar with Ansible this definitely simplifies troubleshooting and also gives a better understanding of what is going on. Unfortunately there is still a layer of hosted-engine code above Ansible, so I couldn’t figure out whether it’s possible to run a playbook from the shell that would do the same setup.

Obviously it didn’t take long for an issue to pop up:

  • The first error was that Ansible couldn’t connect to the hosted-engine VM that was freshly created from the oVirt appliance disk image. The error output in the Web interface is rather limited, but here too a log file exists, which can be found at a path like /var/log/ovirt-engine/setup/ovirt-engine-setup-20200218154608-rtt3b7.log. In the log file I found:
    2020-02-18 15:32:45,938+0100 DEBUG ansible on_any args localhostTASK: ovirt.hosted_engine_setup : Wait for the local VM kwargs 
    2020-02-18 15:35:52,816+0100 ERROR ansible failed {
        "ansible_host": "localhost",
        "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
        "ansible_result": {
            "_ansible_delegated_vars": {
                "ansible_host": "ovirt.oasis.home"
            },
            "_ansible_no_log": false,
            "changed": false,
            "elapsed": 185,
            "msg": "timed out waiting for ping module test success: Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this.  Please add this host's fingerprint to your known_hosts file to manage this host."
        },
        "ansible_task": "Wait for the local VM",
        "ansible_type": "task",
        "status": "FAILED",
        "task_duration": 187
    }

    A manual SSH login with the root account on the VM was possible after accepting the fingerprint. Maybe this is still a bug or I missed a setting somewhere, but the easiest way to solve this was to create a ~root/.ssh/config file on the hypervisor host with the following content. The hostname is the hosted-engine FQDN:

    Host ovirt.oasis.home
        StrictHostKeyChecking accept-new
    

    Each installation attempt makes sure that the previous host key is deleted from the known_hosts file, so there is no need to worry about changing keys across multiple installation attempts. The deployment could simply be restarted by pressing the “Prepare VM” button once again.

  • During the next run the connection to the hosted-engine VM succeeded and nearly all of the setup tasks within the VM completed, but then it failed when trying to restart the ovirt-engine-dwhd service:
    2020-02-18 15:48:45,963+0100 INFO otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.service service._closeup:52 Starting dwh service
    2020-02-18 15:48:45,964+0100 DEBUG otopi.plugins.otopi.services.systemd systemd.state:170 starting service ovirt-engine-dwhd
    2020-02-18 15:48:45,965+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:813 execute: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service'), executable='None', cwd='None', env=None
    2020-02-18 15:48:46,005+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service'), rc=1
    2020-02-18 15:48:46,006+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service') stdout:
    
    
    2020-02-18 15:48:46,006+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service') stderr:
    Job for ovirt-engine-dwhd.service failed because the control process exited with error code. See "systemctl status ovirt-engine-dwhd.service" and "journalctl -xe" for details.
    
    2020-02-18 15:48:46,007+0100 DEBUG otopi.context context._executeMethod:145 method exception
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
        method['method']()
      File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine-dwh/core/service.py", line 55, in _closeup
        state=True,
      File "/usr/share/otopi/plugins/otopi/services/systemd.py", line 181, in state
        service=name,
    RuntimeError: Failed to start service 'ovirt-engine-dwhd'
    2020-02-18 15:48:46,008+0100 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed to start service 'ovirt-engine-dwhd'
    2020-02-18 15:48:46,009+0100 DEBUG otopi.context context.dumpEnvironment:765 ENVIRONMENT DUMP - BEGIN
    2020-02-18 15:48:46,010+0100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/error=bool:'True'
    2020-02-18 15:48:46,010+0100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/exceptionInfo=list:'[(, RuntimeError("Failed to start service 'ovirt-engine-dwhd'",), )]'
    2020-02-18 15:48:46,012+0100 DEBUG otopi.context context.dumpEnvironment:779 ENVIRONMENT DUMP - END
    

    Fortunately I was able to log in to the hosted-engine VM and found the following blunt error:

    -- Unit ovirt-engine-dwhd.service has begun starting up.
    Feb 18 15:48:46 ovirt.oasis.home systemd[30553]: Failed at step EXEC spawning /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py: Permission denied
    -- Subject: Process /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py could not be executed
    

    Indeed, the referenced script was not marked executable. Fixing it manually and restarting the service showed that this would succeed. But there is one problem: this change is not persistent. On the next deployment run, the hosted-engine VM is deleted and re-created. When searching for a nicer solution I found that this bug is actually already fixed in the latest release of ovirt-engine-dwh-4.4.0-1.el8.noarch.rpm, but the appliance image (ovirt-engine-appliance-4.4-20200212182535.1.el8.x86_64) only included ovirt-engine-dwh-4.4.0-0.0.master.20200206083940.el7.noarch and there was no newer appliance image. That’s part of the experience when trying alpha releases, but it’s not a blocker. Eventually I found that there is a directory where you can place an Ansible tasks file which will be executed in the hosted-engine VM before the setup is run. So I created the file hooks/enginevm_before_engine_setup/yum_update.yml in the /usr/share/ansible/roles/ovirt.hosted_engine_setup/ directory with the following content:

    ---
    - name: Update all packages
      package:
        name: '*'
        state: latest
    

    From then on each deployment attempt first updated the packages in the VM, including ‘ovirt-engine-dwh’, before the hosted-engine setup continued to configure and restart the service.

  • The next issue suddenly appeared when I tried to re-run the deployment. The Ansible code would fail early with an error that it could not update the routing rules on the hypervisor:
    2020-02-18 16:17:38,330+0100 DEBUG ansible on_any args  kwargs 
    2020-02-18 16:17:38,664+0100 INFO ansible task start {'status': 'OK', 'ansible_type': 'task', 'ansible_playbook': '/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml', 'ansible_task': 'ovirt.hosted_engine_setup : Add IPv4 outbound route rules'}
    2020-02-18 16:17:38,664+0100 DEBUG ansible on_any args TASK: ovirt.hosted_engine_setup : Add IPv4 outbound route rules kwargs is_conditional:False 
    2020-02-18 16:17:38,665+0100 DEBUG ansible on_any args localhostTASK: ovirt.hosted_engine_setup : Add IPv4 outbound route rules kwargs 
    2020-02-18 16:17:39,214+0100 DEBUG var changed: host "localhost" var "result" type "" value: "{
        "changed": true,
        "cmd": [
            "ip",
            "rule",
            "add",
            "from",
            "192.168.222.1/24",
            "priority",
            "101",
            "table",
            "main"
        ],
        "delta": "0:00:00.002805",
        "end": "2020-02-18 16:17:38.875350",
        "failed": true,
        "msg": "non-zero return code",
        "rc": 2,
        "start": "2020-02-18 16:17:38.872545",
        "stderr": "RTNETLINK answers: File exists",
        "stderr_lines": [
            "RTNETLINK answers: File exists"
        ],
        "stdout": "",
        "stdout_lines": []
    }"
    

    So I checked the rules manually and yes, they were already there. I thought this was an easy case that must be a simple idempotency issue in the Ansible code. But when I looked at the code there was already a condition in place that should prevent this case from happening. Even after multiple attempts to debug this code, I couldn’t find the reason why the check was failing. Eventually I found the GitHub pull request #96 where someone had already refactored this code with the commit message “Hardening existing ruleset lookup”. So I forward-ported the patch to release 1.0.35, which fixed the problem. The PR has been open for more than a year with no indication that it will be merged soon, so I still reported the issue in ovirt-ansible-hosted-engine-setup #289.
    I only found out about ovirt-hosted-engine-cleanup a few hours later. With its help you can easily work around this issue by cleaning up the installation before another retry.

  • Another tough-to-debug but easy-to-fix issue popped up after the hosted-engine VM setup completed and the Ansible role was checking the oVirt events for errors:
    2020-02-19 01:46:53,723+0100 ERROR ansible failed {
        "ansible_host": "localhost",
        "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
        "ansible_result": {
            "_ansible_no_log": false,
            "changed": false,
            "msg": "The host has been set in non_operational status, deployment errors:   code 4035: Gluster command [] failed on server .,    code 10802: VDSM loki.oasis.home command GlusterServersListVDS failed: The method does not exist or is not available: {'method': 'GlusterHost.list'},   fix accordingly and re-deploy."
        },
        "ansible_task": "Fail with error description",
        "ansible_type": "task",
        "status": "FAILED",
        "task_duration": 0
    }
    

    This error no longer comes from the Ansible code; instead the engine itself fails to query the GlusterFS status on the hypervisor. This is done via VDSM, a daemon that runs on each oVirt hypervisor and manages the hypervisor configuration and status. Maybe the VDSM log (/var/log/vdsm/vdsm.log) reveals more:

    2020-02-19 01:46:45,786+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities succeeded in 3.33 seconds (__init__:312)
    2020-02-19 01:46:45,981+0100 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call GlusterHost.list failed (error -32601) in 0.00 seconds (__init__:312)
    

    It seems that regular RPC calls to VDSM are successful and only the GlusterFS query is failing. I tracked down the source code of this implementation and found that there is a CLI command that can be used to run this query:

    # vdsm-client --gluster-enabled -h
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/vdsmclient/client.py", line 276, in find_schema
        with_gluster=gluster_enabled)
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 156, in vdsm_api
        return Schema(schema_types, strict_mode, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 142, in __init__
        with io.open(schema_type.path(), 'rb') as f:
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 95, in path
        ", ".join(potential_paths))
    vdsm.api.vdsmapi.SchemaNotFound: Unable to find API schema file, tried: /usr/lib/python3.6/site-packages/vdsm/api/vdsm-api-gluster.pickle, /usr/lib/python3.6/site-packages/vdsm/api/../rpc/vdsm-api-gluster.pickle

    Ah, that’s better. I love such error messages. Thanks to that it was not hard to find out that I had simply forgotten to install the vdsm-gluster package on the hypervisor.

That’s it. After that the deployment completed successfully:
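The routing rule failure described above is a classic idempotency problem: the task should only run "ip rule add" when no equivalent rule exists yet. The following is a minimal Python sketch of such a lookup. It is not the code from the Ansible role or from pull request #96. The function name rule_exists, the sample text and the assumed line format of "ip rule show" ("101:<tab>from 192.168.222.1/24 lookup main") are illustrative assumptions. Comparing networks instead of raw strings means that 192.168.222.1/24 and 192.168.222.0/24 are treated as the same source.

```python
import ipaddress
import re

# Assumed shape of one line of `ip rule show` output, for example:
#   "101:\tfrom 192.168.222.1/24 lookup main"
RULE_RE = re.compile(
    r"^(?P<priority>\d+):\s+from\s+(?P<src>\S+)\s+lookup\s+(?P<table>\S+)"
)


def rule_exists(rules_output, source, priority, table):
    """Return True if an equivalent source rule is already listed."""
    # strict=False accepts host bits, so 192.168.222.1/24 -> 192.168.222.0/24
    wanted = ipaddress.ip_network(source, strict=False)
    for line in rules_output.splitlines():
        match = RULE_RE.match(line.strip())
        if match is None:
            continue  # line has a shape this sketch does not understand
        src = match.group("src")
        if src == "all":
            continue  # catch-all rules never equal a specific source
        try:
            listed = ipaddress.ip_network(src, strict=False)
        except ValueError:
            continue
        if (
            listed == wanted
            and int(match.group("priority")) == priority
            and match.group("table") == table
        ):
            return True
    return False


# Hypothetical sample output, used instead of calling the real `ip` command.
SAMPLE = """\
0:\tfrom all lookup local
101:\tfrom 192.168.222.1/24 lookup main
32766:\tfrom all lookup main
32767:\tfrom all lookup default
"""

if __name__ == "__main__":
    print(rule_exists(SAMPLE, "192.168.222.1/24", 101, "main"))
```

A caller would only execute the "ip rule add" command when rule_exists() returns False, which makes repeated deployment attempts safe.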

And finally a screenshot of the oVirt 4.4-alpha administration console. Yes, it works:

Conclusion

At the end of the day most of the issues happened because I was not very familiar with the setup procedure and at the same time refused to follow any setup instructions for an older release. There was a minor bug with the ovirt-engine-dwh restart, which was already fixed upstream but hadn’t yet made it into the hosted-engine appliance image. That is to be expected in an alpha release.

I also quickly set up some VMs to test the basic functionality of oVirt and couldn’t find any major issues so far. I guess most people using oVirt are much more experienced with it than I am anyway, so there shouldn’t be any concerns about trying oVirt 4.4-alpha yourself. To me that was an interesting experience and I’m very happy about the Ansible integration that this project is pushing. It was also a nice experience to use Cockpit and I believe it definitely makes this product more appealing to set up and use for a wide range of IT professionals. As long as it can be done via the command line too, I’ll be happy.

Feb 23 2020
 

For a while I had an oVirt server in a hyperconverged setup, which means that the hypervisor was also running a GlusterFS storage server and that the oVirt management virtual machine was inside the oVirt cluster (self-hosted engine). On top of oVirt I was running an OKD cluster with a containerized GlusterFS cluster that could be used for persistent volumes by the container workload. All of this was running on a single hypervisor with a single SSD, which unfortunately gave up on me after a few years in operation. Recently I stumbled upon the oVirt 4.4-alpha. Besides initial support for running it on CentOS 8, its proper support for Ignition, which is used by (Fedora and Red Hat) CoreOS and therefore OpenShift/OKD 4, attracted my attention. Why not give it a try and see how far I get…? After a few hours of tinkering the installation succeeded:

And now I’m going to describe what was necessary to do so. Not everything that I’ll mention is brand new. My past experience of setting up such a system is more or less based on the guide Up and Running with oVirt 4.0 and Gluster Storage that I followed a few years ago, so I’m also highlighting a few things that have changed since then.

If you are thinking about repeating this installation on your hardware, please let me remind you: this software is currently in alpha status. This means there are likely still many bugs and rough edges, and even if you manage to install it successfully there is no guarantee that updates won’t break everything again. Please don’t try this anywhere close to production systems or data. I won’t be able to assist you in any way if things turn out badly.

I split this field report into two parts, the first one discussing the GlusterFS storage setup and the second one explaining my challenges when setting up the oVirt self-hosted-engine.

Hypervisor disk layout

I was using a minimal install of CentOS 8 on a bare-metal server. Make sure you either have two disks or create a separate partition when installing CentOS so that the GlusterFS storage can live on its own block device. My disk layout looks something like this:

[root@loki ~]# lsblk
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                       8:0    0 931.5G  0 disk 
├─sda1                    8:1    0   100M  0 part 
├─sda2                    8:2    0   256M  0 part /boot
├─sda3                    8:3    0    45G  0 part 
│ ├─vg_loki-slash       253:0    0    10G  0 lvm  /
│ ├─vg_loki-swap        253:1    0     4G  0 lvm  [SWAP]
│ ├─vg_loki-var         253:2    0     5G  0 lvm  /var
│ ├─vg_loki-home        253:3    0     2G  0 lvm  /home
│ └─vg_loki-log         253:4    0     2G  0 lvm  /var/log
└─sda4                    8:4    0   500G  0 part

The CentOS installation is placed on an LVM volume group (vg_loki on /dev/sda3) with some individual volumes for dedicated mount points, and then there is /dev/sda4, an empty disk partition that will be used by GlusterFS later. If you wonder how to do such a setup with the CentOS 8 installer… I don’t know. I tried for a moment to configure this setup in the installer, but eventually I gave up, manually partitioned the disk, created the volume group and then used the pre-generated setup in the installer, which perfectly detected what I had done on the shell.

Install software requirements

First you need to enable the oVirt 4.4-alpha package repository by installing the corresponding release package:

# dnf install https://resources.ovirt.org/pub/yum-repo/ovirt-release44-pre.rpm

Recently a lot of effort was invested into incorporating the oVirt setup into the Cockpit Web interface, and it’s now even the recommended installation method for the downstream Red Hat Virtualization (RHV). When I set up my previous hyperconverged oVirt 4.0 this wasn’t available, so of course I’m going to try it. To set up Cockpit and the oVirt integration the following packages need to be installed:

# dnf install cockpit cockpit-ovirt-dashboard glusterfs-server

After logging into Cockpit, which runs on the hypervisor host on port 9090, there is a dedicated oVirt tab with two entries:

If you continue with the hyperconverged setup, there is now even a dedicated option to install a single-node GlusterFS “cluster”!

This was a big positive surprise to me because years ago the previously used gdeploy tool insisted on a three-node GlusterFS cluster.

Running the GlusterFS wizard

After this revelation the GlusterFS setup is supposedly straightforward. Still, I ran into some issues that I could probably have avoided by carefully reading the installation instructions for oVirt 4.3. Nonetheless I’m quickly going to mention a few points here in case other people are struggling with the same issues and search the Web for these error messages:

  • On the first screen I had an error that the setup cannot proceed because "gluster-ansible-roles is not installed on Host":

    However, the related package including the Ansible roles from gluster-ansible was clearly there:

    # rpm -q gluster-ansible-roles
    gluster-ansible-roles-1.0.5-7.el8.noarch
    

    Eventually I found that the sudo rules for my unprivileged user account were not properly picked up by Cockpit, so I restarted the setup using the root account, which then successfully detected the Ansible roles.

  • When selecting the brick setup the “Raid Type” must be changed to “JBOD” and the device name that was reserved for the GlusterFS storage must be entered:

    Eventually the wizard will create a dedicated LVM volume group for GlusterFS bricks and, if you wish, an LVM thin-pool volume:
    # lvs gluster_vg_sda4
      LV                               VG              Attr       LSize    Pool                             Origin Data%  Meta%  Move Log Cpy%Sync Convert
      gluster_lv_data                  gluster_vg_sda4 Vwi-aot---  125.00g gluster_thinpool_gluster_vg_sda4        3.35                                   
      gluster_lv_engine                gluster_vg_sda4 -wi-ao----   75.00g                                                                                
      gluster_lv_vmstore               gluster_vg_sda4 Vwi-aot---  300.00g gluster_thinpool_gluster_vg_sda4        0.05                                   
      gluster_thinpool_gluster_vg_sda4 gluster_vg_sda4 twi-aot--- <421.00g                                         1.03   0.84
  • Before the Ansible playbook that sets up the storage is executed, the generated inventory file is displayed. Because I'm very familiar with Ansible anyway, I love this part! It also makes it easier to understand which playbook command to run when troubleshooting on the command line, where involving the Cockpit Web interface no longer makes sense:
    The "Enable Debug Logging" option is definitely worth enabling, especially if you run this for the first time on your server. It gives you much more insight into what Ansible is actually doing on your hypervisor.
  • When finally running the "Deploy" step it didn't take long for the playbook to fail with the following error:
    TASK [Check if provided hostnames are valid] ***********************************
    task path: /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml:29
    fatal: [loki.oasis.home]: FAILED! => {"msg": "The conditional check 'result.results[0]['stdout_lines'] > 0' failed. The error was: Unexpected templating type error occurred on ({% if result.results[0]['stdout_lines'] > 0 %} True {% else %} False {% endif %}): '>' not supported between instances of 'list' and 'int'"}
    

    This code has not been touched in a long time, so I reported the issue as RHBZ #1806298. I'm not sure how this could ever have worked, but the fix is trivial. After changing the following lines the task passed successfully:

    --- /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml.orig       2020-02-18 14:48:33.678471259 +0100
    +++ /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml    2020-02-18 14:48:55.810456470 +0100
    @@ -30,7 +30,7 @@
           assert:
             that:
               - "result.results[0]['rc'] == 0"
    -          - "result.results[0]['stdout_lines'] > 0"
    +          - "result.results[0]['stdout_lines'] | length > 0"
             fail_msg: "The given hostname is not valid FQDN"
           when: gluster_features_fqdn_check | default(true)

    By the way, the error message is not only shown in the Web UI but is also written to a log file: /var/log/cockpit/ovirt-dashboard/gluster-deployment.log

  • There was another issue that cost me more effort to track down. The deployment playbook was failing to add the firewalld rules for GlusterFS:
    TASK [gluster.infra/roles/firewall_config : Add/Delete services to firewalld rules] ***
    task path: /etc/ansible/roles/gluster.infra/roles/firewall_config/tasks/main.yml:24
    failed: [loki.oasis.home] (item=glusterfs) => {"ansible_loop_var": "item", "changed": false, "item": "glusterfs", "msg": "ERROR: Exception caught: org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs' not among existing services Permanent and Non-Permanent(immediate) operation, Services are defined by port/tcp relationship and named as they are in /etc/services (on most systems)"}

    Indeed firewalld doesn't know about a 'glusterfs' service:

    # rpm -q firewalld
    firewalld-0.7.0-5.el8.noarch
    # firewall-cmd --get-services | grep glusterfs
    #
    

    Is this an old version? Where do I get the 'glusterfs' firewalld service definition from? The solution is as simple as it is embarrassing. I found that the service definition is packaged as part of the glusterfs-server RPM, which was still missing on my server. After installing it this issue was solved as well.
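The failing conditional in hc_wizard.yml from the list above can be reproduced in plain Python, because the quoted message is the TypeError that Python 3 raises when a list is compared with an integer. The sketch below only mirrors the two variants of the condition. The function names has_output_buggy and has_output_fixed are made up for illustration and are not part of the playbook.

```python
def has_output_buggy(stdout_lines):
    # Mirrors the original condition: result.results[0]['stdout_lines'] > 0
    return stdout_lines > 0


def has_output_fixed(stdout_lines):
    # Mirrors the patched condition: ... ['stdout_lines'] | length > 0
    return len(stdout_lines) > 0


error_message = None
try:
    has_output_buggy(["loki.oasis.home"])
except TypeError as exc:
    # Python 3 refuses to order a list against an int
    error_message = str(exc)

if __name__ == "__main__":
    print(error_message)
    print(has_output_fixed(["loki.oasis.home"]))
    print(has_output_fixed([]))
```

Under Python 2 a comparison between a list and an integer did not raise an error and a list always compared as greater than a number, so the original check may have silently passed when the playbook ran on an older interpreter. That is only a possible explanation; I have not verified it against the playbook's history.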

Eventually the deployment succeeded and the CentOS 8 host was converted into a single node GlusterFS storage cluster:

# gluster volume status
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/data/
data                                        49153     0          Y       23806
 
Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/engin
e/engine                                    49152     0          Y       23527
 
Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/vmsto
re/vmstore                                  49154     0          Y       24052
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks

Conclusion

I'm super pleased with the installation experience so far. The new GlusterFS Ansible roles had no issues setting up the bricks and volumes. The Cockpit Web-GUI was easy to use and always clearly communicated what was going on. There is now a supported configuration of a one node hyperconverged oVirt setup which makes me happy too. Kudos to everyone involved with this, great work!

Now let's continue with the oVirt self-hosted engine setup.

Nov 06 2019
 

My activities in October were mostly related to updating my COPR repositories for CentOS 8 and cleaning up the old repositories:

  • I updated the ganto/jo COPR repository to support CentOS 8.
  • I updated the ganto/vcsh COPR repository to support CentOS 8 and added package builds for the alternative architectures (aarch64 and ppc64le).
  • Thanks to the help of jmontleon I was finally able to build LXD which is available in my ganto/lxc3 repository for CentOS 8. I also updated the RPM for the latest stable release LXD 3.18.
  • After years of development the distrobuilder tool, which is meant to replace the shell-script-based LXC templates, was tagged in a first 1.0 release that should now also be able to build CentOS 8 container images. Of course I updated the corresponding RPM in the ganto/lxc3 COPR repository accordingly. I’m not sure how they decide when to do new releases, so I might go back to building regular git snapshot releases of this tool in the future.
  • I updated the ganto/goaccess COPR repository to support CentOS 8 and also increased the built goaccess version to a git snapshot from May 2019 based on version 1.3. Unfortunately the official Fedora package is still only at version 1.2. I was first testing the latest git snapshot but then found that it is affected by a bug (GitHub issue 1575) which would fail to render the access graphs properly.
  • The last COPR repository pending an update for CentOS 8 is ganto/umoci which still fails because of go-md2man missing from EPEL 8.
  • I deleted some outdated COPR repositories (ganto/lxc, ganto/lxd, ganto/lxdock) and archived the related GitHub repositories holding the RPM spec files.

I also experimented with adding Debian machines to a CentOS FreeIPA identity management server via Ansible. Years ago I wrote an Ansible role freeipa-client which was able to do that but still required manual setup of the Kerberos keytab on the client machine. I plan to replace that with a collection of new roles that blend in with DebOps as much as possible. Unfortunately there is nothing ready to show yet.

Finally, as always, I updated a lot of ebuilds in my linuxmonk-overlay Gentoo overlay.

Oct 01 2019
 

I’m starting a new series of blog posts summarizing my various activities regarding free software projects. There might not be something worth mentioning every month, but this month I was quite busy with things that might be interesting for some of you.

Below I’ll list some of the free software activities I was involved in during September:

  • After the official release of CentOS 8, I started rebuilding the packages in my lxc3 COPR repository for CentOS 8. The lxd package is still missing and I’m planning to provide it for CentOS 8 together with the pending update to lxd-3.17. A rebuild of the packages in my various other COPR repositories can be expected in the coming weeks.
  • Being the package maintainer of the spectre-meltdown-checker package in Fedora and EPEL, I followed the instructions to request a package branch for epel-8. This was approved a few hours ago, so the package is now available via Koji and awaits approval in Bodhi for inclusion into the EPEL testing and eventually stable repository. Please give some karma if you’d like to accelerate this.
  • I merged some pull requests in the Gentoo go-overlay git repository where the original maintainer entrusted me with commit permissions. Because he hasn’t participated since last December, I took the chance to clean up the repository so that it passes the repoman checks again and eventually merged a PR for the latest traefik 1.x (1.7.18) release.
  • I put some effort into packaging the Gnome 3.34 release in my personal Gentoo linuxmonk-overlay. Of course I’m running it on my main workstation on top of Wayland without any major issues so far. Give it a try if you can’t wait for the official ebuilds to be ready.
  • I released version 0.1.2 of my acme-tiny Ansible role which fixes an annoying bug. It could happen that if the certificate renewal was unsuccessful, a still valid certificate would have been overwritten with an empty file. Now the role will make a backup copy of the old certificate by default and validate the new certificate before replacing the old one.

Dec 17 2018
 

The OpenShift community produces a lot of interesting tutorials about how to try new solutions and configurations but unfortunately they are mostly based on a minimal setup such as MiniShift, which is definitely a cool gimmick, but badly resembles a real cluster setup. Often those posts only concentrate on the known good path about how something is supposed to function in the best case. They rarely mention how it could be debugged or fixed if it doesn’t work as expected. As all of us know, the more complex a system is, the more can go wrong and this technology is no exception especially when run in a real distributed setup. To give you some insight in how such procedures can go wrong, I’d like to share the experience I made when I tried to update my multi-master/multi-node OKD cluster. As an experienced Linux engineer or developer you might think that version updates are nothing special or exciting, but this experience will disabuse you. I hit many issues and here is how I did it.

IMPORTANT: This is not a guide on how to upgrade OpenShift. It’s only a field report which is missing a lot of the technical details needed for a successful upgrade. Please always consult the official documentation.

Starting Position

At home I run a small OKD cluster consisting of three masters, each also hosting an etcd member, and four nodes, of which two are infrastructure nodes, hosting the routers and registry, and two are compute nodes, hosting the applications. I feel that’s the minimal setup required to resemble a production-like cluster. To make the setup a bit more interesting, the persistent storage is served by a container-native storage (CNS) configuration where three GlusterFS pods are distributed across the masters. This definitely deviates from how a production cluster should be set up, but as I’m running this locally at home, I don’t have enough resources available for separate storage nodes.

My masters and nodes are running atop CentOS Atomic Host which I updated to the latest 7.1811 release just a few days ago. As identity provider for OpenShift I’m using a FreeIPA server on a separate CentOS box. Since I installed this cluster with OpenShift Origin 3.9 five months ago, it has been running stably and I have had a lot of fun with it. After the recently published security advisories, the time has come to finally upgrade OpenShift.

What will change?

The first thing you should always do before starting an OpenShift update is to carefully read the release notes. I explicitly linked the Red Hat OpenShift Container Platform (OCP) release notes here, because OKD unfortunately doesn’t polish theirs as nicely. For the initial release update they are mostly congruent. Make sure to study them carefully, as they might be the primary source of information once something starts going wrong. For the 3.10 update, an important piece of information is the new handling of the containerized master controller and API services. Now we have a basic idea about what to expect.

Updating the Ansible Inventory

It would be nice if there was a fool-proof command to run the update, and it seems that OpenShift 4 with its Cluster Version Operator is heading there. But until then we need to carefully study and follow the official OKD 3.10 Upgrade Guide. It’s important to get the documentation for the correct release because the required adjustments to the Ansible inventory differ from release to release. For those who don’t know how OpenShift 3.x release upgrades work: they are done via Ansible playbooks which use the same inventory (the definition of how everything will be configured) as the initial OpenShift cluster installation.

In my inventory file, I first added the node group assignments. E.g. the infrastructure nodes are no longer defined via the openshift_node_labels variable, but via a dedicated openshift_node_group_name variable which references a node group definition from the openshift_node_groups configuration. The same change also has to be made for the master and the compute nodes:

  • OpenShift 3.9:
    [infra-nodes:vars]
    # Set region to be dedicated for infrastructure pods
    openshift_node_labels={'region': 'infra', 'zone': 'default'}
    
  • OpenShift 3.10:
    [infra-nodes:vars]
    # Set infra node group
    openshift_node_group_name='node-config-infra'
    

Note that although the openshift_node_labels variable is no longer effective, no labels will be removed during the upgrade. So if you don’t get the label definition right at the beginning you don’t have to worry that after an in-place upgrade some workload is suddenly not scheduled anymore.

I had some custom openshift_node_kubelet_args defined in my OpenShift 3.9 inventory but this variable is also not respected any longer. With 3.10 the correct way to customize the node configuration is to define an edits argument in the corresponding node group definition, which is then applied to a ConfigMap resource by the custom yedit Ansible module. While writing such a definition is already not very intuitive, it can only be done by re-defining the entire openshift_node_groups variable, possibly also breaking every other node group definition if done wrong. For the moment, I chose to drop my custom node configuration entirely to make the inventory less error prone.
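
Since both openshift_node_labels and openshift_node_kubelet_args are silently ignored in 3.10, it can help to scan the inventory for them before starting. The following is only a minimal sketch under my own assumptions: the function name deprecated_inventory_vars is made up and it is not part of openshift-ansible. It reads an INI inventory on stdin and prints whichever of the two variables are still assigned.

```shell
# Hypothetical helper (not part of openshift-ansible): read an Ansible INI
# inventory on stdin and print each of the two 3.9-era variables that the
# text above describes as no longer respected by the 3.10 playbooks.
deprecated_inventory_vars() {
  awk '
    /^[ \t]*[#;]/ { next }            # skip comment lines
    {
      # Check every whitespace-separated token for a "name=value" pair.
      for (i = 1; i <= NF; i++) {
        if ($i !~ /=/) continue
        split($i, kv, "=")
        if (kv[1] == "openshift_node_labels" || kv[1] == "openshift_node_kubelet_args")
          seen[kv[1]] = 1
      }
    }
    END {
      if ("openshift_node_kubelet_args" in seen) print "openshift_node_kubelet_args"
      if ("openshift_node_labels" in seen) print "openshift_node_labels"
    }
  '
}
```

Assumed usage, with a hypothetical inventory path: deprecated_inventory_vars < /etc/ansible/hosts. An empty result means neither variable is set anymore.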

Before running the upgrade playbooks it’s also important that every manual configuration change made in the past (e.g. in the master-config.yaml) is reflected somewhere in the Ansible inventory. Otherwise the change might be lost after the upgrade. In my setup I still had to add the LDAP authenticator to the openshift_master_identity_providers variable because I had added it manually after the initial cluster installation.

The section about Special Considerations When Using Containerized GlusterFS gave me a bit of a bad feeling as my GlusterFS pods are running on the control-plane hosts. But it’s not an easy task to change that now, so I chose to still go on with the upgrade and hope for a work-around in case something should break.

Fixing a failed CNS Brick Process

Once I felt confident that my inventory was in good shape, I started the control-plane upgrade playbook located at playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml. After only a few minutes it failed for the first time. The error message said that my hosted registry persistent volume is not healthy. That is good news, because the playbook properly detected my unconventional CNS setup and was even able to check its health. Fortunately, this issue was already familiar to me and it was easily fixed. Here is how you do this:

  1. Change to the glusterfs project (or any custom project where you are running the GlusterFS pods) as a project or cluster administrator and query the names of the GlusterFS pods:
    $ oc project glusterfs
    Now using project "glusterfs" on server "https://openshift.example.com:8443".
    $ oc get pods -n glusterfs -o wide
    NAME                      READY     STATUS    RESTARTS   AGE       IP            NODE
    glusterfs-storage-dz8qj   1/1       Running   3          2d        10.0.0.10     master01.example.com
    glusterfs-storage-jncsl   1/1       Running   3          1d        10.0.0.11     master02.example.com
    glusterfs-storage-r24wg   1/1       Running   2          11h       10.0.0.12     master03.example.com
    heketi-storage-1-w8c42    1/1       Running   1          6d        10.129.0.21   node02.example.com
    
  2. Connect to one of the GlusterFS pods and list the volume status. E.g.:
    $ oc rsh glusterfs-storage-dz8qj
    sh-4.2# gluster volume status
    Status of volume: glusterfs-registry-volume                                 
    Gluster process                             TCP Port  RDMA Port  Online  Pid  
    ------------------------------------------------------------------------------
    Brick 10.0.0.11:/var/lib/heketi/mounts/vg_61                                 
    bc5a26248cc6ea9fb7ffaae4edbe93/brick_128dfe                                 
    b5436dad25702e689d3d6f4b8a/brick            49152     0          Y       204
    Brick 10.0.0.12:/var/lib/heketi/mounts/vg_5e                                   
    cf67ed85f71cf28090d7db1acc6433/brick_d36469                                 
    4c7034276c22e264eb2576413b/brick            49152     0          N
    Brick 10.0.0.10:/var/lib/heketi/mounts/vg_49                                   
    e81b0f4c91942c7657e9b7ffff7834/brick_8717e0                                 
    992390a7c04431890ba56b7656/brick            49152     0          Y       224
    Self-heal Daemon on localhost               N/A       N/A        Y       215
    Self-heal Daemon on 10.0.0.11               N/A       N/A        Y       163  
    Self-heal Daemon on 10.0.0.12               N/A       N/A        Y       172  
                                                                                
    Task Status of Volume glusterfs-registry-volume                             
    ------------------------------------------------------------------------------
    There are no active volume tasks
    [...]
    

    In my experience, a brick sometimes displays an N in the Online column, which means that the corresponding brick process wasn’t started successfully. If multiple bricks of the same volume are down, your entire volume is down and must be properly recovered. In such a case don’t continue with the steps below!

  3. Via the IP address of the brick, you can figure out which host is affected and then you can simply delete the corresponding pod:
    $ oc delete pod glusterfs-storage-r24wg
    

    The pod will be automatically restarted and the brick processes should come up this time.
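
Spotting the N by eye in the wrapped output is easy to get wrong, so a small filter could do it instead. This is a sketch only: the function name offline_bricks is hypothetical, and it assumes the column layout shown in step 2, where a long brick path wraps over several lines and the Online flag sits on the last of them.

```shell
# Hypothetical helper: read `gluster volume status` text on stdin and print
# the IP address of every brick whose Online column shows "N".
# Assumes the layout shown above (brick path may wrap over several lines).
offline_bricks() {
  awk '
    # A new brick entry starts here; remember the host part of "IP:/path".
    $1 == "Brick" { split($2, parts, ":"); ip = parts[1] }

    # The last line of a brick entry ends in "Y <pid>" or a bare "N".
    ip != "" && /[ \t][YN]([ \t]+[0-9]+)?[ \t]*$/ {
      online = ($NF ~ /^[0-9]+$/) ? $(NF - 1) : $NF
      if (online == "N") print ip
      ip = ""     # ignore following lines (e.g. Self-heal Daemon rows)
    }
  '
}
```

Assumed usage from a client machine: oc rsh glusterfs-storage-dz8qj gluster volume status | offline_bricks. With the output from step 2 it would print 10.0.0.12, which maps to the pod on master03.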

Fixing the Hosted Registry Storage Definition

The second run of the control-plane playbook eventually attested that all GlusterFS volumes are healthy but again it failed only two tasks later with a rather cryptic error message, something like:

TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] **********************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
FAILED - RETRYING: Check for GlusterFS cluster health (120 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "cluster_name": "registry", 
            "exclude_node": "master01.example.com", 
            "oc_bin": "/usr/local/bin/oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig", 
            "oc_namespace": "default"
        }

Didn’t it just say that all the GlusterFS volumes are healthy? What the heck is "cluster_name": "registry" and what is it doing with GlusterFS in the ‘default’ namespace anyway?

I found the solution after digging deep into the openshift_storage_glusterfs Ansible role and reading the CNS installation instructions again and again. I had become a victim of my “simplified” CNS setup. The reference installation is meant to have two separate CNS GlusterFS clusters: one exclusively for the hosted registry volume (hinted by the [glusterfs_registry] Ansible host group) and a second cluster for any other persistent volumes (hinted by the [glusterfs] Ansible host group). As mentioned before, I’m limited in available hosts, so I added the master hosts to both host groups and set the glusterfs_devices variable to the same device when installing the CNS. That was everything needed to create the hosted registry volume with OpenShift 3.9 in the “regular” CNS cluster. However, the 3.10 playbook expects the registry volume to be in a different project with a different naming. Fortunately all that was needed to fix this were some additional inventory variables in the [OSEv3:vars] section:

# Adjust variables for registry storage to match default converged glusterfs storage setup
openshift_storage_glusterfs_registry_name=storage
openshift_storage_glusterfs_registry_namespace=glusterfs

With the updated inventory I started the control-plane upgrade playbook once more. This time it ran for quite a while and even started to do some real stuff. It replaced the docker run command in the ‘origin-node’ systemd service with a runc command using the 3.10 image. Finally some progress. But eventually another error aborted the playbook and again it was a totally unexpected one.

etcd Backup Failure

Before updating the etcd cluster, there is a task which backs up the etcd database and this failed miserably. It couldn’t run docker exec etcd_container etcdctl backup [...]. When executing the command manually on a master host, I received the same error message:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

My first suspicion was the Ansible role. Maybe the backup command is wrong? But I couldn’t find any radical changes in the commit history regarding the etcd backup, and a blocking issue like this is unlikely to go unnoticed for such a long time. Maybe something is wrong with the image? The OpenShift Origin 3.9 setup was using a rather atypical image at that time, the only one from the Fedora image registry (registry.fedoraproject.org/latest/etcd:latest). When I see the ‘latest’ tag being used with containers I’m instantly suspicious that bugs might sneak in unnoticed, as different users may get different images depending on when they pull them. Maybe they mistakenly pushed an image without a shell or without the etcdctl binary? So I asked in the #openshift IRC channel on Freenode if someone had experienced the same issue before, but didn’t get any reply. Suddenly I had an idea: only a few hours before, I had used the etcdctl tool from the Atomic host to do my own etcd backup. I just needed to find a way to make Ansible use the etcdctl from the host and everything would be fine. So I dug a bit in the Ansible etcd role and a few minutes later I set r_etcd_common_etcdctl_command to "etcdctl" in my inventory, confident that this would fix my issue. It won’t, but I won’t find out anytime soon…

The Master API cannot find the LDAP CA Certificate

In the next attempt, the playbook happily ran the etcd backup, upgraded the etcd images, converted them to static pods on all masters and did the same for the other two control plane services, the API service and the controller service, starting on the first master. Eventually the ‘origin-master-api’ and ‘origin-master-controller’ services were shut down and the corresponding pods should have been started, so the playbook was waiting for the API service to come up and waited and waited… The pod didn’t come up. Hmpf. Time has already come to use the new debugging command I read about in the release notes to see what’s going on:

# /usr/local/bin/master-logs api api

That is an alternative for the corresponding oc command that I’m also able to run from my client machine:

$ oc logs master-api-master01.example.com -n kube-system

But the latter one was behaving weirdly. Sometimes it hung although the API services of the two other masters were still up. There is definitely something wrong.

When checking the logs locally, I saw an error that my FreeIPA CA certificate, which should be used to validate the LDAPS connections, could not be found. That’s strange. I explicitly configured the ca key in the openshift_master_identity_providers variable, pointing it to the correct CA certificate. I did this in other OpenShift cluster inventories before and there it was working… But those were not running OpenShift 3.10 or later. With 3.10 the playbook developers removed the possibility to custom-name the CA certificate, so the ca key from the inventory was silently ignored. Only after checking the installation instructions regarding Configuring identity providers with Ansible did I find an inconspicuous comment that the CA certificate destination path now follows a given naming convention. When I was adding the identity provider configuration to the inventory before the update, I didn’t specify the openshift_master_ldap_ca or openshift_master_ldap_ca_file variables which would ensure that the CA certificate is copied to the correct place. After all, the certificates were already on the master hosts and the identity provider was working, so I didn’t want the upgrade playbook to touch the certificate. Now that’s the result: my mistake. Still, I like issues that are clear and can be fixed so easily. A quick rename of the certificate on the master hosts made the API service start successfully again.

How a Docker Bug broke the etcd Cluster

All API servers are running again, although only the first one in the final configuration, but the oc command invocations still feel sluggish. Sometimes they even hang completely. When checking the process list, it caught my attention that the etcd processes are only a few minutes old and sometimes they are not running at all. So I checked the etcd cluster health and there it is: two cluster members are down and one is in the state unhealthy. That is bad… Immediately, I started manually triggering etcd restarts. But only a few minutes later they shut down again. I checked the log files and there were errors, but I couldn’t pin down a single cause for this mess. Then I found that /etc/etcd/etcd.conf was updated during the playbook run, so I restored the backup, but again that didn’t fix the issue. Eventually I started to accept the thought that I might need to completely restore the etcd database from a backup, because the database might already be so corrupt in the meantime that it is not able to find a stable state anymore.

The OKD documentation for Restoring etcd quorum would be the correct guide to follow in this situation, but for some reason I landed at Restoring etcd. That confronted me with yet another issue: this guide was not yet properly updated for OpenShift 3.10. Some parts of the documentation still reference etcd as a systemd service. But in my setup it’s a pod. Trying to pass the --force-new-cluster parameter to the etcd process via a systemd override obviously doesn’t have any effect. Eventually I found out about the /etc/origin/node/pod/etcd.yaml file which contains the pod definition. Here we are able to correctly pass the parameter so that it is picked up by the pod startup command. But again, even with an empty database, a few minutes later my pod would die again. Something is badly broken here. In the YAML definition I also found the liveness probe. So once the pod was started once more, I tried to execute the liveness probe to see what it returns and the result looked familiar, in a bad way:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

Ouch! Now I understand why the pods keep restarting. It’s again the same error that also caused the etcd backup failure before. But now I’m using the new quay.io/coreos/etcd:v3.2.22 image. This disproved my theory that a buggy image might be the reason. For the moment, I ran out of ideas… Until I remembered that I recently read a post about a docker bug (#1655214) that affected CentOS 7. Thanks for that! After checking the docker version on Atomic host 7.1811 (docker-1.13.1-84.git07f3374.el7.centos) it’s confirmed. That’s the root cause for so much trouble so far.

Updating Docker on Atomic Host

I didn’t need to dig too much into Atomic Host so far, as most of the stuff was simply working or was easily fixed with an update in the past. But this time it didn’t look like an update was imminent. Release 7.1811 was only a few days old. I could roll back, but the previous version is 7.1808. That’s three months back and somehow defeats the purpose of my update, which was to get the latest security fixes. Fortunately CentOS already released new docker packages where this bug is fixed. Now I just need to find a way to update the docker packages independently from the ostree image. This time the documentation gods were on my side. I quickly found Dusty Mabe’s Atomic Host 101 Lab Part 4: Package Layering, Experimental Features.

Here is my guide for quickly working around Bug #1655214 by updating the docker packages to release 1.13.1-88.git07f3374.el7.centos on CentOS Atomic Host:

  1. Create a temporary directory and download the corresponding RPM packages from a mirror of your choice:
    # mkdir /tmp/docker-1.13.1-88
    # cd /tmp/docker-1.13.1-88
    # for pkg in docker docker-client docker-common docker-lvm-plugin docker-novolume-plugin ; do \
        curl -O https://mirror.init7.net/centos/7/extras/x86_64/Packages/$pkg-1.13.1-88.git07f3374.el7.centos.x86_64.rpm ; \
      done
    
  2. From within the directory run rpm-ostree override replace to replace the docker packages from the ostree layer with the new RPMs:
    # rpm-ostree override replace docker*
    Checking out tree ee5a6f2... done
    Inactive requests:
      docker (already provided by docker-2:1.13.1-84.git07f3374.el7.centos.x86_64)
    Enabled rpm-md repositories: base updates extras
    Updating metadata for 'base': [=============] 100%
    rpm-md repo 'base'; generated: 2018-11-25 16:00:34
    Updating metadata for 'updates': [=============] 100%
    rpm-md repo 'updates'; generated: 2018-12-10 15:34:27
    Updating metadata for 'extras': [=============] 100%
    rpm-md repo 'extras'; generated: 2018-12-10 16:00:03
    Importing metadata [=============] 100%
    Resolving dependencies... done
    Relabeling (5/5) [=============] 100%
    Applying 5 overrides
    Processing packages (10/10) [=============] 100%
    Running pre scripts... 1 done
    Running post scripts... 5 done
    Writing rpmdb... done
    Writing OSTree commit... done
    Copying /etc changes: 42 modified, 5 removed, 613 added
    Transaction complete; bootconfig swap: no; deployment count change: 0
    Freed: 580.5 kB (pkgcache branches: 0)
    Upgraded:
      docker 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-client 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-common 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-lvm-plugin 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
      docker-novolume-plugin 2:1.13.1-84.git07f3374.el7.centos -> 2:1.13.1-88.git07f3374.el7.centos
    Run "systemctl reboot" to start a reboot
    
  3. Reboot the host.
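
After the reboot it is worth confirming that the host really runs the newer docker build. The helpers below are a rough sketch with made-up names (docker_release, docker_has_fix). They take a package string as printed by rpm -q docker and compare its release number against 88. Only release 84 is confirmed as broken in this report, so treating everything below 88 as affected is an assumption.

```shell
# Hypothetical helper: extract the numeric release (e.g. 84 or 88) from a
# docker package string such as the output of `rpm -q docker`.
# An optional epoch prefix like "2:" is tolerated.
docker_release() {
  printf '%s\n' "$1" |
    sed -n 's/^docker-\([0-9]*:\)\{0,1\}[0-9.]*-\([0-9][0-9]*\)\..*$/\2/p'
}

# Succeeds when the release is at least 88, the build named above as fixed.
# Assumption: every older release is treated as affected.
docker_has_fix() {
  rel=$(docker_release "$1")
  [ -n "$rel" ] && [ "$rel" -ge 88 ]
}
```

Assumed usage on each master after the reboot: docker_has_fix "$(rpm -q docker)" && echo "docker is patched".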

I carefully did this on one master server after the other and surprisingly all the services (except etcd) started normally. Even my GlusterFS pods came up again as if nothing had happened. But still, the etcd cluster was offline and with it the Master API was inaccessible. No oc commands were possible.

Fixing the etcd Cluster

With the docker issue fixed, I now had to bring up the etcd cluster again. The database was likely in a confused state because of all the failed attempts before, so I decided to restore a known good state. As briefly mentioned before, to do so, you actually create a new cluster with the database from a backup. Because the OpenShift documentation on how to do this cannot be followed easily, I list below the exact steps I took:

  1. Make sure that all the etcd processes are down and not coming up again automatically. On an OpenShift 3.10 cluster, you prevent automatic startup by moving the /etc/origin/node/pod/etcd.yaml definition to a backup location e.g. /etc/origin/node/pod/disabled/ on every etcd host.
  2. First create a new one node etcd cluster on the first etcd host. To do so, we need some preparation:
    • The /etc/etcd/etcd.conf configuration must not contain any previous configurations regarding the INITIAL_CLUSTER or INITIAL_CLUSTER_STATE. I was able to simply use the etcd.conf generated by the upgrade playbook which already set those two variables to the correct value:
      ETCD_INITIAL_CLUSTER=
      ETCD_INITIAL_CLUSTER_STATE=new
      

      Also make sure, the ETCD_INITIAL_ADVERTISE_PEER_URLS only contains the URL of the first host itself and no other peers:

      ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.10:2380
      
    • Restore the etcd database from a backup. Fortunately the upgrade playbook automatically created a backup after the etcd upgrade, so I’m going to restore to that state:
      # mv /var/lib/etcd/member /var/lib/etcd/member.orig
      # cp -rP /var/lib/etcd/openshift-backup-post-3.0-20181214022846/member /var/lib/etcd/
      
    • When starting the first etcd member for the first time, we need to pass the --force-new-cluster argument to the process. This will override the cluster definition from the database files. To do so, the etcd.yaml file has to be adjusted. Here is the important snippet (everything else should be kept as it is):
      spec:
        containers:
        - args:
          - '#!/bin/sh
      
            set -o allexport
      
            source /etc/etcd/etcd.conf
      
            exec etcd --force-new-cluster
      
            '
      
    • If everything is ready to start the etcd process, move the altered etcd.yaml file back to the /etc/origin/node/pod directory. Within a few moments, the pod should start up and create a new cluster.
  3. Check the initial cluster state via:
    # etcdctl2 cluster-health
    member 67aa8b8cc701 is healthy: got healthy result from https://10.0.0.10:2379
    

    If something went wrong, you might want to check the logs via:

    # /usr/local/bin/master-logs etcd etcd
    
  4. Initially the first member still advertises a PeerURL pointing to ‘localhost’:
    # etcdctl2 member list
    67aa8b8cc701: name=master01.example.com peerURLs=http://localhost:2380 clientURLs=https://10.0.0.10:2379 isLeader=true
    

    This must be updated to the correct host URL pointing to itself:

    # etcdctl2 member update 67aa8b8cc701 https://10.0.0.10:2380
    Updated member with ID 67aa8b8cc701 in cluster
    

    Then it correctly shows:

    # etcdctl2 member list
    67aa8b8cc701: name=master01.example.com peerURLs=https://10.0.0.10:2380 clientURLs=https://10.0.0.10:2379 isLeader=true
    
  5. This configuration was automatically saved in the database. So the --force-new-cluster argument can be removed again. Edit the etcd.yaml in-place to restore the original configuration. After doing so, restart the etcd process with:
    # /usr/local/bin/master-restart etcd
    

    If it comes up again and shows healthy, we can continue to add the other two cluster members.

  6. The following steps to add another cluster member obviously have to be done for both other etcd hosts:
    1. Add the new host to the cluster by executing the following command on the first etcd host:
      # etcdctl2 member add master02.example.com https://10.0.0.11:2380
      Added member name master02.example.com with ID a6b2e8d0d392083b to cluster
      
      ETCD_NAME="master02.example.com"
      ETCD_INITIAL_CLUSTER="master01.example.com=https://10.0.0.10:2380,master02.example.com=https://10.0.0.11:2380"
      ETCD_INITIAL_CLUSTER_STATE="existing"
      

      The new member will then be displayed as ‘unstarted’ in the member list.

    2. Prepare the /etc/etcd/etcd.conf file on the new etcd host by defining the variables as shown in the output of the etcdctl2 member add command above. The ETCD_INITIAL_CLUSTER value will automatically be extended with each new member added to the cluster.
    3. Delete the old database on the new etcd host. It will automatically be synced from the other cluster members once the new node has joined:
      # mv /var/lib/etcd/member /var/lib/etcd/member.orig
      
    4. Enable the etcd pod by moving the etcd.yaml back to /etc/origin/node/pod. Within a few minutes the etcd process should be started and eventually join the etcd cluster.
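
Two parts of this procedure are repeated on every etcd host and are easy to script: parking and restoring the static pod manifest, and checking whether a member still advertises a localhost peer URL. The following is a minimal sketch. All three function names are hypothetical, and the paths are passed in as arguments rather than hard-coded.

```shell
# Hypothetical helpers; the function names are made up for this sketch.

# Move a static pod manifest into a "disabled" subdirectory so the node
# stops (re)starting that pod. $1 = manifest directory, $2 = file name.
disable_static_pod() {
  mkdir -p "$1/disabled" && mv "$1/$2" "$1/disabled/$2"
}

# Move the manifest back so the pod is started again.
enable_static_pod() {
  mv "$1/disabled/$2" "$1/$2"
}

# Read `etcdctl2 member list` output on stdin and print the ID of every
# member that still advertises a peer URL on localhost.
members_with_localhost_peer() {
  awk '/peerURLs=[^ ]*localhost/ { sub(/:$/, "", $1); print $1 }'
}
```

Assumed usage: disable_static_pod /etc/origin/node/pod etcd.yaml before touching the database, enable_static_pod /etc/origin/node/pod etcd.yaml afterwards, and etcdctl2 member list | members_with_localhost_peer to find the member ID that needs the etcdctl2 member update from step 4.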

Once the etcd cluster was restored, the oc command was finally working again and I could check the state of the etcd pods also via OpenShift client:

$ oc get pods -n kube-system | grep etcd
master-etcd-master01.example.com          1/1       Running   5          1h
master-etcd-master02.example.com          1/1       Running   0          47m
master-etcd-master03.example.com          1/1       Running   0          2m

During the entire time the etcd cluster was down the OpenShift cluster continued running. The registry, routers and applications such as my Gitea setup were online all the time and even the CNS cluster running on the master hosts handled the debugging and restart session with bravery. Fortunately I had a super static setup during that time and so no deployments or replica count enforcement needed to be executed which would have been impossible anyway. Still I feel it’s a positive fact that shows the resiliency the platform has gained over time.

Finishing the Control Plane Upgrade

After a longer detour, I was finally back at the point where I could start another run of the control plane upgrade playbook. Remember, when the playbook aborted earlier, it had only upgraded the control plane services on the first master node, so there were still two to go. So I started the playbook once again.

By now I have a really good feeling about the state of the playbook in this release. As you can see above, it failed on me many times in all different stages of the update, but it always had a good reason and it was always able to pick up where it left off. My experience with initial upgrade attempts of earlier OpenShift releases was unfortunately not always that good. For example, I once had to restore a master host from a snapshot because the playbook failed to correctly detect the upgrade state in the second run, after it aborted the first run due to a syntax error in a post-upgrade task.

This time the playbook finished successfully and my control plane was finally at release 3.10:

# /usr/local/bin/oc version
oc v3.10.0+c99b16a-90
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://openshift.example.com:8443
openshift v3.10.0+c99b16a-90
kubernetes v1.10.0+b81c8f8

Running the Node Upgrade Playbook

After the control-plane was done, I had to upgrade the infrastructure and compute nodes. A separate playbook placed at playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml is available. Initially I only wanted to run it on a single node to make sure everything works as expected. This can be done by passing the -e openshift_upgrade_nodes_label=kubernetes.io/hostname=node03.example.com argument, where the given host name is obviously the node that should be upgraded, to the playbook execution command. The playbook completed without error already on the first attempt. So I continued with the other nodes.

One fact is super important when upgrading the nodes to OpenShift 3.10. The /etc/origin/node/node-config.yaml is completely regenerated based on the settings in the corresponding node group (and/or the defaults) and so any prior adjustment not reflected in the inventory is lost. Therefore make sure that you perfectly understand the Node Group concept and how it affects your node layout and configuration.

To give you an example of how to customize the upgrade behavior on the infrastructure nodes, I added the following arguments to the playbook execution: -e openshift_upgrade_nodes_label=region=infra -e openshift_upgrade_nodes_serial=50%.
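
Put together, the full invocation could be assembled as in the sketch below. The wrapper name node_upgrade_cmd and the inventory path in the usage note are assumptions of mine. The function only prints the command so it can be reviewed before it is run from the openshift-ansible checkout.

```shell
# Hypothetical wrapper: print (not run) the ansible-playbook command for
# upgrading the nodes selected by a label.
#   $1 = inventory file, $2 = node label selector, $3 = optional serial value
node_upgrade_cmd() {
  inventory=$1
  label=$2
  serial=${3:-}
  cmd="ansible-playbook -i $inventory playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml"
  cmd="$cmd -e openshift_upgrade_nodes_label=$label"
  # Only add the serial setting when one was given.
  if [ -n "$serial" ]; then
    cmd="$cmd -e openshift_upgrade_nodes_serial=$serial"
  fi
  printf '%s\n' "$cmd"
}
```

Assumed usage: node_upgrade_cmd /etc/ansible/hosts region=infra 50% for the infrastructure nodes, or node_upgrade_cmd /etc/ansible/hosts kubernetes.io/hostname=node03.example.com for a single node.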

Fixing the Infrastructure Node Selector

It confused me that the NodeSelector of the infrastructure components such as the registry and routers was not updated to the new defaults. In the inventory I explicitly defined the new node selector:

openshift_hosted_registry_selector='node-role.kubernetes.io/infra=true'

But when checking the DeploymentConfig of the registry, I can still find the old NodeSelector:

$ oc get dc docker-registry -n default -o json | jq .spec.template.spec.nodeSelector
{
  "region": "infra"
}

So I manually triggered an update of the NodeSelector property in all the DeploymentConfigs using it. E.g.:

$ oc patch dc docker-registry -n default --type json --patch '[{"op":"replace","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/infra":"true"}}]'
deploymentconfig "docker-registry" patched

NodeSelectors can also be set in DaemonSets, as annotations in projects or even globally via master-config.yaml. Therefore make sure to update them all, when required, before removing any labels from the nodes.
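
To find the DeploymentConfigs that still carry the old selector, the output of oc get dc can be filtered. This is a sketch under assumptions: the function name dcs_with_old_selector is made up, and it only covers DeploymentConfigs that use the region=infra selector, not DaemonSets, project annotations or the global default.

```shell
# Hypothetical filter: read "NAMESPACE NAME REGION" rows on stdin and print
# namespace/name for every row whose region selector is still "infra".
dcs_with_old_selector() {
  awk '$3 == "infra" { print $1 "/" $2 }'
}

# Assumed usage against a live cluster. The custom-columns output prints
# <none> for DeploymentConfigs without a region selector:
#   oc get dc --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REGION:.spec.template.spec.nodeSelector.region \
#     | dcs_with_old_selector
```

Each line of the result names one DeploymentConfig that still needs a patch like the one shown above.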

After checking that all the pods are up and running again, I was finally able to remove the old infrastructure labels from the nodes:

$ oc label node node01.example.com region- zone-

Summarizing

This was not my first OpenShift update ever, but it was my first update from 3.9 to 3.10. This obviously means that I made some mistakes and had some wrong assumptions, from which I learned a lot. I hope I could share some insights and useful hints for those of you who haven’t done this before. If nothing else, it will help me run this update on another cluster much more smoothly in the future.

At the end some advice for those of you who also need to do such an upgrade:

  • You need to have a test cluster where you can practice such updates. It doesn’t need to be big but the Ansible inventory variables should be structurally as similar as possible to those of the production cluster. As you saw above, a lot of errors just happened due to wrong inventory variables. Ideally the test cluster should have some workload so that you experience how the applications behave during the update and so that you can test if everything still works after an upgrade.
  • Emphasize your Ansible inventory. Every part of your configuration that can fit into the Ansible inventory must be defined there and must be maintained there. It can cost you a lot of time debugging or even result in application downtime during an upgrade if you manually updated the cluster configuration without adjusting the configuration in the inventory. Even when it sometimes feels like it’s more work than benefit, it’s always worth it.
  • Preparation is key. Carefully read through the upstream documentation available. Most likely you also have some internal documentation where your infrastructure specifics are written down. Run the upgrade on a test cluster before you do it in production. If it doesn’t work on the first attempt, update your notes and try it again. Try to gain as much experience as possible on the test infrastructure so that you already know what to do if something goes wrong in production.
  • Plan a lot of time. Doing such an upgrade is a lot of work! Give yourself enough time to do a proper preparation and also the actual upgrade window itself should give you enough time to fix issues when they arise. Plan in the scale of hours or better days. Ansible is slow. If you have to restart the playbook because of an error after 15 minutes this will eat up your time fast.

Thanks for reading. As always I’d welcome some feedback or criticism in the comments.

Feb 15 2018
 

The recently disclosed Spectre and Meltdown CPU vulnerabilities are some of the most dramatic security issues in recent computing history. Fortunately, even six weeks after public disclosure, sophisticated attacks exploiting these vulnerabilities are not yet commonly observed. Fortunately, because the hardware and software vendors are still struggling to provide appropriate fixes.

If you happen to run a Linux system, an excellent tool for tracking your vulnerability as well as the already active mitigation strategies is the spectre-meltdown-checker script originally written and maintained by Stéphane Lesimple.

Over the last month I set myself the goal of bringing this script to Fedora and EPEL so that it can be easily consumed by Fedora, CentOS and RHEL users. Today the spectre-meltdown-checker package was finally added to the EPEL repositories, after having been available in the Fedora stable repositories for a week.

On Fedora, all you need to do is:

dnf install spectre-meltdown-checker

After enabling the EPEL repository on CentOS this would be:

yum install spectre-meltdown-checker

The script, which should be run by the root user, will report:

    • If your processor is affected by the different variants of the Spectre and Meltdown vulnerabilities.
    • If your processor microcode tries to mitigate the Spectre vulnerability or if you run a microcode which is known to cause stability issues.
    • If your kernel implements the currently known mitigation strategies and if it was compiled with a compiler that hardens it even further.
    • And finally, if you’re (still) affected by some of the vulnerability variants.

On my laptop this currently looks like this (note that I’m not running the latest stable Fedora kernel yet):

    # spectre-meltdown-checker                                                                                                                                
    Spectre and Meltdown mitigation detection tool v0.33                                                                                                                      
                                                                                                                                                                              
    Checking for vulnerabilities on current system                                       
    Kernel is Linux 4.14.14-200.fc26.x86_64 #1 SMP Fri Jan 19 13:27:06 UTC 2018 x86_64   
    CPU is Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz                                      
                                                                                                                                                                              
    Hardware check                            
    * Hardware support (CPU microcode) for mitigation techniques                         
      * Indirect Branch Restricted Speculation (IBRS)                                    
        * SPEC_CTRL MSR is available:  YES    
        * CPU indicates IBRS capability:  YES  (SPEC_CTRL feature bit)                   
      * Indirect Branch Prediction Barrier (IBPB)                                        
        * PRED_CMD MSR is available:  YES     
        * CPU indicates IBPB capability:  YES  (SPEC_CTRL feature bit)                   
      * Single Thread Indirect Branch Predictors (STIBP)                                                                                                                      
        * SPEC_CTRL MSR is available:  YES    
        * CPU indicates STIBP capability:  YES                                           
      * Enhanced IBRS (IBRS_ALL)              
        * CPU indicates ARCH_CAPABILITIES MSR availability:  NO                          
        * ARCH_CAPABILITIES MSR advertises IBRS_ALL capability:  NO                                                                                                           
      * CPU explicitly indicates not being vulnerable to Meltdown (RDCL_NO):  UNKNOWN    
      * CPU microcode is known to cause stability problems:  YES  (Intel CPU Family 6 Model 61 Stepping 4 with microcode 0x28)                                                
                                              
    The microcode your CPU is running on is known to cause instability problems,         
    such as intempestive reboots or random crashes.                                      
    You are advised to either revert to a previous microcode version (that might not have
    the mitigations for Spectre), or upgrade to a newer one if available.                
    
    * CPU vulnerability to the three speculative execution attacks variants
      * Vulnerable to Variant 1:  YES 
      * Vulnerable to Variant 2:  YES 
      * Vulnerable to Variant 3:  YES 
    
    CVE-2017-5753 [bounds check bypass] aka 'Spectre Variant 1'
    * Mitigated according to the /sys interface:  NO  (kernel confirms your system is vulnerable)
    > STATUS:  VULNERABLE  (Vulnerable)
    
    CVE-2017-5715 [branch target injection] aka 'Spectre Variant 2'
    * Mitigated according to the /sys interface:  YES  (kernel confirms that the mitigation is active)
    * Mitigation 1
      * Kernel is compiled with IBRS/IBPB support:  NO 
      * Currently enabled features
        * IBRS enabled for Kernel space:  NO 
        * IBRS enabled for User space:  NO 
        * IBPB enabled:  NO 
    * Mitigation 2
      * Kernel compiled with retpoline option:  YES 
      * Kernel compiled with a retpoline-aware compiler:  YES  (kernel reports full retpoline compilation)
      * Retpoline enabled:  YES 
    > STATUS:  NOT VULNERABLE  (Mitigation: Full generic retpoline)
    
    CVE-2017-5754 [rogue data cache load] aka 'Meltdown' aka 'Variant 3'
    * Mitigated according to the /sys interface:  YES  (kernel confirms that the mitigation is active)
    * Kernel supports Page Table Isolation (PTI):  YES 
    * PTI enabled and active:  YES 
    * Running as a Xen PV DomU:  NO 
    > STATUS:  NOT VULNERABLE  (Mitigation: PTI)
    
    A false sense of security is worse than no security at all, see --disclaimer
    

The script also supports a mode which outputs the result as JSON, so that it can easily be parsed by any compliance or monitoring tool:

    # spectre-meltdown-checker --batch json 2>/dev/null | jq
    [
      {
        "NAME": "SPECTRE VARIANT 1",
        "CVE": "CVE-2017-5753",
        "VULNERABLE": true,
        "INFOS": "Vulnerable"
      },
      {
        "NAME": "SPECTRE VARIANT 2",
        "CVE": "CVE-2017-5715",
        "VULNERABLE": false,
        "INFOS": "Mitigation: Full generic retpoline"
      },
      {
        "NAME": "MELTDOWN",
        "CVE": "CVE-2017-5754",
        "VULNERABLE": false,
        "INFOS": "Mitigation: PTI"
      }
    ]
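As a minimal sketch of how such a report could be consumed by another tool, the following Python snippet extracts the CVEs that are still flagged as vulnerable. It assumes the JSON structure shown above (a list of objects with the keys NAME, CVE, VULNERABLE and INFOS). The function name vulnerable_cves is hypothetical, and the sample string is simply the output from my laptop.

```python
import json


def vulnerable_cves(report_json):
    """Return the CVE identifiers flagged as vulnerable in the JSON report."""
    return [entry["CVE"] for entry in json.loads(report_json) if entry["VULNERABLE"]]


# The report shown above, as it would be read from the script's standard output.
sample_report = """
[
  {"NAME": "SPECTRE VARIANT 1", "CVE": "CVE-2017-5753",
   "VULNERABLE": true, "INFOS": "Vulnerable"},
  {"NAME": "SPECTRE VARIANT 2", "CVE": "CVE-2017-5715",
   "VULNERABLE": false, "INFOS": "Mitigation: Full generic retpoline"},
  {"NAME": "MELTDOWN", "CVE": "CVE-2017-5754",
   "VULNERABLE": false, "INFOS": "Mitigation: PTI"}
]
"""

print(vulnerable_cves(sample_report))
```

For the report above this prints a list containing only CVE-2017-5753.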
    

For those who are (still) using a Nagios-compatible monitoring system, spectre-meltdown-checker can also be run as an NRPE check:

    # spectre-meltdown-checker --batch nrpe 2>/dev/null ; echo $?
    Vulnerable: CVE-2017-5753
    2
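The exit code 2 follows the usual Nagios plugin convention, where 0 means OK and 2 means CRITICAL. To make the relation between the report entries and the check result explicit, here is a small Python sketch that derives such a result from already parsed entries. This is not how the script implements it internally. The nrpe_result function and the text of the OK message are my own assumptions.

```python
# Nagios plugin return codes used in this sketch.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2


def nrpe_result(entries):
    """Map report entries to a (return code, status line) pair."""
    vulnerable = [entry["CVE"] for entry in entries if entry["VULNERABLE"]]
    if vulnerable:
        return NAGIOS_CRITICAL, "Vulnerable: " + " ".join(vulnerable)
    # Assumption: the wording of the OK message is made up for this example.
    return NAGIOS_OK, "OK"


entries = [
    {"CVE": "CVE-2017-5753", "VULNERABLE": True},
    {"CVE": "CVE-2017-5715", "VULNERABLE": False},
    {"CVE": "CVE-2017-5754", "VULNERABLE": False},
]

code, message = nrpe_result(entries)
print(message)
print(code)
```

For the entries from my laptop this reproduces the status line and return code shown above.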
    

I just e-mailed Stéphane and he will soon release version 0.35 with many new features and fixes. As soon as it is released I’ll submit a package update, so that you’re always up to date with the latest developments.