Feb 242020
 

This is the second part of my field report about installing the oVirt 4.4-alpha release on CentOS 8 in a hyperconverged setup. In the first part I was focusing on setting up the GlusterFS storage cluster and now I’m going to describe my experience of the self-hosted engine installation.

If you think about repeating this installation on your hardware please let me remind you: This software is currently in alpha status. This means there are likely still many bugs and rough edges and if you succeed to install it successfully there is no guarantee for updates not breaking everything again. Please don’t try this anywhere close to production systems or data. I won’t be able to assist you in any way if things turn out badly.

Cockpit Hosted-Engine Wizard

Before we can start installing the self-hosted engine, we need to install a few more packages:

# dnf install ovirt-engine-appliance vdsm-gluster

Similar to the GlusterFS setup, also the hosted engine setup can be done from the Cockpit Web interface:

The wizard is also here pretty self-explanatory. There are a few options missing in Web-UI compared to the commandline installer (hosted-engine --deploy) e.g. you cannot customize the name of the libvirt domain which is called ‘HostedEngine’ by default. You must give the common details such as hostname, VM resources, some network settings, credentials for the VM and oVirt and that’s pretty much it:

Before you start deploying the VM there is also a quick summary of the settings and then an answer file will be generated. While the GlusterFS setup created a regular Ansible inventory the hosted engine setup has its own INI-format. It’s a useful feature that even when the deployment aborts, it can always be restarted from the Web interface without the need to fill in the form again and again. Indeed, I used this to my advantage a lot because it took me at least 20 attempts before the hosted engine VM was setup successfully.

Troubleshooting hosted-engine issues

Once VM deployment was running I found that the status output in the Cockpit Web interface heavily resembled Ansible output. It seems that a big part of the deployment code in the hosted-engine tool was re-implemented now using the ovirt-ansible-hosted-engine-setup Ansible roles in the background. If you’re familiar with Ansible this definitely simplifies troubleshooting and also allows a better understanding what is going on. Unfortunately there is still a layer of hosted-engine code above Ansible so that I couldn’t figure out, if it’s possible to run a playbook from the shell that would do the same setup.

Obviously it didn’t take long for an issue to pop-up:

  • The first error was that Ansible couldn’t connect to the hosted-engine VM that was freshly created from the oVirt appliance disk image. The error output in the Web interface is rather limited but also here a log file exists that can be found at a path like /var/log/ovirt-engine/setup/ovirt-engine-setup-20200218154608-rtt3b7.log. In the log file I found:
    2020-02-18 15:32:45,938+0100 DEBUG ansible on_any args localhostTASK: ovirt.hosted_engine_setup : Wait for the local VM kwargs 
    2020-02-18 15:35:52,816+0100 ERROR ansible failed {
        "ansible_host": "localhost",
        "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
        "ansible_result": {
            "_ansible_delegated_vars": {
                "ansible_host": "ovirt.oasis.home"
            },
            "_ansible_no_log": false,
            "changed": false,
            "elapsed": 185,
            "msg": "timed out waiting for ping module test success: Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this.  Please add this host's fingerprint to your know
    n_hosts file to manage this host."
        },
        "ansible_task": "Wait for the local VM",
        "ansible_type": "task",
        "status": "FAILED",
        "task_duration": 187
    }

    A manual SSH login with the root account on the VM was possible after accepting the fingerprint. Maybe this is still a bug or I missed a setting somewhere, but the easiest way to solve this was to create a ~root/.ssh/config file on the hypervisor host with the following content. The hostname is the hosted-engine FQDN:

    Host ovirt.oasis.home
        StrictHostKeyChecking accept-new
    

    Each installation attempt will make sure that the previous host key is deleted from the known_hosts file so no need to worry about changing keys on multiple installation tries. The deployment could simply be restarted by pressing the “Prepare VM” button once again.

  • During the next run the connection to the hosted-engine VM succeeded and it nearly completed all of the setup task within the VM but then failed when trying to restart the ovirt-engine-dwhd service:
    2020-02-18 15:48:45,963+0100 INFO otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.service service._closeup:52 Starting dwh service
    2020-02-18 15:48:45,964+0100 DEBUG otopi.plugins.otopi.services.systemd systemd.state:170 starting service ovirt-engine-dwhd
    2020-02-18 15:48:45,965+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:813 execute: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service'), executable='None', cwd='None', env=None
    2020-02-18 15:48:46,005+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service'), rc=1
    2020-02-18 15:48:46,006+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service') stdout:
    
    
    2020-02-18 15:48:46,006+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-engine-dwhd.service') stderr:
    Job for ovirt-engine-dwhd.service failed because the control process exited with error code. See "systemctl status ovirt-engine-dwhd.service" and "journalctl -xe" for details.
    
    2020-02-18 15:48:46,007+0100 DEBUG otopi.context context._executeMethod:145 method exception
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
        method['method']()
      File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine-dwh/core/service.py", line 55, in _closeup
        state=True,
      File "/usr/share/otopi/plugins/otopi/services/systemd.py", line 181, in state
        service=name,
    RuntimeError: Failed to start service 'ovirt-engine-dwhd'
    2020-02-18 15:48:46,008+0100 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed to start service 'ovirt-engine-dwhd'
    2020-02-18 15:48:46,009+0100 DEBUG otopi.context context.dumpEnvironment:765 ENVIRONMENT DUMP - BEGIN
    2020-02-18 15:48:46,010+0100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/error=bool:'True'
    2020-02-18 15:48:46,010+0100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/exceptionInfo=list:'[(, RuntimeError("Failed to start service 'ovirt-engine-dwhd'",), )]'
    2020-02-18 15:48:46,012+0100 DEBUG otopi.context context.dumpEnvironment:779 ENVIRONMENT DUMP - END
    

    Fortunately I was able to login to the hosted-engine VM and found the following blunt error:

    -- Unit ovirt-engine-dwhd.service has begun starting up.
    Feb 18 15:48:46 ovirt.oasis.home systemd[30553]: Failed at step EXEC spawning /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py: Permission denied
    -- Subject: Process /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py could not be executed
    

    Indeed, the referenced script was not marked executable. Fixing it manually and restarting the service showed that this would succeed. But there is one problem. This change is not persisted. On the next deployment run, the hosted-engine VM will be deleted and re-created again. When searching for a nicer solution I found that this bug is actually already fixed in the latest release of ovirt-engine-dwh-4.4.0-1.el8.noarch.rpm but the appliance image (ovirt-engine-appliance-4.4-20200212182535.1.el8.x86_64) was only including ovirt-engine-dwh-4.4.0-0.0.master.20200206083940.el7.noarch and there is no newer appliance image. That’s part of the experience when trying alpha releases but it’s not a blocker. Eventually I found that there is a directory where you can place an Ansible tasks file which will be executed in the hosted-engine VM before the setup is run. So I created the file hooks/enginevm_before_engine_setup/yum_update.yml in the /usr/share/ansible/roles/ovirt.hosted_engine_setup/ directory with the following content:

    ---
    - name: Update all packages
      package:
      name: '*'
      state: latest
    

    From then on each deployment attempt was first updating the packages including ‘ovirt-engine-dwh’ in the VM before the hosted-engine would continue to configure and restart the service.

  • The next issue was suddenly appearing when I tried to re-run the deployment. The Ansible code would fail early with an error that it cannot update the routing rules on the hypervisor:
    2020-02-18 16:17:38,330+0100 DEBUG ansible on_any args  kwargs 
    2020-02-18 16:17:38,664+0100 INFO ansible task start {'status': 'OK', 'ansible_type': 'task', 'ansible_playbook': '/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml', 'ansible_task': 'ovirt.hosted_engine_setup : Add IPv4 outbo
    und route rules'}
    2020-02-18 16:17:38,664+0100 DEBUG ansible on_any args TASK: ovirt.hosted_engine_setup : Add IPv4 outbound route rules kwargs is_conditional:False 
    2020-02-18 16:17:38,665+0100 DEBUG ansible on_any args localhostTASK: ovirt.hosted_engine_setup : Add IPv4 outbound route rules kwargs 
    2020-02-18 16:17:39,214+0100 DEBUG var changed: host "localhost" var "result" type "" value: "{
        "changed": true,
        "cmd": [
            "ip",
            "rule",
            "add",
            "from",
            "192.168.222.1/24",
            "priority",
            "101",
            "table",
            "main"
        ],
        "delta": "0:00:00.002805",
        "end": "2020-02-18 16:17:38.875350",
        "failed": true,
        "msg": "non-zero return code",
        "rc": 2,
        "start": "2020-02-18 16:17:38.872545",
        "stderr": "RTNETLINK answers: File exists",
        "stderr_lines": [
            "RTNETLINK answers: File exists"
        ],
        "stdout": "",
        "stdout_lines": []
    }"
    

    So I was checking the rules manually and yes, they were already there. I thought that’s an easy case, that must be a simple idempotency issue in the Ansible code. But when looking up the code there was already a condition in place that should prevent this case from happening. Even after multiple attempts to debug this code, I couldn’t find the reason why this check is failing. Eventually I found the GitHub pull request #96 where someone was already refactoring this code with a commit message “Hardening existing ruleset lookup”. So I forward-ported the patch to the release 1.0.35 which fixed the problem. The PR is already open for more than a year with no indication that it would be merged soon, so I still reported the issue in ovirt-ansible-hosted-engine-setup #289.
    I only found out about ovirt-hosted-engine-cleanup a few hours later, so with its help you can easily work-around this issue by cleaning up the installation before another retry.

  • Another though to debug but easy to fix issue popped up after the hosted-engine VM setup completed and the Ansible role was checking the oVirt events for errors:
    2020-02-19 01:46:53,723+0100 ERROR ansible failed {
        "ansible_host": "localhost",
        "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
        "ansible_result": {
            "_ansible_no_log": false,
            "changed": false,
            "msg": "The host has been set in non_operational status, deployment errors:   code 4035: Gluster command [] failed on server .,    code 10802: VDSM loki.oasis.home command GlusterServersListVDS failed: The method does not exist or is not available: {'method': 'GlusterHost.list'},   fix accordingly and re-deploy."
        },
        "ansible_task": "Fail with error description",
        "ansible_type": "task",
        "status": "FAILED",
        "task_duration": 0
    }
    

    This error is not in the Ansible code anymore but the engine itself fails to query the GlusterFS status on the hypervisor. This is done via VDSM, a daemon that runs on each oVirt hypervisor and manages the hypervisor configuration and status. Maybe the VDSM log (/var/log/vdsm/vdsm.log) reveals more insights:

    2020-02-19 01:46:45,786+0100 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities succeeded in 3.33 seconds (__init__:312)
    2020-02-19 01:46:45,981+0100 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call GlusterHost.list failed (error -32601) in 0.00 seconds (__init__:312)
    

    Seems that regular RPC calls to VDSM are successful but only the GlusterFS query is failing. I tracked down the source code of this implementation and found that there is a CLI command that can be used to run this query:

    # vdsm-client --gluster-enabled -h
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/vdsmclient/client.py", line 276, in find_schema
        with_gluster=gluster_enabled)
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 156, in vdsm_api
        return Schema(schema_types, strict_mode, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 142, in __init__
        with io.open(schema_type.path(), 'rb') as f:
      File "/usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py", line 95, in path
        ", ".join(potential_paths))
    vdsm.api.vdsmapi.SchemaNotFound: Unable to find API schema file, tried: /usr/lib/python3.6/site-packages/vdsm/api/vdsm-api-gluster.pickle, /usr/lib/python3.6/site-packages/vdsm/api/../rpc/vdsm-api-gluster.pickle

    Ah, that’s better. I love such error messages. Thanks to that it was not so hard to find, that I actually overlooked to install the vdsm-gluster package on the hypervisor.

That’s it after that the deployment completed successfully:

And finally a screenshot of the oVirt 4.4-alpha administration console. Yes, it works:

Conclusion

At the end the of the day most of the issues happened because I was not very familiar with the setup procedure and at the same time refused to follow any setup instructions for an older release. There were a minor bug with the ovirt-engine-dwh restart issue, that was already fixed upstream but didn’t made it yet into the hosted-engine appliance image. Something that is expected in an alpha release.

I also quickly setup some VMs to test the basic functionality of oVirt and couldn’t find any major issues so far. I guess most people using oVirt are much more experienced with it than me anyway, so there shouldn’t be any concerns in trying oVirt 4.4-alpha yourself. To me that was an interesting experience and I’m very happy about the Ansible integration that this project is pushing. It was also a nice experience to use Cockpit and I believe that’s definitely something that makes this product more appealing to setup and use for a wide range of IT professionals. As long as it can be done via command line too, I’ll be happy.

Feb 232020
 

For a while I had a oVirt server in a hyperconverged setup which means that the hypervisor was also running a GlusterFS storage server and that the oVirt management virtual machine was inside the oVirt cluster (self-hosted engine). On top of oVirt I was running a OKD cluster with a containerized GlusterFS cluster that could be used for persistent volumes by the container workload. All of this was running on a single hypervisor with a single SSD which unfortunately gave up on me after a few years in operation. Recently I stumbled upon the oVirt 4.4-alpha. Next to initial support for running it on CentOS 8 also a proper support for ignition that is used by (Fedora and Red Hat) CoreOS and therefore OpenShift/OKD 4 attracted my attention. Why not give it a try and see how far I come…? After a few hours of tinkering I succeeded the installation:

And now I’m going to describe what was necessary to do so. Not everything that I’ll mention is brand new. My past experience of setting up such a system is more or less based on the guide Up and Running with oVirt 4.0 and Gluster Storage that I was following a few years ago, so I’m also highlighting a few things that have changed since then.

If you think about repeating this installation on your hardware please let me remind you: This software is currently in alpha status. This means there are likely still many bugs and rough edges and if you succeed to install it successfully there is no guarantee for updates not breaking everything again. Please don’t try this anywhere close to production systems or data. I won’t be able to assist you in any way if things turn out badly.

I split this field report into two parts, the first one discussing the GlusterFS storage setup and the second one explaining my challenges when setting up the oVirt self-hosted-engine.

Hypervisor disk layout

I was using a minimal install of a CentOS 8 on a bare-metal server. Make sure you either have two disks or create a separate partition when installing CentOS so that the GlusterFS storage can life on its own block device. My disk layout looks something like this:

[root@loki ~]# lsblk
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                       8:0    0 931.5G  0 disk 
├─sda1                    8:1    0   100M  0 part 
├─sda2                    8:2    0   256M  0 part /boot
├─sda3                    8:3    0    45G  0 part 
│ ├─vg_loki-slash       253:0    0    10G  0 lvm  /
│ ├─vg_loki-swap        253:1    0     4G  0 lvm  [SWAP]
│ ├─vg_loki-var         253:2    0     5G  0 lvm  /var
│ ├─vg_loki-home        253:3    0     2G  0 lvm  /home
│ └─vg_loki-log         253:4    0     2G  0 lvm  /var/log
└─sda4                    8:4    0   500G  0 part

The CentOS installation is placed on a LVM volume group (vg_loki on /dev/sda3) with some individual volumes for dedicated mount points and then there is /dev/sda4 which is an empty disk partition that will be used by GlusterFS later. If you wonder how to do such a setup with the CentOS 8 installer… I don’t know. I tried for a moment to somehow configure this setup in the installer, but eventually I gave up, manually partitioned the disk and created the volume group and then used the pre-generated setup in the installer which perfectly detected what I was doing on the shell.

Install software requirements

First you need to enable the oVirt 4.4-alpha package repository by installing the corresponding release package:

# dnf install https://resources.ovirt.org/pub/yum-repo/ovirt-release44-pre.rpm

Recently a lot of effort was invested into incorporating the oVirt setup into the Cockpit Web interface and it’s now even the recommended installation method for the downstream Red Hat Virtualization (RHV). When I setup my previous hyperconverged oVirt 4.0 this wasn’t available back then so of course I’m going to try this. To setup Cockpit and the oVirt integration the following packages need to be installed:

# dnf install cockpit cockpit-ovirt-dashboard glusterfs-server

After logging into Cockpit that runs on the hypervisor host on port 9090 there is a dedicated oVirt tab with two entries:

If you continue with the hyperconverged setup, there now even is a dedicated option to install a single node only GlusterFS “cluster”!

This was a big positive surprise to me because the previously used gdeploy tool was insisting on a three node GlusterFS cluster years ago.

Running the GlusterFS wizard

After this revelation the GlusterFS setup is supposedly straight forward. Still I ran into some issues that I could probably have avoided by carefully reading the installation instructions for oVirt 4.3. Nonetheless I’m quickly going to mention a few points here in case other people are struggling with the same and search the Web for these error messages:

  • On the first screen I had an error that the setup cannot proceed because "gluster-ansible-roles is not installed on Host":

    However, the related package including the Ansible roles from gluster-ansible was clearly there:

    # rpm -q gluster-ansible-roles
    gluster-ansible-roles-1.0.5-7.el8.noarch
    

    Eventually I found that my sudo rules for my unprivileged user account are not properly picked up by Cockpit so I restarted the setup by using the root account which then successfully detected the Ansible roles.

  • When selecting the brick setup the “Raid Type” must be changed to “JBOD” and the device name that was reserved for the GlusterFS storage must be entered:Eventually the wizard will create a dedicated LVM volume group for GlusterFS bricks and a LVM thin-pool volume if you wish so:
    # lvs gluster_vg_sda4
      LV                               VG              Attr       LSize    Pool                             Origin Data%  Meta%  Move Log Cpy%Sync Convert
      gluster_lv_data                  gluster_vg_sda4 Vwi-aot---  125.00g gluster_thinpool_gluster_vg_sda4        3.35                                   
      gluster_lv_engine                gluster_vg_sda4 -wi-ao----   75.00g                                                                                
      gluster_lv_vmstore               gluster_vg_sda4 Vwi-aot---  300.00g gluster_thinpool_gluster_vg_sda4        0.05                                   
      gluster_thinpool_gluster_vg_sda4 gluster_vg_sda4 twi-aot--- <421.00g                                         1.03   0.84
  • Before the Ansible playbook that will setup the storage is executed the generated inventory file will be displayed. Because I'm very familiar with Ansible anyway, I love this part! It also makes it easier to understand which playbook command to run when troubleshooting something on the command line where the Cockpit Web interface doesn't make sense to be involved anymore:
    The "Enable Debug Logging" is definitely worth to be enabled especially if you run this the first time on your server. It gives you much more insight in what Ansible is actually doing on your hypervisor.
  • When finally running the "Deploy" step it didn't take long to fail the playbook with the following error:
    TASK [Check if provided hostnames are valid] ***********************************
    task path: /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml:29
    fatal: [loki.oasis.home]: FAILED! => {"msg": "The conditional check 'result.results[0]['stdout_lines'] > 0' failed. The error was: Unexpected templating type error occurred on ({% if result.results[0]['stdout_lines'] > 0 %} True {% else %} False {% endif %}): '>' not supported between instances of 'list' and 'int'"}
    

    The involved code is untouched since a long time. So I reported the issue at RHBZ #1806298. I'm not sure how this could ever work but the fix is also trivial. After changing the following lines this task was successfully passing:

    --- /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml.orig       2020-02-18 14:48:33.678471259 +0100
    +++ /usr/share/cockpit/ovirt-dashboard/ansible/hc_wizard.yml    2020-02-18 14:48:55.810456470 +0100
    @@ -30,7 +30,7 @@
           assert:
             that:
               - "result.results[0]['rc'] == 0"
    -          - "result.results[0]['stdout_lines'] > 0"
    +          - "result.results[0]['stdout_lines'] | length > 0"
             fail_msg: "The given hostname is not valid FQDN"
           when: gluster_features_fqdn_check | default(true)

    Btw. the error message can not only be seen in the Web UI but is also written to a log file: /var/log/cockpit/ovirt-dashboard/gluster-deployment.log

  • There was another issue that cost me more effort to track down. The deployment playbook was failing to add the firewalld rules for GlusterFS:
    TASK [gluster.infra/roles/firewall_config : Add/Delete services to firewalld rules] ***
    task path: /etc/ansible/roles/gluster.infra/roles/firewall_config/tasks/main.yml:24
    failed: [loki.oasis.home] (item=glusterfs) => {"ansible_loop_var": "item", "changed": false, "item": "glusterfs", "msg": "ERROR: Exception caught: org.fedoraproject.FirewallD1.Exception: INVALID_SERVICE: 'glusterfs' not among existing services Permanent and Non-Permanent(immediate) operation, Services are defined by port/tcp relationship and named as they are in /etc/services (on most systems)"}

    Indeed firewalld doesn't know about a 'glusterfs' service:

    # rpm -q firewalld
    firewalld-0.7.0-5.el8.noarch
    # firewall-cmd --get-services | grep glusterfs
    #
    

    Is this an old version? From where do I get the 'glusterfs' firewalld service definition? The solution to this is as simple as embarrassing. I found that the service definition is packaged as part of the glusterfs-server RPM which was still missing on my server. After installing it also this issue was solved.

Eventually the deployment succeeded and the CentOS 8 host was converted into a single node GlusterFS storage cluster:

# gluster volume status
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/data/
data                                        49153     0          Y       23806
 
Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/engin
e/engine                                    49152     0          Y       23527
 
Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick loki.oasis.home:/gluster_bricks/vmsto
re/vmstore                                  49154     0          Y       24052
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks

Conclusion

I'm super pleased with the installation experience so far. The new GlusterFS Ansible roles had no issues setting up the bricks and volumes. The Cockpit Web-GUI was easy to use and always clearly communicated what was going on. There is now a supported configuration of a one node hyperconverged oVirt setup which makes me happy too. Kudos to everyone involved with this, great work!

Now let's continue with the oVirt self-hosted engine setup.