Build overcloud image for rhel8 fails sometimes on in_target.d/post-install.d/51-enable-network-service

Bug #1853028 reported by Sagi (Sergey) Shnaidman
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

In almost half of the runs, building the overcloud qcow2 image for RHEL-8 fails at this step:
2019-11-18 12:31:42.708 | dib-run-parts Running /tmp/in_target.d/post-install.d/51-enable-network-service
2019-11-18 12:31:42.710 | + set -o pipefail
2019-11-18 12:31:42.710 | + chkconfig network on
2019-11-18 12:31:42.712 | failed to glob pattern /etc/rc0.d/[SK][0-9][0-9]network: No such file or directory

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-rhel-8-buildimage-overcloud-full-master/813843a/build.log

When the build passes, this element completes, for example:
2019-11-18 00:30:36.606 | dib-run-parts Running /tmp/in_target.d/post-install.d/51-enable-network-service
2019-11-18 00:30:36.608 | + set -o pipefail
2019-11-18 00:30:36.608 | + chkconfig network on
2019-11-18 00:30:36.611 | dib-run-parts 51-enable-network-service completed

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-rhel-8-buildimage-overcloud-full-master/3fe0b42/build.log
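A minimal sketch of how the failing element could be picked out of a build.log automatically, assuming the log format shown above (timestamp, ` | `, message). The function name `find_glob_failures` is hypothetical and not part of diskimage-builder.

```python
import re

# Matches lines such as:
# "... | dib-run-parts Running /tmp/in_target.d/post-install.d/51-enable-network-service"
RUNNING_RE = re.compile(r"\| dib-run-parts Running (\S+)")
GLOB_ERROR = "failed to glob pattern"


def find_glob_failures(log_text):
    """Return the dib-run-parts scripts that were running when the glob error appeared."""
    current = None  # script most recently announced by dib-run-parts
    failed = []
    for line in log_text.splitlines():
        match = RUNNING_RE.search(line)
        if match:
            current = match.group(1)
        elif GLOB_ERROR in line and current is not None:
            failed.append(current)
    return failed
```

Run against the failing log excerpt above, this would report `/tmp/in_target.d/post-install.d/51-enable-network-service`; against the passing excerpt it returns an empty list.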

Alex Schultz (alex-schultz) wrote :

this was the error previously when we were using systemd, so same problem i guess

tags: added: promotion-blocker
wes hayutin (weshayutin) wrote :
Marios Andreou (marios-b) wrote :

o/ I am ruck and came here from a new/untriaged bug on the prod chain, but checking https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-rhel-8-buildimage-overcloud-full-master the job is green, with the last failure on the 19th.

Do we have recent examples of this? Otherwise, why did we add the promotion-blocker tag to it yesterday?

Marios Andreou (marios-b) wrote :

I think this is a duplicate of https://bugs.launchpad.net/tripleo/+bug/1851274, or at least it is definitely related to it.

The error in that bug was meant to be fixed by using chkconfig instead of systemctl enable: https://opendev.org/openstack/tripleo-puppet-elements/commit/63859da17b4f4e25b4817950544a5daf50270e8a

However, the same error still appears. The trace now looks like:

2019-11-26 01:55:15.886 | dib-run-parts Running /tmp/in_target.d/post-install.d/51-enable-network-service
2019-11-26 01:55:15.888 | + set -o pipefail
2019-11-26 01:55:15.888 | + chkconfig network on
2019-11-26 01:55:15.889 | failed to glob pattern /etc/rc0.d/[SK][0-9][0-9]network: No such file or directory

(Before, it was the same but with systemctl enable instead of chkconfig; see https://bugs.launchpad.net/tripleo/+bug/1851274 )

chandan kumar (chkumar246) wrote :

We are seeing the same error in the fs01 train RHEL 8 job: http://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-train/45a1c67/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-12-08 18:28:37 | MSG:
2019-12-08 18:28:37 |
2019-12-08 18:28:37 | Unable to enable service network: network.service is not a native service, redirecting to systemd-sysv-install.
2019-12-08 18:28:37 | Executing: /usr/lib/systemd/systemd-sysv-install enable network
2019-12-08 18:28:37 | failed to glob pattern /etc/rc0.d/[SK][0-9][0-9]network: No such file or directory
2019-12-08 18:28:37 |
2019-12-08 18:28:37 | fatal: [overcloud-novacompute-0]: FAILED! => {
2019-12-08 18:28:37 | "changed": false
2019-12-08 18:28:37 | }
2019-12-08 18:28:37 |

Re-running using rdo-jobs here https://review.rdoproject.org/r/#/c/24027/

chandan kumar (chkumar246) wrote :

@marios, it seems to be a real issue, so I filed a new bug: https://bugs.launchpad.net/tripleo/+bug/1855706

chandan kumar (chkumar246) wrote :

In the fs01 train job, we consume the overcloud image built in a separate job.

chandan kumar (chkumar246) wrote :

@marios, looking deeper into the bug, it appears fs01 train RHEL 8 is hitting the same issue, so I have closed the above bug as a duplicate.

Alex Schultz (alex-schultz) wrote :

seems like a rhel bug somewhere

chandan kumar (chkumar246) wrote :

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-rhel-8-buildimage-overcloud-full-master/b7e6593/build.log

From this log:
2019-12-12 00:30:29.267 | > Installing : network-scripts-10.00.4-1.el8.x86_64 13/168
2019-12-12 00:30:29.267 | > Running scriptlet: network-scripts-10.00.4-1.el8.x86_64 13/168
2019-12-12 00:30:29.267 | > failed to glob pattern /etc/rc0.d/[SK][0-9][0-9]network: No such file or directory

The problem is related to the network-scripts package itself.

chandan kumar (chkumar246) wrote :

network-scripts comes from initscripts:
Available Packages
Name : network-scripts
Version : 10.02
Release : 2.fc31
Architecture : x86_64
Size : 62 k
Source : initscripts-10.02-2.fc31.src.rpm
Repository : fedora
Summary : Legacy scripts for manipulating of network devices
URL : https://github.com/fedora-sysv/initscripts
License : GPLv2
Description : This package contains the legacy scripts for activating & deactivating of most
             : network interfaces. It also provides a legacy version of 'network' service.
             :
             : The 'network' service is enabled by default after installation of this package,
             : and if the network-scripts are installed alongside NetworkManager, then the
             : ifup/ifdown commands from network-scripts take precedence over the ones provided
             : by NetworkManager.
             :
             : If user has both network-scripts & NetworkManager installed, and wishes to
             : use ifup/ifdown from NetworkManager primarily, then they has to run command:
             : $ update-alternatives --config ifup
             :
             : Please note that running the command above will also disable the 'network'
             : service.

Marios Andreou (marios-b) wrote :

Thanks to chkumar++, this looks to be SELinux related. We have a green run in https://review.rdoproject.org/r/#/c/23919; it hasn't reported yet but will in a bit.

But that is a hack (https://review.opendev.org/#/c/698883/) just to prove it's SELinux. We need to work out a more permanent fix.

Alex Schultz (alex-schultz) wrote :

For history's sake, here is an analysis of the error that I did:

This seems to be an issue with the internals of chkconfig. We probably want to open a BZ but without a specific reproducer this might be hard to get addressed.

Here's where the error message from the logs is being printed:
https://github.com/fedora-sysv/chkconfig/blob/1.11/leveldb.c#L758-L786

Unfortunately chkconfig doesn't give much detail on what it's doing. In theory, when we first run this command there won't be any /etc/rc0.d/[SK][0-9][0-9]network files, as they need to be created by this call.
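To make the pattern in the error message concrete: `[SK][0-9][0-9]network` matches SysV-style start/kill link names such as `S10network` or `K90network`. The sketch below uses Python's `fnmatch` purely to illustrate the pattern; chkconfig itself is written in C, and the file names in the example are made up.

```python
from fnmatch import fnmatchcase

# Pattern taken from the error message: S or K, two digits, then the service name.
PATTERN = "[SK][0-9][0-9]network"


def matching_links(names):
    """Return the names that look like SysV start (S) or kill (K) links for 'network'."""
    return [name for name in names if fnmatchcase(name, PATTERN)]


# Hypothetical directory listing, for illustration only.
print(matching_links(["S10network", "K90network", "network", "S1network", "S10sshd"]))
# prints ['S10network', 'K90network']
```

Before the service has ever been enabled, no such links exist, so an empty result for this pattern would be the expected starting state.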

(for the record /usr/lib/systemd/systemd-sysv-install is a symlink to chkconfig)

Here is what I think the path through chkconfig via systemd-sysv-install looks like:

Here's where the state is set to 'on' in main
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L824-L844

setService is run here:
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L895

It parses the /etc/rc.d/init.d/network file here:
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L608
https://github.com/fedora-sysv/chkconfig/blob/1.11/leveldb.c#L357-L416

Then it starts looping through the levels to set the service:
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L635-L648

Part of doSetService is to find the service entries:
https://github.com/fedora-sysv/chkconfig/blob/1.11/leveldb.c#L913

I think this is where the code that prints this error is exercised. In theory it would return 1, and the calling function would interpret this as true (because C treats any nonzero value as true). The calling function would then try to create the symlinks and would only return 1 after printing a 'failed to make symlink' message, which we don't see.

https://github.com/fedora-sysv/chkconfig/blob/1.11/leveldb.c#L925-L931

So I'm not seeing where doSetService(...) would return 1 here:
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L648-L653

If that returned 1, then the program would exit with 1
https://github.com/fedora-sysv/chkconfig/blob/1.11/chkconfig.c#L895

One possible solution is to inject a chkconfig package with more verbose information so we could track down what's happening. That's probably the best way to try and continue to troubleshoot this issue.

Marios Andreou (marios-b) wrote :

update before i go..

we discussed with chkumar and weshay on a call today

weshay checked the rhel8 guest image and confirmed it has SELinux set to enforcing.

The 8.1 image is permissive; however, there was a problem with rhui/hosts.

I downloaded it and used virt-customize to update the image; attaching the log from the final debug run here.

For the record, I ran these:

        * [m@192 rhel8.1guest]$ virt-customize -v -x -a rhel-8.1-x86_64-kvm.qcow2 --run-command "yum remove -y rdo-rhui-2.2-1"

        * [m@192 rhel8.1guest]$ virt-customize -v -x -a rhel-8.1-x86_64-kvm.qcow2 --run-command "yum install -y http://file.rdu.redhat.com/~apevec/OSP/rdo-rhui-2.2-1.noarch.rpm"

        * [m@192 rhel8.1guest]$ virt-customize -v -x -a rhel-8.1-x86_64-kvm.qcow2 --run-command "echo '38.145.32.241 rhui-cds' >> /etc/hosts"
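The three customization steps above could be scripted along these lines. This is only a sketch: the helper name `virt_customize_argv` is hypothetical, the image name and commands are copied from the comment above, and the script only prints the command lines rather than running them.

```python
import shlex

# Image file name as used in the commands above.
IMAGE = "rhel-8.1-x86_64-kvm.qcow2"

# Commands run inside the guest image, in the order given above.
COMMANDS = [
    "yum remove -y rdo-rhui-2.2-1",
    "yum install -y http://file.rdu.redhat.com/~apevec/OSP/rdo-rhui-2.2-1.noarch.rpm",
    "echo '38.145.32.241 rhui-cds' >> /etc/hosts",
]


def virt_customize_argv(image, command):
    """Build the argument list for one virt-customize --run-command invocation."""
    return ["virt-customize", "-v", "-x", "-a", image, "--run-command", command]


for command in COMMANDS:
    # shlex.join quotes each argument so the printed line is safe to paste into a shell.
    print(shlex.join(virt_customize_argv(IMAGE, command)))
```

To actually execute a step, the same list could be passed to `subprocess.run(argv, check=True)` on a host with libguestfs tools installed.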

Then we switch to the new image with "Switch to rhel8.1 guest image for rhel image build base job": https://review.rdoproject.org/r/#/c/24197/

Let's see how it goes with recheck @ https://review.rdoproject.org/r/#/c/23919/

revisit tomorrow

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Chandan opened a bug in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1784001
bugzilla.redhat.com bug 1784001 in chkconfig "chkconfig network on returing failed to glob pattern /etc/rc0.d/[SK][0-9][0-9]network" [Urgent,New] - Assigned to lnykryn

I've updated it with the needed info (strace, logs, etc.). From what we see at the moment, it is unlikely to be an SELinux issue.

Marios Andreou (marios-b) wrote :

With the updated guest image, and now that we have switched to it with https://review.rdoproject.org/r/#/c/24197/, this looks to be solved.

Got 4 green runs in a row @ https://review.rdoproject.org/r/#/c/23919/

Current openstack-periodic-master pipeline run is also green

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-rhel-8-buildimage-overcloud-full-master/351b9a8/
        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-rhel-8-buildimage-ironic-python-agent-master/3892976/

Marking Fix Released; please move it back if you disagree, thanks.

Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/699170
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=17282f8d60c2bc3e2336b473b69b5046fa8a8b29
Submitter: Zuul
Branch: master

commit 17282f8d60c2bc3e2336b473b69b5046fa8a8b29
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Mon Dec 16 14:05:38 2019 +0530

    Use osp_release flag to set selinux mode

    In Upstream and rdo-cloud tripleo ci jobs on RHEL & CentOS, we
    use selinux mode to permissive but currently it is harded for
    CentOS only.

    In Downstream jobs, we use enforcing mode. So instead of depending
    upon ansible_distribution, we can rely on osp_release to toggle
    selinux mode and will work for both centOS and RHEL.

    Related-Bug: 1853028

    Change-Id: I6a6449777ea28198002b8c028a345ab16b733901
    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>
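The toggle described in the commit message can be sketched roughly as follows. This is not the merged change (which lives in tripleo-quickstart-extras); it assumes, for illustration only, that `osp_release` is set in downstream jobs and unset elsewhere, and the function name `selinux_mode` is hypothetical.

```python
def selinux_mode(osp_release=None):
    """Pick an SELinux mode from osp_release instead of the distribution name.

    Assumption (not confirmed by the bug report): osp_release is only set
    for downstream jobs, which run enforcing; upstream and rdo-cloud jobs
    leave it unset and run permissive, on both CentOS and RHEL.
    """
    return "enforcing" if osp_release else "permissive"


print(selinux_mode())        # upstream / rdo-cloud job
print(selinux_mode("16"))    # hypothetical downstream release value
```

Keying the decision on the release flag rather than on `ansible_distribution` is what lets the same logic cover CentOS and RHEL.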
