tripleo

periodic centos8 Ussuri multinode minor update job fails: tderr": "Error: resource 'ip-192.168.24.16' is not running on any node

Bug #1890798 reported by Marios Andreou on 2020-08-07

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Unassigned	tripleo victoria-3 "tripleo victoria"

Bug Description

At [1][2][3] the periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussurijob fails during the Controller update with trace like:

* 2020-08-05 11:06:29 | <192.168.24.3> (0, b'\n{"cmd": "CLUSTER_NODE=$(crm_node -n)\\necho \\"Retrieving all the VIPs which are hosted on this node\\"\\nVIPS_TO_MOVE=$(crm_mon --as-xml | xmllint --xpath \'//resource[@resource_agent = \\"ocf::heartbeat:IPaddr2\\" and @role = \\"Started\\" and @managed = \\"true\\" and ./node[@name = \\"\'${CLUSTER_NODE}\'\\"]]/@id\' - | sed -e \'s/id=//g\' -e \'s/\\"//g\')\\nfor v in ${VIPS_TO_MOVE}; do\\n echo \\"Moving VIP $v on another node\\"\\n pcs resource move $v --wait=300\\ndone\\necho \\"Removing the location constraints that were created to move the VIPs\\"\\nfor v in ${VIPS_TO_MOVE}; do\\n echo \\"Removing location ban for VIP $v\\"\\n ban_id=$(cibadmin --query | xmllint --xpath \'string(//rsc_location[@rsc=\\"\'${v}\'\\" and @node=\\"\'${CLUSTER_NODE}\'\\" and @score=\\"-INFINITY\\"]/@id)\' -)\\n if [ -n \\"$ban_id\\" ]; then\\n pcs constraint remove ${ban_id}\\n else\\n echo \\"Could not retrieve and clear location constraint for VIP $v\\" 2>&1\\n fi\\ndone\\n", "stdout": "Retrieving all the VIPs which are hosted on this node\\nMoving VIP ip-192.168.24.16 on another node\\nWarning: Creating location constraint \'cli-ban-ip-192.168.24.16-on-node-0000333831\' with a score of -INFINITY for resource ip-192.168.24.16 on node-0000333831.\\n\\tThis will prevent ip-192.168.24.16 from running on node-0000333831 until the constraint is removed\\n\\tThis will be the case even if node-0000333831 is the last node in the cluster\\nRemoving the location constraints that were created to move the VIPs\\nRemoving location ban for VIP ip-192.168.24.16", "stderr": "Error: resource \'ip-192.168.24.16\' is not running on any node", "rc": 0, "start": "2020-08-05 11:06:25.880027", "end": "2020-08-05 11:06:29.046251", "delta": "0:00:03.166224", "changed": true, "invocation": {"module_args": {"_raw_params": "CLUSTER_NODE=$(crm_node -n)\\necho \\"Retrieving all the VIPs which are hosted on this node\\"\\nVIPS_TO_MOVE=$(crm_mon --as-xml | xmllint --xpath \'//resource[@resource_agent = \\"ocf::heartbeat:IPaddr2\\" and @role = \\"Started\\" and @managed = \\"true\\" and ./node[@name = \\"\'${CLUSTER_NODE}\'\\"]]/@id\' - | sed -e \'s/id=//g\' -e \'s/\\"//g\')\\nfor v in ${VIPS_TO_MOVE}; do\\n echo \\"Moving VIP $v on another node\\"\\n pcs resource move $v --wait=300\\ndone\\necho \\"Removing the location constraints that were created to move the VIPs\\"\\nfor v in ${VIPS_TO_MOVE}; do\\n echo \\"Removing location ban for VIP $v\\"\\n ban_id=$(cibadmin --query | xmllint --xpath \'string(//rsc_location[@rsc=\\"\'${v}\'\\" and @node=\\"\'${CLUSTER_NODE}\'\\" and @score=\\"-INFINITY\\"]/@id)\' -)\\n if [ -n \\"$ban_id\\" ]; then\\n pcs constraint remove ${ban_id}\\n else\\n echo \\"Could not retrieve and clear location constraint for VIP $v\\" 2>&1\\n fi\\ndone\\n", "_uses_shell": true, "warn": true, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
* 2020-08-05 11:06:29 | "stderr": "Error: resource 'ip-192.168.24.16' is not running on any node",

This does NOT happen in gates luckily appears to be periodic only gates green [4]

[1] https://logserver.rdoproject.org/90/28890/1/check/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/019fa3b/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/dba0e70/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/1442873/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz
[4] https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri&pipeline=gate

Tags:

Revision history for this message

chandan kumar (chkumar246) wrote on 2020-08-07:

By looking at the bottom of the logs:https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/1442873/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz

["mysql_init_bundle", "rabbitmq_init_bundle"], "executable": "podman"}}}\n', b'')
2020-08-05 11:28:12 | [ERROR]: Container(s) which failed to be created by podman_container module:
2020-08-05 11:28:12 | ['mysql_init_bundle']
2020-08-05 11:28:12 | [ERROR]: Container(s) which did not finish after 300 minutes:
2020-08-05 11:28:12 | ['mysql_init_bundle']
2020-08-05 11:28:12 | 2020-08-05 11:28:12.188714 | fa163ec9-c15d-a875-5d29-000000001fb7 | FATAL | Check containers status | node-0000333831 | error={
2020-08-05 11:28:12 | "changed": false,
2020-08-05 11:28:12 | "msg": "Failed container(s): ['mysql_init_bundle'], check logs in /var/log/containers/stdouts/"

Under https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/1442873/logs/subnode-1/var/log/extra/podman/containers/mysql_init_bundle/stdout.log.txt.gz there is no error

and https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/1442873/logs/subnode-1/var/log/extra/podman/containers/rabbitmq_init_bundle/stdout.log.txt.gz

Error: Facter: error while resolving custom fact "rabbitmq_nodename": undefined method `[]' for nil:NilClass not sure it is related

Revision history for this message

Damien Ciabrini (dciabrin) wrote on 2020-08-07:

The "undefined method" is usually a red herring, I believe Michele fixed it in Master, probably not backported yet.

The fact that mysq_init_bundle did not finished after 300minutes is usually indicative that it waited indefinitely for rabbitmq to start. I'm going to check the cluster logs to find out why.

Revision history for this message

Marios Andreou (marios-b) wrote on 2020-08-11:

this still blocks the Ussuri promotion to current-tripleo latest logs from yesterday there

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/653660a/logs/undercloud/home/zuul/overcloud_update_run_Controller.log.txt.gz

2020-08-10 22:59:57 | "stderr": "Error: resource 'ip-192.168.24.10' is not running on any node",

Revision history for this message

Damien Ciabrini (dciabrin) wrote on 2020-08-12:

Two notes to clarify the failure messages seen in this launchpad:

1. the "Error: resource \'ip-192.168.24.10\' not running on any node is caused by a task to migrates away the VIP hosted by the node being updated, to avoid a potential long service disruption caused by the stop ordering constraint in pacemaker.

We migrate the VIP away by banning the VIP locally, and lifting the ban once it's restarted elsewhere. However this assumes that the control plane is bigger than 1 node, so we need to fix that eventually.

2. here the fatal error during the update task seems to be that mysql_init_bundle failed to access mysql and made the update timed out.
When looking at mysql logs [1], I can see that mysql stopped an unexpected file permission error:

2020-08-10 23:06:01 0 [ERROR] InnoDB: Cannot open '/var/lib/mysql/ib_buffer_pool.incomplete' for writing: Permission denied
2020-08-10 23:06:02 0 [Note] InnoDB: Shutdown completed; log sequence number 4053481; transaction id 2693
2020-08-10 23:06:02 0 [ERROR] InnoDB: Operating system error number 13 in a file operation.
2020-08-10 23:06:02 0 [ERROR] InnoDB: The error means mysqld does not have the access rights to the directory.
2020-08-10 23:06:02 0 [ERROR] mysqld: Error on delete of './tc.log' (Errcode: 13 "Permission denied")
2020-08-10 23:06:02 0 [Note] /usr/libexec/mysqld: Shutdown complete

That is due to a recent change in upstream master now that kolla images are build with TCIB instead of kolla [2]. The UID/GID of service user has changed, so when this minor update job updates the container image to ussuri:current-tripleo [3], the id is changed back to kolla, and the container cannot restart.

Until https://review.opendev.org/#/c/745575/ is merged, this periodic job won't pass unless the initial container image used to deploy the overcloud changes.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/653660a/logs/subnode-1/var/log/containers/mysql/mysqld.log.txt.gz
[2] trunk.registry.rdoproject.org/tripleoussuri/centos-binary-mariadb:bb7c0317ff07c99e4bb86814b2beecc9
[3] docker.io/tripleou/centos-binary-mariadb:current-tripleo

Two notes to clarify the failure messages seen in this launchpad:

Until https://review.opendev.org/#/c/745575/ is merged, this periodic job won't pass unless the initial container image used to deploy the overcloud changes.

Revision history for this message

chandan kumar (chkumar246) wrote on 2020-08-17:

https://review.opendev.org/#/c/745575/ is proposed against master, but the failing job runs against stable/ussuri, Do we want to backport this patch to stable/ussuri?

Revision history for this message

Marios Andreou (marios-b) wrote on 2020-08-17:

@chandan most certainly but the master patch didn't merge yet

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-17: Related fix proposed to tripleo-common (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/746533

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-17: Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/746540

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-19: Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/745575
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=fb0ec2ffd0a99d441c361bb83472e48ead18d013
Submitter: Zuul
Branch: master

commit fb0ec2ffd0a99d441c361bb83472e48ead18d013
Author: Emilien Macchi <email address hidden>
Date: Mon Aug 10 14:12:53 2020 -0400

(backward compatibility) create kolla UIDs/GIDs for TCIB

    When upgrading container images from Kolla-based to TCIB-based, the
    config files can't be read if we don't apply the new chown values unless
    we re-use the same UIDs/GIDs as we had with Kolla; which is what we'll
    do for now until we figure out an easy way to update config permissions
    during the upgrade/update stage.

This introduces a new script, that will be injected in the base layer:

uid_gid_manage.sh

Usage:

./uid_gid_manage.sh qemu nova

It'll first create the qemu user/group, then nova user/group.
The format aims to be simple:

    It's also removing instances of usermod which aren't needed anymore;
    since we create the user / groups only in the images where they are
    needed; which wasn't the case for Kolla where all users/groups were
    created in the base image.

    Note: we create the user/group before installing packaging so if
    packaging creates directories and sets permissions, it'll be with the
    right UID/GID (unless packaging overrides UID/GID but rarely do it).

Related-Bug: #1890798

Change-Id: If3fa2ff34af42a7438c6dbf81dbcb0bddd63afa6

Revision history for this message

Marios Andreou (marios-b) wrote on 2020-08-19:

#10

still waiting for the patch to make it's way through gates on branches to train (ussuri is the one we are interested in for this particular bug)

master (merged) https://review.opendev.org/#/c/745575/
ussuri https://review.opendev.org/#/c/746533/
train https://review.opendev.org/#/c/746540/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-20: Related fix merged to tripleo-common (stable/ussuri)

#11

Reviewed: https://review.opendev.org/746533
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=70b1ca4b1cee3813ee49cd71d8e16b9337ad11fd
Submitter: Zuul
Branch: stable/ussuri

commit 70b1ca4b1cee3813ee49cd71d8e16b9337ad11fd
Author: Emilien Macchi <email address hidden>
Date: Mon Aug 10 14:12:53 2020 -0400

(backward compatibility) create kolla UIDs/GIDs for TCIB

This introduces a new script, that will be injected in the base layer:

uid_gid_manage.sh

Usage:

./uid_gid_manage.sh qemu nova

It'll first create the qemu user/group, then nova user/group.
The format aims to be simple:

Related-Bug: #1890798

Change-Id: If3fa2ff34af42a7438c6dbf81dbcb0bddd63afa6
(cherry picked from commit fb0ec2ffd0a99d441c361bb83472e48ead18d013)

tags:

added: in-stable-ussuri

Revision history for this message

chandan kumar (chkumar246) wrote on 2020-08-20:

#12

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-ussuri/6e2edd1/ has passed now. We will monitor it for few days. Then we can mark it to fix released.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-21: Related fix merged to tripleo-common (stable/train)

#13

Reviewed: https://review.opendev.org/746540
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5397e9eb4fcf2c95f50049995d433bca3ea3dcf9
Submitter: Zuul
Branch: stable/train

commit 5397e9eb4fcf2c95f50049995d433bca3ea3dcf9
Author: Emilien Macchi <email address hidden>
Date: Mon Aug 10 14:12:53 2020 -0400

(backward compatibility) create kolla UIDs/GIDs for TCIB

This introduces a new script, that will be injected in the base layer:

uid_gid_manage.sh

Usage:

./uid_gid_manage.sh qemu nova

It'll first create the qemu user/group, then nova user/group.
The format aims to be simple:

Related-Bug: #1890798

Change-Id: If3fa2ff34af42a7438c6dbf81dbcb0bddd63afa6
(cherry picked from commit fb0ec2ffd0a99d441c361bb83472e48ead18d013)

tags:

added: in-stable-train

Revision history for this message

Damien Ciabrini (dciabrin) wrote on 2020-08-22:

#14

As an update to my comment #4,

> We migrate the VIP away by banning the VIP locally, and lifting the ban once it's restarted
> elsewhere. However this assumes that the control plane is bigger than 1 node, so we need to fix
> that eventually.

For reference, this issue is now being tracked in its own launchpad [1], with a proposed fix [2] available for review.

[1] https://bugs.launchpad.net/tripleo/+bug/1892570
[2] https://review.opendev.org/#/c/746927/2