MTC openstack command failed when standby controller is down

Bug #1888546 reported by Yvonne Ding
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: chen haochuan
Milestone: —

Bug Description

Brief Description
-----------------
The OpenStack command 'hypervisor list' fails when the standby controller is down. The same failure applies to the 'server list' command.

Severity
--------
Major

Steps to Reproduce
------------------
1. Reset the standby controller
2. Attempt to swact to the failed controller
3. Verify that the swact didn't occur

or
1. All agents are alive and up
2. Launch a vm
3. Reboot active controller node AND wait for reboot complete
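
A hedged sketch of the first path, using the same commands that appear in the logs below (host names match this lab; using a reboot as the reset method is an assumption):

controller-0:~$ sudo reboot                          # reset the standby controller (controller-0 here)
controller-1:~$ system host-list                     # standby should show disabled/offline
controller-1:~$ system host-swact controller-1       # attempt the swact; expected to be rejected
controller-1:~$ openstack hypervisor list            # expected to succeed, but returns HTTP 500 on this bug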

TC-name:
test_swact_failed_controller_negative
or
test_system_persist_over_host_reboot

Expected Behavior
-----------------
The openstack hypervisor list command succeeds
or
The openstack server list command succeeds

Actual Behavior
----------------
The openstack hypervisor list command fails
or
The openstack server list command fails

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Regular standard 2+2

Lab-name:
wcp_7_10

Branch/Pull Time/Commit
-----------------------
BUILD_ID="r/stx.4.0"

Timestamp/Logs
--------------
[2020-07-22 15:16:29,805] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | offline |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-07-22 15:16:36,912] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2020-07-22 15:16:38,559] 436 DEBUG MainThread ssh.expect :: Output:
 controller-0 is not enabled and has operational state disabled.Standby controller must be operationally enabled.

[2020-07-22 15:16:53,378] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list'
[2020-07-22 15:17:04,528] 436 DEBUG MainThread ssh.expect :: Output:
Internal Server Error (HTTP 500)

[2020-07-22 15:27:16,194] 61 DEBUG MainThread conftest.update_results :: ***Failure at test teardown: /home/yding/cgcsAuto/cgcs/CGCSAuto/utils/cli.py:157: utils.exceptions.CLIRejected: CLI command is rejected.
2565 ***Details: tp = <class 'utils.exceptions.CLIRejected'>, value = None, tb = None

The .tar logs are available at:
https://files.starlingx.kube.cengn.ca/launchpad/1888546

Test Activity
-------------
Regression

Yvonne Ding (yding)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Issue seen in openstack regression after the rebase to Ussuri; possibly related to the rebase. Marking as stx.4.0 gating for now, pending further investigation by the distro.openstack team.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0 stx.distro.openstack
description: updated
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
Yvonne Ding (yding)
description: updated
yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

The uploaded log file doesn't match the bug description. The log file is dated 20200720 (ALL_NODES_20200720.091540), but per the issue description the failure happened on 20200722, and there is no such command history in bash.log:

[2020-07-22 15:16:36,912] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.144.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'

Please make sure the correct log file is attached.

I will try to reproduce based on the bug description.

Revision history for this message
chen haochuan (martin1982) wrote :

Issue reproduced:

controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image list
Internal Server Error (HTTP 500)
controller-0:~$
controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list
Internal Server Error (HTTP 500)
controller-0:~$
controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne flavor list
Internal Server Error (HTTP 500)
controller-0:~$
controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list

Internal Server Error (HTTP 500)
controller-0:~$
controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne endpoint list

Internal Server Error (HTTP 500)
controller-0:~$
controller-0:~$
controller-0:~$ system --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://192.188.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | disabled | offline |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
controller-0:~$
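
Since every OpenStack CLI call above returns HTTP 500 while the platform "system" CLI still works, a hedged next diagnostic step is to pull the keystone API pod logs (the label selector follows openstack-helm conventions and is an assumption here):

controller-0:~$ kubectl logs -n openstack -l application=keystone,component=api --tail=50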

Revision history for this message
chen haochuan (martin1982) wrote :

There is only one mariadb-server pod per controller. When the standby controller goes down, the mariadb service becomes unavailable, so the keystone-api service doesn't work and "openstack server list" fails.

controller-0:~$ kubectl get pods -n openstack | grep mariadb
mariadb-ingress-797fcf4f55-9pb4f 1/1 Running 0 44m
mariadb-ingress-797fcf4f55-wd58l 1/1 Running 0 15m
mariadb-ingress-error-pages-6755f56fbf-k4rrm 1/1 Running 0 32m
mariadb-server-0 1/1 Running 1 6m17s
mariadb-server-1 0/1 Running 5 35m
controller-0:~$ system --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://192.188.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
controller-0:~$
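
Given that mariadb-server-1 shows 0/1 above, a hedged way to confirm the Galera quorum state from inside the healthy pod (the authentication flags are an assumption; the pod may prompt for the database root password):

controller-0:~$ kubectl exec -it -n openstack mariadb-server-0 -- \
    mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_cluster_size'"

A healthy cluster reports wsrep_cluster_status = Primary; when quorum is lost, the surviving members report non-Primary and refuse queries, which is what surfaces as HTTP 500 from every OpenStack API.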

Revision history for this message
yong hu (yhu6) wrote :

A bit of progress made today: the same test case passes on Duplex but fails on a multi-node deployment. The major difference is that "garbd" is enabled on multi-node deployments but not on Duplex.

Needs further debugging.
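
A quick, hedged way to confirm that difference on a given deployment (the pod name is an assumption based on the chart name):

controller-0:~$ kubectl get pods -n openstack | grep garbd    # present on multi-node (2+2), absent on Duplex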

Revision history for this message
yong hu (yhu6) wrote :

Suggest continuing to address this LP in the stx.4.0 maintenance releases.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744486

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
tags: added: not-yet-in-r-stx40
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/744486
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=2f664927c4d990c2151677da5be2b0d0ac355121
Submitter: Zuul
Branch: master

commit 2f664927c4d990c2151677da5be2b0d0ac355121
Author: Martin, Chen <email address hidden>
Date: Mon Aug 3 21:40:02 2020 +0800

    Update mariadb-server suspect_timeout to default value to align
    with garbd's suspect_timeout

    In openstack-helm-infra, mariadb-server is launched with
    evs.suspect_timeout=PT30S via the mariadb-etc configmap. That
    setting was written for a three-pod mariadb-server deployment,
    where every mariadb-server uses the same suspect_timeout=30s.
    After the change to two mariadb-server pods plus one garbd
    arbitrator, the evs.suspect_timeout=PT30S setting in the
    mariadb-etc configmap only takes effect for the two
    mariadb-server pods; the garbd arbitrator uses the galera
    default, evs.suspect_timeout=PT5S. If mariadb-server-1 exits
    abnormally, the garbd arbitrator suspects mariadb-server-1 is
    dead after 5s, but because 30s has not yet elapsed,
    mariadb-server-0 still considers it alive. In this state quorum
    fails: the garbd arbitrator and mariadb-server-0 both become
    non-primary components and the service goes down.
    As the fix, set value.conf.data.config_override to override
    wsrep_provider_options in the mariadb helm chart, so that the
    garbd arbitrator and mariadb-server launch with the same
    default setting, evs.suspect_timeout=PT5S. This also improves
    mariadb-server recovery time. Any future change to
    evs.suspect_timeout must update the overrides for both the
    mariadb and garbd helm charts.

    The setting gmcast.listen_addr=tcp://0.0.0.0:<port> takes
    effect for both IPv4 and IPv6, so it is kept.

    Reference links for wsrep options and galera cluster quorum:
    https://mariadb.com/kb/en/wsrep_provider_options/
    https://galeracluster.com/library/documentation/weighted-quorum.html

    Reference link for wsrep option and galera cluster quorum
    https://mariadb.com/kb/en/wsrep_provider_options/
    https://galeracluster.com/library/documentation/weighted-quorum.html

    Closes-Bug: 1888546

    Change-Id: I06983cf0d91d4d9aa88f352e64b1e6571b816ec6
    Signed-off-by: Martin, Chen <email address hidden>
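
For illustration, a hedged sketch of applying such an override by hand on StarlingX. The conf.database.config_override key path follows the openstack-helm mariadb chart and is an assumption here (the commit refers to value.conf.data.config_override); <port> is left as a placeholder, as in the commit message:

controller-0:~$ cat > mariadb-overrides.yaml <<'EOF'
conf:
  database:
    config_override: |
      [mysqld]
      wsrep_provider_options="evs.suspect_timeout=PT5S; gmcast.listen_addr=tcp://0.0.0.0:<port>"
EOF
controller-0:~$ system helm-override-update stx-openstack mariadb openstack --values mariadb-overrides.yaml
controller-0:~$ system application-apply stx-openstack

Note that this sketch covers only the mariadb chart; per the commit message, any change to evs.suspect_timeout must be mirrored in the garbd chart override as well.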

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.4.0)

Fix proposed to branch: r/stx.4.0
Review: https://review.opendev.org/747120

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/747124

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/747125

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.4.0)

Reviewed: https://review.opendev.org/747120
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=4340d9e8e6f3e3e39afe167d1d21268923b65482
Submitter: Zuul
Branch: r/stx.4.0

commit 4340d9e8e6f3e3e39afe167d1d21268923b65482
Author: Martin, Chen <email address hidden>
Date: Thu Aug 20 16:11:02 2020 +0800

    Update mariadb-server suspect_timeout to default value to align
    with garbd's suspect_timeout

    (Commit message body identical to the master commit above.)

    Closes-Bug: 1888546

    Change-Id: Ie26fd33488616b13bc16102dd92d1be2143c82ce
    Signed-off-by: Martin, Chen <email address hidden>

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding the stx.3.0 label since the dev prime is proposing to fix this in stx.3.0 as well

tags: added: stx.3.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.3.0)

Reviewed: https://review.opendev.org/747124
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=245023894acfb163b4ed73ccded72914550d982c
Submitter: Zuul
Branch: r/stx.3.0

commit 245023894acfb163b4ed73ccded72914550d982c
Author: Martin, Chen <email address hidden>
Date: Thu Aug 20 16:26:50 2020 +0800

    Update mariadb-server suspect_timeout to default value to align
    with garbd's suspect_timeout

    (Commit message body identical to the master commit above.)

    Closes-Bug: 1888546

    Change-Id: I92af77fab929c9f598b7dc41543db6ad6238f812
    Signed-off-by: Martin, Chen <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.3.0)

Reviewed: https://review.opendev.org/747125
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=ef4745c7fd583a1a44e8dcb2631a7634150b44b6
Submitter: Zuul
Branch: r/stx.3.0

commit ef4745c7fd583a1a44e8dcb2631a7634150b44b6
Author: Martin, Chen <email address hidden>
Date: Thu Aug 20 16:33:26 2020 +0800

    Update mariadb-server suspect_timeout to default value to align
    with garbd's suspect_timeout

    (Commit message body identical to the master commit above.)

    Closes-Bug: 1888546

    Depends-on: https://review.opendev.org/#/c/747093/

    Change-Id: Ie648745b5670fffb3a513a312e0b085cfea4544a
    Signed-off-by: Martin, Chen <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx30