(intermittent?) cli-enable-ssh-admin.yaml fails during the overcloud deploy

Bug #1863920 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman
Milestone: (none listed)

Bug Description

In the periodic job [1] and the check job [2], the overcloud deployment is failing during the
"Run Create admin" play with a trace like:

        2020-02-18 21:18:48 | fatal: [192.168.24.17]: UNREACHABLE! => changed=false
        2020-02-18 21:18:48 | msg: |-
        2020-02-18 21:18:48 | Data could not be sent to remote host "192.168.24.17". Make sure this host can be reached over ssh: Warning: Permanently added '192.168.24.17' (ECDSA) to the list of known hosts.
        2020-02-18 21:18:48 | heat-admin@192.168.24.17: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
        2020-02-18 21:18:48 | unreachable: true
        2020-02-18 21:18:48 |
...
        2020-02-18 21:18:49 | ansible_runner.exceptions.AnsibleRunnerException: Ansible execution failed. playbook: /usr/share/ansible/tripleo-playbooks/cli-enable-ssh-admin.yaml, Run Status: failed, Return Code: 2
        2020-02-18 21:18:49 | Ansible execution failed. playbook: /usr/share/ansible/tripleo-playbooks/cli-enable-ssh-admin.yaml, Run Status: failed, Return Code: 2
        2020-02-18 21:18:49 | sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 53906), raddr=('192.168.24.2', 13808)>

This does not seem to be a consistent error, but I have seen it at least twice now, so we need a bug.

[1] https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/441b0b8/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/20/708620/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/a775981/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Trying to debug this with an Ansible debug change here: https://review.opendev.org/#/c/708620/

Alex Schultz (alex-schultz) wrote :

It's likely a race condition with cloud-init. Prior to Ansible, we'd wait until the heat service came up to fetch the SSH key bits. With Ansible, the playbook can try to connect to a host that is already up before cloud-init has created the user.
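
A race like this is commonly worked around by retrying the login until it succeeds instead of failing on the first "Permission denied". The following is only a minimal sketch of that idea and is not the fix that was merged for this bug. The `probe` callable, the function name, and the attempt and delay defaults are all hypothetical.

```python
import time
from typing import Callable


def wait_for_login(probe: Callable[[], bool],
                   attempts: int = 10,
                   delay: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() until it returns True; return the number of attempts used.

    probe is a hypothetical check supplied by the caller, for example one that
    tries an SSH login as the admin user and reports whether it succeeded.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            # Give cloud-init more time to create the user before retrying.
            sleep(delay)
    raise TimeoutError(f"login still failing after {attempts} attempts")
```

Injecting `sleep` keeps the helper easy to exercise without real delays; a caller would normally leave it at the default.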

wes hayutin (weshayutin) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/708717

Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Kevin Carter (kevin-carter)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/708781

Changed in tripleo:
assignee: Kevin Carter (kevin-carter) → Sagi (Sergey) Shnaidman (sshnaidm)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/708717
Reason: https://review.opendev.org/#/c/708781

OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.opendev.org/708781
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=af749d1a6a6691bf5bf6f69ad7beede78df0379c
Submitter: Zuul
Branch: master

commit af749d1a6a6691bf5bf6f69ad7beede78df0379c
Author: Kevin Carter <email address hidden>
Date: Wed Feb 19 20:15:22 2020 -0600

    Use correct default key file and normalize the usage

    The provision command was defaulting to id_rsa.pub, however the deploy
    command uses id_rsa_tripleo for initial setup.

    When using the deploy command for provision as well, use the public
    key, not the private id_rsa_tripleo.

    This option was being processed in several different ways; this change
    normalizes it by creating a single function in the Command class, which
    all inheriting methods will consume. Tests have been updated to
    accommodate this change.

    Related-Bug: #1863920
    Change-Id: I221480f3cfc77545a8fcbef777829239c3bad0a0
    Signed-off-by: Kevin Carter <email address hidden>
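
The normalization described in this commit message can be pictured as one helper that every command calls, so the private and public key paths are always derived the same way. This is a hedged sketch only, not the code from python-tripleoclient: the function name `resolve_key_files`, the default directory `~/.ssh`, and the rule "public key is the private key path plus `.pub`" are assumptions made for illustration.

```python
import os
from typing import Optional, Tuple

# Assumed default location; the commit message only names the file id_rsa_tripleo.
DEFAULT_KEY = "~/.ssh/id_rsa_tripleo"


def resolve_key_files(key_option: Optional[str] = None) -> Tuple[str, str]:
    """Return (private_key_path, public_key_path) for a single key option.

    Hypothetical helper: deploy-style callers would use the private path,
    provision-style callers the public one.
    """
    private = os.path.expanduser(key_option or DEFAULT_KEY)
    return private, private + ".pub"
```

With a single resolver, a provisioning step that needs the public key and a deploy step that needs the private key can no longer disagree about which key pair is in use.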

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/709026

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/709027

OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (stable/train)

Change abandoned by Kevin Carter (cloudnull) (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/709026
Reason: bad backport

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/707658
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=655157e444e6e9148bb2e4ffbe9259cecd4615a2
Submitter: Zuul
Branch: master

commit 655157e444e6e9148bb2e4ffbe9259cecd4615a2
Author: Kevin Carter <email address hidden>
Date: Thu Feb 13 08:34:15 2020 -0600

    Improve execution and add a port check

    This change introduces a fact check for neutron ports which will allow us
    to pull a list of used IP addresses from our known port list, which
    contains fixed_addresses. This port list will then be used to determine the
    default ssh_user when first running a deployment. By pulling the neutron
    facts and using the information to dictate the access user we'll be able
    to support both pre-provisioned nodes and ironic provisioned nodes at the
    same time within the same playbook.

    The cli-enable-ssh-admin.yaml makes several API intensive calls to heat
    and neutron, so to speed things up we're using ansible async. This change
    moves our api intensive calls to the top of the playbooks and blocks on
    their completion before they're needed. By doing this we'll improve the
    overall playbook execution time.

    Closes-Bug: #1863920
    Change-Id: Ib79747ee7212534ab7c58d8a3e0e1d33f6069485
    Depends-On: I221480f3cfc77545a8fcbef777829239c3bad0a0
    Signed-off-by: Kevin Carter <email address hidden>
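
The two ideas in this commit message (start the slow API call early and block on it only when its result is needed, then choose the SSH user per host from the known port addresses) can be sketched in Python as follows. This is an illustration under stated assumptions, not the playbook's logic: a thread pool stands in for Ansible async, `fetch_ports` is a hypothetical callable, the port dictionaries are assumed to carry a `fixed_ips` list with `ip_address` entries, and which user maps to known versus unknown addresses is an assumption supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set


def used_addresses(ports: Iterable[dict]) -> Set[str]:
    """Collect every IP address listed in the ports' fixed_ips entries."""
    return {ip["ip_address"] for port in ports for ip in port.get("fixed_ips", [])}


def plan_ssh_users(hosts: List[str],
                   fetch_ports: Callable[[], Iterable[dict]],
                   port_user: str,
                   fallback_user: str) -> Dict[str, str]:
    """Map each host IP to an SSH user based on whether a known port owns the IP."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the slow API call first so it runs in the background.
        pending = pool.submit(fetch_ports)
        # Other setup work could run here while the call is in flight.
        # Block only at the point where the result is actually needed.
        used = used_addresses(pending.result())
    return {host: (port_user if host in used else fallback_user) for host in hosts}
```

A caller would pass its own user names, for example `plan_ssh_users(hosts, fetch_ports, "heat-admin", "deployer")`, where `deployer` is a made-up name for the user on hosts that have no known port.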

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/train)

Reviewed: https://review.opendev.org/709027
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=b5bc064b0c333c8c333670c98216a89aae1dc5de
Submitter: Zuul
Branch: stable/train

commit b5bc064b0c333c8c333670c98216a89aae1dc5de
Author: Kevin Carter <email address hidden>
Date: Wed Feb 19 20:15:22 2020 -0600

    Use correct default key file and normalize the usage

    The provision command was defaulting to id_rsa.pub, however the deploy
    command uses id_rsa_tripleo for initial setup.

    When using the deploy command for provision as well, use the public
    key, not the private id_rsa_tripleo.

    This option was being processed in several different ways; this change
    normalizes it by creating a single function in the Command class, which
    all inheriting methods will consume. Tests have been updated to
    accommodate this change.

    Related-Bug: #1863920
    Change-Id: Ib4ee480a99c0388c526e4a90a8a1db7d1747276a
    Signed-off-by: Kevin Carter <email address hidden>
    (cherry picked from commit af749d1a6a6691bf5bf6f69ad7beede78df0379c)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.3.0

This issue was fixed in the openstack/tripleo-ansible 1.3.0 release.
