tripleo-ci jobs failing intermittently with "Image prepare failed: 401 Client Error: Unauthorized"

Bug #1837388 reported by Marios Andreou
This bug affects 3 people
Affects: tripleo
Status: Triaged
Importance: High
Assigned to: Marios Andreou
Milestone: train-3

Bug Description

The tripleo-ci gate is failing intermittently during the standalone or undercloud deployment (depending on the job) with a client authentication error while running container image prepare - trace like:

        2019-07-22 09:36:32.724 17117 ERROR root [ ] Image prepare failed: 401 Client Error: Unauthorized for url: http://mirror.bhs1.ovh.openstack.org:8082/v2/tripleomaster/centos-binary-nova-compute-ironic/blobs/sha256:6d3a23ca3a1378376ca4268c06d7c7da7b25358e69ff389475e5a30b78549fbb
        Traceback (most recent call last):
          File "/usr/bin/tripleo-container-image-prepare", line 132, in <module>
            env, roles_data, cleanup=args.cleanup, dry_run=args.dry_run)
        ...
          File "/usr/lib/python2.7/site-packages/tripleo_common/image/image_uploader.py", line 1257, in _copy_registry_to_registry
            r.raise_for_status()
          File "/usr/lib/python2.7/site-packages/requests/models.py", line 940, in raise_for_status
            raise HTTPError(http_error_msg, response=self)
        HTTPError: 401 Client Error: Unauthorized for url: http://mirror.bhs1.ovh.openstack.org:8082/v2/tripleomaster/centos-binary-nova-compute-ironic/blobs/sha256:6d3a23ca3a1378376ca4268c06d7c7da7b25358e69ff389475e5a30b78549fbb

I have seen this before, but this is the first time capturing it in a bug - some recent examples in [1][2][3], all from the same review. Note, however, that those jobs passed in check.

[1] http://logs.openstack.org/26/671526/4/gate/tripleo-ci-centos-7-undercloud-containers/5610169/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz#_2019-07-22_09_36_32_724
[2] http://logs.openstack.org/26/671526/4/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/a371bbe/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz#_2019-07-22_09_29_50_590
[3] http://logs.openstack.org/26/671526/4/gate/tripleo-ci-centos-7-standalone/11b184f/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz#_2019-07-22_09_13_01_835
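For reference, a minimal sketch (assuming only the requests library; the URL is the one from the traceback above) that re-runs the failing blob fetch directly against the mirror, to check whether the mirror itself answers 401 outside of the prepare workflow:

    import requests

    # Blob URL taken verbatim from the traceback above.
    BLOB_URL = (
        "http://mirror.bhs1.ovh.openstack.org:8082/v2/tripleomaster/"
        "centos-binary-nova-compute-ironic/blobs/"
        "sha256:6d3a23ca3a1378376ca4268c06d7c7da7b25358e69ff389475e5a30b78549fbb"
    )

    resp = requests.get(BLOB_URL, stream=True, timeout=30)
    # A 401 here should come with a WWW-Authenticate header telling us
    # which auth realm/service the registry expects.
    print(resp.status_code, resp.headers.get("www-authenticate"))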

Tags: ci
Revision history for this message
Michele Baldessari (michele) wrote :
Changed in tripleo:
milestone: none → train-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/672938

Revision history for this message
Marios Andreou (marios-b) wrote :

I checked logstash [1] this morning and, with the window set to 7 days, I can see this is happening quite a bit - attaching a png of what I saw as 1837388.png

I spent a while staring at the code in https://github.com/openstack/tripleo-common/blob/master/tripleo_common/image/image_uploader.py#L1690 trying to work out where and how to make this more robust.

I see we're already using tenacity.retry for most of the functions called under upload_image, but not for upload_image itself. I've posted https://review.opendev.org/672938 for discussion and to hopefully get better ideas - a rough sketch of the idea is below the logstash link.

[1] http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Image%20prepare%20failed%5C%22
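A rough sketch of the idea (not the actual patch; the wait/stop values are illustrative only):

    import requests
    import tenacity

    # Retry the whole operation on HTTP errors, the same way the helper
    # functions called under upload_image already use tenacity.
    @tenacity.retry(
        reraise=True,
        retry=tenacity.retry_if_exception_type(requests.exceptions.HTTPError),
        wait=tenacity.wait_fixed(15),
        stop=tenacity.stop_after_attempt(3),
    )
    def upload_image(task):
        # Placeholder standing in for the real upload_image() in
        # image_uploader.py; in tripleo_common the 401 from the traceback
        # above is raised further down, in _copy_registry_to_registry().
        pass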

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So if you look in the logs, we are already retrying the 401. It looks as though the content changed while the job was running, such that the expected layer no longer exists. I don't think we're changing layers that frequently, so it seems a bit odd that this fails so often. I wonder if we're getting bad caching / stale information from the OVH mirrors.
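One way to check the stale-cache theory would be something like the sketch below: ask both the mirror and the upstream registry for the same blob and compare the answers (the upstream host here is a placeholder, not the real source registry):

    import requests

    # Blob path taken from the traceback above; host names are assumptions.
    PATH = (
        "/v2/tripleomaster/centos-binary-nova-compute-ironic/blobs/"
        "sha256:6d3a23ca3a1378376ca4268c06d7c7da7b25358e69ff389475e5a30b78549fbb"
    )

    for host in ("http://mirror.bhs1.ovh.openstack.org:8082",  # CI mirror
                 "https://registry.example.org"):              # placeholder upstream
        resp = requests.head(host + PATH, allow_redirects=True, timeout=30)
        # 401/404 from the mirror but 200 from upstream would point at a
        # stale or bad cache entry on the mirror.
        print(host, resp.status_code)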

Changed in tripleo:
milestone: train-2 → train-3
Revision history for this message
Rafael Folco (rafaelfolco) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Marios Andreou (<email address hidden>) on branch: master
Review: https://review.opendev.org/672938
Reason: was just posted for discussion in https://launchpad.net/bugs/1837388

Revision history for this message
Sorin Sbarnea (ssbarnea) wrote :