Incorrect remote logs symlinks on master node lead to deployment failure after several redeployments

Bug #1506112 reported by Iryna Vovk
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Dmitry Guryanov
6.1.x
Won't Fix
High
Rodion Tikunov
7.0.x
Fix Released
High
Rodion Tikunov
8.0.x
Fix Released
High
Rodion Tikunov
Mitaka
Fix Released
High
Dmitry Guryanov

Bug Description

During Contrail Plugin scale deployment with 6.1 release iso version we received such error:

OSError: [Errno 17] File exists

Steps to reproduce:
1. Install Contrail plugin on the master node;
1. Create cluster with gre segmentation and select ceph OSD;
2. Add 3 nodes with controller + mongo DB roles,
add 171 computes, 20 compute + ceph OSD roles and 3 base-os roles;
3. Configure interfaces;
4. Deploy cluster.

Actual result:
Deployment has failed with such error:

[7fcd6c1aa740] (manager) Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 58, in _call_silently
    to_return = method(task, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/nailgun/task/task.py", line 256, in message
    logs_utils.prepare_syslog_dir(node)
  File "/usr/lib/python2.6/site-packages/nailgun/utils/logs.py", line 81, in prepare_syslog_dir
    os.symlink(str(node.fqdn), l)
OSError: [Errno 17] File exists

additional info:
api: '1.0'
astute_sha: 1ea8017fe8889413706d543a5b9f557f5414beae
auth_required: true
build_id: 2015-06-19_13-02-31
build_number: '525'
feature_groups:
- mirantis
fuel-library_sha: 2e7a08ad9792c700ebf08ce87f4867df36aa9fab
fuel-ostf_sha: 8fefcf7c4649370f00847cc309c24f0b62de718d
fuelmain_sha: a3998372183468f56019c8ce21aa8bb81fee0c2f
nailgun_sha: dbd54158812033dd8cfd7e60c3f6650f18013a37
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'
release_versions:
  2014.2.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 1ea8017fe8889413706d543a5b9f557f5414beae
      build_id: 2015-06-19_13-02-31
      build_number: '525'
      feature_groups:
      - mirantis
      fuel-library_sha: 2e7a08ad9792c700ebf08ce87f4867df36aa9fab
      fuel-ostf_sha: 8fefcf7c4649370f00847cc309c24f0b62de718d
      fuelmain_sha: a3998372183468f56019c8ce21aa8bb81fee0c2f
      nailgun_sha: dbd54158812033dd8cfd7e60c3f6650f18013a37
      openstack_version: 2014.2.2-6.1
      production: docker
      python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
      release: '6.1'

Iryna Vovk (ivovk)
Changed in fuel:
milestone: none → 6.1-mu-4
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

Iryna, please provide diagnostic snapshot.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Iryna Vovk (ivovk) wrote :
Iryna Vovk (ivovk)
Changed in fuel:
status: Incomplete → New
importance: Undecided → Critical
Revision history for this message
Andrian Noga (anoga) wrote :

We need to fix it ASAP due the importance. On of our customer need Contrail plugin on 2 weeks. So i put status as Critical

Iryna Vovk (ivovk)
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Revision history for this message
salmon (salmon) wrote :

I think problem starts here:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/nailgun/utils/__init__.py", line 47, in remove_silently
    shutil.rmtree(path)
  File "/usr/lib64/python2.6/shutil.py", line 212, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/usr/lib64/python2.6/shutil.py", line 221, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib64/python2.6/shutil.py", line 219, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/var/log/remote/10.20.0.51/bootstrap'

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 6.1-mu-4 → 8.0
importance: Critical → High
status: New → Confirmed
Revision history for this message
Andriy Popovych (popovych-andrey) wrote :
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Dmitry Pyzhov (dpyzhov)
tags: added: team-bugfix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: High → Medium
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

We passed SCF in 8.0. Moving the bug to 9.0.

Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Tricky issue with low reproducibility and known workaround. Moving to 10.0.

summary: - Scale deployment has failed on 200 nodes with 6.1 release iso version
+ Incorrect remote logs symlinks on master node lead to deployment failure
+ after several redeployments
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Workaround: remove broken symlinks in /var/log/remote directory.

Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Here is root cause analysis from duplicate bug:

The error can occur if master node aperates slowly.

In this case it can execute code from

nailgun/nailgun/utils/logs.py

    # rename bootstrap directory into fqdn
    if os.path.islink(old):
        logger.debug("Old %s exists and it is link. "
                     "Trying to unlink", old)
        os.unlink(old)
    if os.path.isdir(old):
        logger.debug("Old %s exists and it is directory. "
                     "Trying to rename into %s", old, new)
        os.rename(old, new)
    else:
        logger.debug("Creating %s", new)
        os.makedirs(new)

    # creating symlinks
    for l in links:
        if os.path.islink(l) or os.path.isfile(l):
            logger.debug("%s already exists. "
                         "Trying to unlink", l)
            os.unlink(l)
        if os.path.isdir(l):
            logger.debug("%s already exists and it directory. "
                         "Trying to remove", l)
            shutil.rmtree(l)
---------------------
Then syslog server continues receiving messaqges from node, it recreates missing <ip>/bootstrap directory and writes some files there. Then nailgun process continues
------------------------
        logger.debug("Creating symlink %s -> %s", l, new)
        os.symlink(objects.Node.get_node_fqdn(node), l) <<<<<<<<<<<<<<< error!

Changed in fuel:
importance: Medium → High
assignee: Fuel Python (Deprecated) (fuel-python) → Fuel Sustaining (fuel-sustaining-team)
assignee: Fuel Sustaining (fuel-sustaining-team) → Dmitry Guryanov (dguryanov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/318544

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/318544
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=6e49910fcadf52cdeff235b0795f0dde2f63f60a
Submitter: Jenkins
Branch: master

commit 6e49910fcadf52cdeff235b0795f0dde2f63f60a
Author: Dmitry Guryanov <email address hidden>
Date: Thu May 19 12:14:56 2016 +0300

    fix race with syslog while renaming remote log dir

    When a node gets new name nailgun renmaes log directoty
    which had name in for mof <IP> to <node name> and then
    creates symlink <IP> -> <node name>. The problem is that
    node is still active and this moment and sends some logs
    to syslog on master node.

    So when we renamed <IP> to <node name> and haven't created
    symlink name syslog may find missing log dir and it creates
    it again in this case. Then nailgun tries to create symlink
    and fails.

    To avoid this situation we can suspend rsyslog temporarity.
    It's better to do it with SIGSTOP/SIGCONT, because process
    exist in this case and remote logs will be held in kernel
    tcp or odp buffers until we wake up rsyslog and it process
    new messages. Fortunately centos-7 uses journald together
    with rsyslog, so local logs (send by syslog function) are
    handled by journald and rsyslog just get it from journald
    files. So it's safe to stop rsyslog temporarily, nothing
    will be droppped.

    Change-Id: Icb1ed6669be2f7b448c465de33de37e0b945fb79
    Closes-Bug: #1506112

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/323193

Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

ETA: 03.06.2016

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/mitaka)

Reviewed: https://review.openstack.org/323193
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=019ad234af0f8344102d3342587312a7ec5256fa
Submitter: Jenkins
Branch: stable/mitaka

commit 019ad234af0f8344102d3342587312a7ec5256fa
Author: Dmitry Guryanov <email address hidden>
Date: Thu May 19 12:14:56 2016 +0300

    fix race with syslog while renaming remote log dir

    When a node gets new name nailgun renmaes log directoty
    which had name in for mof <IP> to <node name> and then
    creates symlink <IP> -> <node name>. The problem is that
    node is still active and this moment and sends some logs
    to syslog on master node.

    So when we renamed <IP> to <node name> and haven't created
    symlink name syslog may find missing log dir and it creates
    it again in this case. Then nailgun tries to create symlink
    and fails.

    To avoid this situation we can suspend rsyslog temporarity.
    It's better to do it with SIGSTOP/SIGCONT, because process
    exist in this case and remote logs will be held in kernel
    tcp or odp buffers until we wake up rsyslog and it process
    new messages. Fortunately centos-7 uses journald together
    with rsyslog, so local logs (send by syslog function) are
    handled by journald and rsyslog just get it from journald
    files. So it's safe to stop rsyslog temporarily, nothing
    will be droppped.

    Change-Id: Icb1ed6669be2f7b448c465de33de37e0b945fb79
    Closes-Bug: #1506112
    (cherry picked from commit 6e49910fcadf52cdeff235b0795f0dde2f63f60a)

tags: added: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/330993

Revision history for this message
Dmitry Guryanov (dguryanov) wrote :

Since it's hard to reproduce the problem (it's a race), I've checked code, and the patch is there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/8.0)

Change abandoned by Rodion Tikunov (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/330993
Reason: Approach used in this patch does not work in <=8.0. In MOS=<8.0 nailgun works in docker container and can not communicate by signals with rsyslog which works in another container

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/332001

Revision history for this message
Rodion Tikunov (rtikunov) wrote :

Steps to reproduce:
1) Configure cluster with 1 controller and 1 compute nodes
2) In nailgun container change modify file /usr/lib/python2.7/site-packages/nailgun/utils/logs.py by adding time.sleep(10) before os.symlink(objects.Node.get_node_fqdn(node), l)
3) rm /usr/lib/python2.7/site-packages/nailgun/utils/logs.py[oc]
4) touch /var/lib/nailgun-uwsgi - after that new logs.pyc should be created
5) On nodes (which in bootstrap stage) start to create extensive log traffic:
while true; do
    logger -t nailgun-agent "test123 tCDqdvB2+xbeYNU68n3Xd6q08kDZLdtFTz57xfTKn53icDSWr6gVm3SP842938jf908j098js9djf9sjdf9sj9jjjeroiwjroij8934nkjnwlkjrhf87f78hf3n4f24no2h478shf8shd8hfsdfhsd78w7y9r7hwe8rhwohrefw879erhf897werhf98w7hresiojoidjfviusj+0NBQIlFXpqpqdIEXA15 `date`";
done
6) Deploy cluster

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.1-updates because 6.1 is in Limited support now

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/367245

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/8.0)

Reviewed: https://review.openstack.org/332001
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=11ce99ab4e46fad061edd99c6342ea647f0acd75
Submitter: Jenkins
Branch: stable/8.0

commit 11ce99ab4e46fad061edd99c6342ea647f0acd75
Author: Rodion Tikunov <email address hidden>
Date: Mon Jun 20 18:48:12 2016 +0300

    fix race with syslog while renaming remote log dir

    When a node gets new name nailgun renames log directory which had name
    in form of <IP> to <node name> and then creates symlink <IP> ->
    <node name>. The problem is that node is still active at this moment
    and sends some logs to syslog on master node. So when we renamed <IP>
    to <node name> and haven't created symlink name syslog may find
    missing log dir and it creates it again in this case. Then nailgun
    tries to create symlink and fails.

    To avoid this situation this patch adds attempts to remove a directory
    and create symlink until symlink will being successfully created.

    Approach used in master does not work in <=8.0. In MOS=<8.0 nailgun
    and rsyslog work in different PID namespaces and can not interact
    with each other by signals.

    Change-Id: Ia402508dac427909aa1593787f9aba54f2df05e4
    Closes-bug: #1506112

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/7.0)

Reviewed: https://review.openstack.org/367245
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=b1da633fdac0f9eab6ee05d902dfbd3e9d57be05
Submitter: Jenkins
Branch: stable/7.0

commit b1da633fdac0f9eab6ee05d902dfbd3e9d57be05
Author: Rodion Tikunov <email address hidden>
Date: Mon Jun 20 18:48:12 2016 +0300

    fix race with syslog while renaming remote log dir

    When a node gets new name nailgun renames log directory which had name
    in form of <IP> to <node name> and then creates symlink <IP> ->
    <node name>. The problem is that node is still active at this moment
    and sends some logs to syslog on master node. So when we renamed <IP>
    to <node name> and haven't created symlink name syslog may find
    missing log dir and it creates it again in this case. Then nailgun
    tries to create symlink and fails.

    To avoid this situation this patch adds attempts to remove a directory
    and create symlink until symlink will being successfully created.

    Approach used in master does not work in <=8.0. In MOS=<8.0 nailgun
    and rsyslog work in different PID namespaces and can not interact
    with each other by signals.

    Change-Id: Ia402508dac427909aa1593787f9aba54f2df05e4
    Closes-bug: #1506112
    (cherry picked from commit 11ce99ab4e46fad061edd99c6342ea647f0acd75)

Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on MOS 7.0 + MU6 updates using steps from [0].

Before updates bug was reproduced - deployment has failed with error "[Errno 17] File exists".
After updates deployment has finished successfully.

[0] https://bugs.launchpad.net/fuel/+bug/1506112/comments/20

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on MOS 8.0 + MU4 updates.

Actual results:
Before updates bug was reproduced - deployment has failed with error "[Errno 17] File exists".
After updates deployment has finished successfully.

tags: removed: on-verification
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.