Fullstack tests fail because process is not killed properly

Bug #1798472 reported by Slawek Kaplonski
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

Fullstack tests have been failing quite often recently. Different tests fail in different CI runs, but the culprit appears to be the same each time: one of the processes spawned during the test is not killed properly and hangs, and the test then fails with a timeout exception.

Examples:
http://logs.openstack.org/97/602497/5/check/neutron-fullstack/f110a1f/logs/testr_results.html.gz

http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/testr_results.html.gz

In the second example it looks like some process did not exit properly: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_.txt.gz#_2018-10-16_02_43_49_755
and in this example it appears to be the openvswitch-agent: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_/neutron-openvswitch-agent--2018-10-16--02-42-43-987526.txt.gz

Looking at the logs of this OVS agent, there is no entry like "Agent caught SIGTERM, quitting daemon loop." at the end.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I used a Logstash query like: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22%2C%20line%2078%2C%20in%20stop%5C%22%20AND%20project%3A%5C%22openstack%2Fneutron%5C%22 to check those issues.
In every example I checked, the problem is always with neutron-openvswitch-agent, which doesn't catch SIGTERM for some reason.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I compared a "good" and a "bad" run of the OVS agent.
From what I see there, it looks like this agent was already "not responding".
In the "good" log there is an entry about subnet_delete, then network_delete, and then the caught SIGTERM, see:
http://logs.openstack.org/93/615893/1/gate/neutron-fullstack/bf3dc84/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_policy_rule_lifecycle_ingress,openflow-native_/neutron-openvswitch-agent--2018-11-07--11-45-30-405950.txt.gz#_2018-11-07_11_45_50_152

In the "bad" run there is info about subnet_delete and that's all: there is no info about network_delete (which happened on the server) and no info about a caught SIGTERM, see:

http://logs.openstack.org/93/615893/1/gate/neutron-fullstack/bf3dc84/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_ingress,openflow-cli_/neutron-openvswitch-agent--2018-11-07--11-42-09-570698.txt.gz#_2018-11-07_11_42_30_454

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/618024

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/618024
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9b23abbdb68f7e0c80c305ec1874281f6dea7e9e
Submitter: Zuul
Branch: master

commit 9b23abbdb68f7e0c80c305ec1874281f6dea7e9e
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 14 21:31:04 2018 +0100

    Add kill_timeout to AsyncProcess

    AsyncProcess.stop() method has now additional parameter
    kill_timeout. If this is set to some value different than
    None, eventlet.green.subprocess.Popen.wait() will be called
    with this timeout, so TimeoutExpired exception will be raised
    in case if process will not be killed for this "kill_timeout"
    time.
    In such case process will be killed "again" with SIGKILL signal
    to make sure that it is gone.

    This should fix problem with failing fullstack tests, when
    ovs_agent process is sometimes not killed and test timeout was
    reached in this wait() method.

    Change-Id: I1e12255e5e142c395adf4e67be9d9da0f7a3d4fd
    Closes-Bug: #1798472
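
The escalation described in this commit message can be sketched roughly as follows. This is only an illustration, not the neutron code: it uses the standard library subprocess module instead of eventlet.green.subprocess, and the names stop_process and spawn_child_ignoring_sigterm are hypothetical helpers invented for the example. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_process(proc, kill_signal=signal.SIGTERM, kill_timeout=None):
    """Signal a child process and wait for it to exit (hypothetical helper).

    If kill_timeout is not None and the process is still alive after that
    many seconds, it is killed again with SIGKILL. Returns True when the
    SIGKILL escalation was needed, False otherwise.
    """
    proc.send_signal(kill_signal)
    try:
        # With kill_timeout=None this blocks indefinitely, which is the
        # situation the bug report describes for a hung agent.
        proc.wait(timeout=kill_timeout)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be caught or ignored
        proc.wait()   # reap the child so it does not stay a zombie
        return True


def spawn_child_ignoring_sigterm():
    """Start a child that ignores SIGTERM, standing in for a hung agent."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has installed SIG_IGN
    return proc


if __name__ == "__main__":
    child = spawn_child_ignoring_sigterm()
    escalated = stop_process(child, kill_timeout=0.5)
    child.stdout.close()
    print("escalated to SIGKILL:", escalated)
```

Without a timeout, the wait() call would block until the test-level timeout fired; with it, the caller gets control back after a bounded delay and can force the process to exit.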

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/628396

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/628397

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/628398

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/628396
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=025e767b94cf29e522a67e3c1ddd8aa3dba9140a
Submitter: Zuul
Branch: stable/rocky

commit 025e767b94cf29e522a67e3c1ddd8aa3dba9140a
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 14 21:31:04 2018 +0100

    Add kill_timeout to AsyncProcess

    AsyncProcess.stop() method has now additional parameter
    kill_timeout. If this is set to some value different than
    None, eventlet.green.subprocess.Popen.wait() will be called
    with this timeout, so TimeoutExpired exception will be raised
    in case if process will not be killed for this "kill_timeout"
    time.
    In such case process will be killed "again" with SIGKILL signal
    to make sure that it is gone.

    This should fix problem with failing fullstack tests, when
    ovs_agent process is sometimes not killed and test timeout was
    reached in this wait() method.

    Change-Id: I1e12255e5e142c395adf4e67be9d9da0f7a3d4fd
    Closes-Bug: #1798472
    (cherry picked from commit 9b23abbdb68f7e0c80c305ec1874281f6dea7e9e)

tags: added: in-stable-rocky
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/628397
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b5a0401472246dd3efa48631766faa72634ba36b
Submitter: Zuul
Branch: stable/queens

commit b5a0401472246dd3efa48631766faa72634ba36b
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 14 21:31:04 2018 +0100

    Add kill_timeout to AsyncProcess

    AsyncProcess.stop() method has now additional parameter
    kill_timeout. If this is set to some value different than
    None, eventlet.green.subprocess.Popen.wait() will be called
    with this timeout, so TimeoutExpired exception will be raised
    in case if process will not be killed for this "kill_timeout"
    time.
    In such case process will be killed "again" with SIGKILL signal
    to make sure that it is gone.

    This should fix problem with failing fullstack tests, when
    ovs_agent process is sometimes not killed and test timeout was
    reached in this wait() method.

    Change-Id: I1e12255e5e142c395adf4e67be9d9da0f7a3d4fd
    Closes-Bug: #1798472
    (cherry picked from commit 9b23abbdb68f7e0c80c305ec1874281f6dea7e9e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/628398
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c86473d1a66830df1ce8eae28953e96ab96d53ea
Submitter: Zuul
Branch: stable/pike

commit c86473d1a66830df1ce8eae28953e96ab96d53ea
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 14 21:31:04 2018 +0100

    Add kill_timeout to AsyncProcess

    AsyncProcess.stop() method has now additional parameter
    kill_timeout. If this is set to some value different than
    None, eventlet.green.subprocess.Popen.wait() will be called
    with this timeout, so TimeoutExpired exception will be raised
    in case if process will not be killed for this "kill_timeout"
    time.
    In such case process will be killed "again" with SIGKILL signal
    to make sure that it is gone.

    This should fix problem with failing fullstack tests, when
    ovs_agent process is sometimes not killed and test timeout was
    reached in this wait() method.

    Conflicts:
        neutron/agent/linux/async_process.py

    Change-Id: I1e12255e5e142c395adf4e67be9d9da0f7a3d4fd
    Closes-Bug: #1798472
    (cherry picked from commit 9b23abbdb68f7e0c80c305ec1874281f6dea7e9e)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3

This issue was fixed in the openstack/neutron 13.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.
