test_keepalived_respawns fails when trying to kill -15

Bug #1490043 reported by Assaf Muller
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Armando Migliaccio

Bug Description

test_keepalived_respawns spawns keepalived, asserts that it's up, kills it, then waits for the process monitor to respawn it. Sometimes, the test seems to fail when sending signal 15 to the process.

Logstash:
message:"pm.disable(sig='15')'" AND tags:"console"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwicG0uZGlzYWJsZShzaWc9JzE1JyknXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0NDA4MDA0MTU5OTZ9

3 hits in the last 7 days.

Example console log:
http://logs.openstack.org/91/215491/2/gate/gate-neutron-dsvm-functional/a5ea84a/console.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/218578

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

5 hits in 7 days

Changed in neutron:
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/218578
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc20673d673113974c78a2b17a9ff4da47ad6665
Submitter: Jenkins
Branch: master

commit cc20673d673113974c78a2b17a9ff4da47ad6665
Author: Assaf Muller <email address hidden>
Date: Sat Aug 29 11:32:19 2015 -0400

    Add info to debug test_keepalived_respawns gate failure

    Current theory is that there's a bug in external_process.active,
    it returns True when it shouldn't, then kill -15 on the process
    pid fails because the process isn't up. Added ps -p output to
    see if the process is up or not.

    Change-Id: Ic062be829d5f05a1294f6b2fa54820422871ffcb
    Related-Bug: #1490043

Changed in neutron:
milestone: none → liberty-rc1
Revision history for this message
Jakub Libosvar (libosvar) wrote :

Maybe this errors might be interesting - I'm not keepalived expert but someone might tell us whether keepalived terminates itself between process.active and process.disable()

Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived[22310]: Starting Keepalived v1.2.7 (08/14,2013)
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived[22312]: Starting VRRP child process, pid=22313
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Interface queue is empty
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, ovs-system
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, dummy0
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, test-br97456312
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, br-int183506321
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, br-tun481397143
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: No such interface, test-veth018779
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Registering Kernel netlink reflector
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Registering Kernel netlink command channel
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Error while registering gratuitous ARP shared channel
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Initializing ipvs 2.6
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived[22312]: Stopping Keepalived v1.2.7 (08/14,2013)
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 modprobe: ERROR: could not insert 'ip_vs': Operation not permitted
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: IPVS: Can't initialize ipvs: Permission denied (you must be root)
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Configuration is using : 55952 Bytes
Sep 3 07:00:32 devstack-trusty-hpcloud-b3-4739380 Keepalived_vrrp[22313]: Using LinkWatch kernel netlink reflector...

Also note that keepalived is actually 22313 [1] while we send sigterm to 22312 [2] which is likely a parent process of keepalived (maybe sudo, maybe rootwrap). I'm investigating how ProcessManager makes sure the actual child process is written to pid file so it doesn't contain wrong values in case machine is too slow and children weren't born yet.

[1] http://logs.openstack.org/91/189391/18/check/gate-neutron-dsvm-functional/41e8980/logs/syslog.txt.gz#_Sep__3_07_00_32
[2] http://logs.openstack.org/91/189391/18/check/gate-neutron-dsvm-functional/41e8980/testr_results.html.gz

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Ok, got it. keepalived parent creates the pid file so we don't need to check if the pid file contains correct pid. And 22313 is vrrp child while 22312 is the keepalived parent. So that eliminates option we send signal to bad process.

Changed in neutron:
importance: Medium → High
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Jakub: so does it mean you're taking over this one?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The fact that kill doesn't return anything on stderr is puzzling...I wish we would have used os.kill instead of shelling out to invoke kill.

Having said that, I wonder if a better way to simulate the situation you want to test here is to signal -9 without using process.disable.

 We could do after checking that signal 0 has returned positively. That is ultimately the green flag that you can simulate the process crash.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/222460
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5405d9742b94f203389f555c56727a66925e9454
Submitter: Jenkins
Branch: master

commit 5405d9742b94f203389f555c56727a66925e9454
Author: armando-migliaccio <email address hidden>
Date: Thu Sep 10 21:54:33 2015 -0700

    Tweak test_keepalived_respawns test logic

    This test initial design is problematic: it spawns keepalived,
    it asserts the process is up, then it attempts to kill it.

    However, this is when problems may arise:

    a) it does so by using the disable method on the process - we
       should be more rude than that if we want to simulate a crash!

    b) keepalived may be forking while it is starting and it is
       possible that for a moment the ppid changes and the process
       owner invoking the kill has no rights to kill the spawned
       process. This is the most plausible explaination I could find
       as to why kill returns 1 with no standard error

    c) it does not verify that the process has indeed disappeared
       (what if the pm.disable didn't work?) - this means that the
       test can pass, and yet the monitor may not work.

    Bottom line: this test relied on the correctness of the very code
    that was meant to validate...and that's not cool. To this aim, we
    wait for the process to be active, kill the process with a kill -9
    and verify that the process after the kill is indeed different.

    Closes-bug: #1490043

    Change-Id: Idaf419a1464d9d0d75b9106a7acd5cd960a7c623

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224334

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224357

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)
Download full text (73.6 KiB)

Reviewed: https://review.openstack.org/224357
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fdc3431ccd219accf6a795079d9b67b8656eed8e
Submitter: Jenkins
Branch: feature/pecan

commit fe236bdaadb949661a0bfb9b62ddbe432b4cf5f1
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Sep 3 15:40:12 2015 +0200

    No network devices on network attached qos policies

    Network devices, like internal router legs, or dhcp ports
    should not be affected by bandwidth limiting rules.

    This patch disables application of network attached policies
    to network/neutron owned ports.

    Closes-bug: #1486039
    DocImpact

    Change-Id: I75d80227f1e6c4b3f5fa7762b8dc3b0c0f1abd46

commit db4a06f7caa20a4c7879b58b20e95b223ed8eeaf
Author: Ken'ichi Ohmichi <email address hidden>
Date: Wed Sep 16 10:04:32 2015 +0000

    Use tempest-lib's token_client

    Now tempest-lib provides token_client modules as library and the
    interface is stable. So neutron repogitory doesn't need to contain
    these modules.
    This patch makes neutron use tempest-lib's token_client and removes
    the own modules for the maintenance.

    Change-Id: Ieff7eb003f6e8257d83368dbc80e332aa66a156c

commit 78aed58edbe6eb8a71339c7add491fe9de9a0546
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 13 09:08:20 2015 +0000

    Fix establishing UDP connection

    Previously, in establish_connection() for UDP protocol data were sent
    but never read on peer socket. That lead to successful read on peer side
    if this connection was filtered. Having constant testing string masked
    this issue as we can't distinguish to which test of connectivity data
    belong.

    This patch makes unique data string per test_connectivity() and
    also makes establish_connection() to create an ASSURED entry in
    conntrack table. Finally, in last test after firewall filter was
    removed, connection is re-established in order to avoid troubles with
    terminated processes or TCP continuing sending packets which weren't
    successfully delivered.

    Closes-Bug: 1478847
    Change-Id: I2920d587d8df8d96dc1c752c28f48ba495f3cf0f

commit e6292fcdd6262434a7b713ad8802db6bc8a6d3dc
Author: YAMAMOTO Takashi <email address hidden>
Date: Wed Sep 16 13:20:51 2015 +0900

    ovsdb: Fix a few docstring

    Change-Id: I53e1e21655b28fe5da60e58aeeb7cbbd103ae014

commit c22949a4449d96a67caa616290cf76b67b182917
Author: fumihiko kakuma <email address hidden>
Date: Wed Sep 16 11:52:59 2015 +0900

    Remove requirements.txt for the ofagent mechanism driver

    It is no longer used.

    Related-Blueprint: core-vendor-decomposition
    https://blueprints.launchpad.net/neutron/+spec/core-vendor-decomposition

    Change-Id: Ib31fb3febf8968e50d86dd66e1e6e1ea2313f8ac

commit d1d4de19d85f961d388c91e70f31b3bafec418c5
Author: Kevin Benton <email address hidden>
Date: Thu Sep 3 20:25:57 2015 -0700

    Always return iterables in L3 get_candidates

    The caller of this function expects iterables.

    Closes-Bug: #1494996
    Change-Id: I3d103e63f4e127a77268502415c0ddb0d804b54a

commit 1ad6ac448067306...

tags: added: in-feature-pecan
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Doug Wiegley (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/224334

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-rc1 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.