Functional tests failing test_restart_l3_agent_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

Bug #1518921 reported by Ihar Hrachyshka
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Sreekumar S

Bug Description

Example: http://logs.openstack.org/89/244089/4/check/gate-neutron-dsvm-functional/1ebb083/testr_results.html.gz

Example log:

ft1.123: neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup_StringException: Empty attachments:
  pythonlogging:''
  pythonlogging:'neutron.api.extensions'
  stderr
  stdout

Traceback (most recent call last):
  File "neutron/tests/functional/agent/test_l3_agent.py", line 943, in test_restart_l3_agent_on_sighup
    workers=1)
  File "neutron/tests/functional/test_server.py", line 158, in _test_restart_service_on_sighup
    'size': expected_size}))
  File "neutron/agent/linux/utils.py", line 324, in wait_until_true
    eventlet.sleep(sleep)
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 34, in sleep
    hub.switch()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
RuntimeError: Timed out waiting for file /tmp/tmpsoYARR/tmpzLkipc/test_server.tmp to be created and its size become equal to 10.

Kibana: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22to%20be%20created%20and%20its%20size%20become%20equal%20%5C%22

243 hits in 7 days.

Changed in neutron:
importance: Undecided → High
tags: added: gate-failure
tags: added: functional-tests
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

OK, it's not that critical once you filter out -py34 for build_name field. (4 hits in 7 days).

Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

I think it might be the reason that the wait time is not long enough. There is no waiting for the restart to complete, just 5 seconds. IMO, the easiest solution is to enlarge the number at [1], but I don't have strong proof for this result.

[1] https://github.com/openstack/neutron/blob/master/neutron/tests/functional/test_server.py#L153

Revision history for this message
Elena Ezhova (eezhova) wrote :

The problem might be in the timeout and it is worth checking. But I'm also afraid that there can be some other reason why the message is not written into the file of which I don't know. I ran the l3 agent test suite locally about 300 times in a row and test_restart_l3_agent_on_sighup never failed.
I'm going to upload a change that increases the wait timeout and try rechecking it. However it's difficult to say. how many times should the test pass to be considered stable - 20, 50, 100?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/250234

Changed in neutron:
status: New → Confirmed
Revision history for this message
Elena Ezhova (eezhova) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

Increasing the timeout doesn't help http://logs.openstack.org/34/250234/1/check/gate-neutron-dsvm-functional/9569e41/testr_results.html.gz . The problem must be in something else.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Elena Ezhova (<email address hidden>) on branch: master
Review: https://review.openstack.org/250234

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/251318

Changed in neutron:
assignee: nobody → Hong Hui Xiao (xiaohhui)
status: Confirmed → In Progress
Revision history for this message
Hong Hui Xiao (xiaohhui) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

Based on Elena Ezhova's effort, I looked more into the code today, and submit a patch about the investigation. Will add more information if the test in review board success.

Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

From [1], it looks like time is not the problem(which comment #5 said :-( ), will investigate more about it...

[1] http://logs.openstack.org/18/251318/2/check/gate-neutron-dsvm-functional/2f1dbcf/testr_results.html.gz

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Hong Hui Xiao (<email address hidden>) on branch: master
Review: https://review.openstack.org/251318
Reason: Can resolve the problem, abandon it.

Revision history for this message
Sreekumar S (sreesiv) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

2015-12-02 05:29:53.209 | 2015-12-02 05:29:53.160 |
2015-12-02 05:29:53.210 | 2015-12-02 05:29:53.161 | ==============================
2015-12-02 05:29:53.212 | 2015-12-02 05:29:53.163 | Failed 1 tests - output below:
2015-12-02 05:29:53.213 | 2015-12-02 05:29:53.164 | ==============================
2015-12-02 05:29:53.215 | 2015-12-02 05:29:53.166 |
2015-12-02 05:29:53.248 | 2015-12-02 05:29:53.168 | neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup
2015-12-02 05:29:53.248 | 2015-12-02 05:29:53.169 | -----------------------------------------------------------------------------------------------
2015-12-02 05:29:53.249 | 2015-12-02 05:29:53.171 |
2015-12-02 05:29:53.249 | 2015-12-02 05:29:53.172 | Captured traceback:
2015-12-02 05:29:53.249 | 2015-12-02 05:29:53.173 | ~~~~~~~~~~~~~~~~~~~
2015-12-02 05:29:53.249 | 2015-12-02 05:29:53.175 | Traceback (most recent call last):
2015-12-02 05:29:53.249 | 2015-12-02 05:29:53.176 | File "neutron/tests/functional/agent/test_l3_agent.py", line 37, in test_restart_l3_agent_on_sighup
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.177 | workers=1)
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.179 | File "neutron/tests/functional/test_server.py", line 158, in _test_restart_service_on_sighup
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.180 | 'size': expected_size}))
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.181 | File "neutron/agent/linux/utils.py", line 326, in wait_until_true
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.182 | eventlet.sleep(sleep)
2015-12-02 05:29:53.250 | 2015-12-02 05:29:53.184 | File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 34, in sleep
2015-12-02 05:29:53.251 | 2015-12-02 05:29:53.185 | hub.switch()
2015-12-02 05:29:53.251 | 2015-12-02 05:29:53.186 | File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2015-12-02 05:29:53.251 | 2015-12-02 05:29:53.187 | return self.greenlet.switch()
2015-12-02 05:29:53.251 | 2015-12-02 05:29:53.188 | RuntimeError: Timed out waiting for file /tmp/tmp2Xmf33/tmpd3rEn3/test_server.tmp to be created and its size become equal to 10.

Fails for me as well. Somehow I am unable to restart the job too. 'check pipeline' seems not doing anything!

Revision history for this message
Sreekumar S (sreesiv) wrote :

This is intermittent and for some patches alone.

For the same patch this test had passes once but that time the run failed with Bug #1521815. Again when I restarted the run, it succeeded for tests mentioned in Bug #1521815, but failed with test_restart_l3_agent_on_sighup().

Somehow as per comments in both bug reports, these are not reproduced locally.

Revision history for this message
Sreekumar S (sreesiv) wrote :

Here's the patch https://review.openstack.org/#/c/251940/ where the tests are failing intermittently.

Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

I spend lots of time on investigating this bug.

The possible reason for the bug:
1) restart time is too long, I can simulate the error in description by adding delay in neutron.service.Service.stop. But from the log[1~4], there is no log "Caught SIGHUP, exiting", which means the sighup singal is not caught by process. So, the it is not the root cause.

2) oslo_service haven't finish initialization, and the sighup is sent out. So that the code can't handle sighup properly. So I made up [5], to make sure sighup is sent out after oslo_service finishing initialization. But I failed, the log it at [4]

3) problem in oslo_service.service.ServiceLauncher._handle_signal. For unknown reason, the sighup signal is caught by the process, but the exception is not caught at [6]. There are 4 restart functional test in total, but only test_restart_l3_agent_on_sighup will fail intermittently. The biggest difference of test_restart_l3_agent_on_sighup is that it is using oslo_service.service.ServiceLauncher, others are using oslo_service.service.ProcessLauncher. I am verifying it at [7]
.

[1] http://logs.openstack.org/18/251318/2/check/gate-neutron-dsvm-functional/2f1dbcf/logs/neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup.log.txt.gz
[2] http://logs.openstack.org/34/250234/1/check/gate-neutron-dsvm-functional/9569e41/logs/neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup.log.txt.gz
[3] http://logs.openstack.org/89/244089/4/check/gate-neutron-dsvm-functional/1ebb083/logs/neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup.log.txt.gz
[4] http://logs.openstack.org/18/251318/3/check/gate-neutron-dsvm-functional/11f0801/logs/neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup.log.txt.gz
[5] https://review.openstack.org/#/c/251318/3/ (the 3rd patchset)
[6] https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L263
[7] https://review.openstack.org/#/c/251318/4/ (the 4th patchset)

Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

Based on the investigation, I think it is a bug in oslo, I have reported it in [1]. More details can be found at the bug description in [1]. Although [2] can eliminate the intermittent failure of test case, it should not be merged, because neutron-l3-agent actually uses ServiceLauncher. However, [2] is just to verify that ProcessLauncher will not have problem in the same circumstance. Merging [2] is non-sense.

Moreover, I doubt the meaning of the test in neutron a little, "neutron.service.Service.start" is mocked, the test case here is more like a test case for ServiceLauncher in oslo_service, to test if it can accept SIGHUP and restart services.

Anyway, anyone else is interested in this bug, feel free to participate in.

[1] https://bugs.launchpad.net/oslo.service/+bug/1523333
[2] https://review.openstack.org/#/c/251318/4/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Hong Hui Xiao (<email address hidden>) on branch: master
Review: https://review.openstack.org/251318
Reason: More details can be found at comments in [1], this patch should not be merged, abandon it.

[1] https://bugs.launchpad.net/neutron/+bug/1518921

Changed in neutron:
status: In Progress → Invalid
Revision history for this message
Sreekumar S (sreesiv) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

It's still failing...

2015-12-10 06:49:39.154 | 2015-12-10 06:49:39.107 | ==============================
2015-12-10 06:49:39.157 | 2015-12-10 06:49:39.110 | Failed 1 tests - output below:
2015-12-10 06:49:39.160 | 2015-12-10 06:49:39.113 | ==============================
2015-12-10 06:49:39.163 | 2015-12-10 06:49:39.116 |
2015-12-10 06:49:39.200 | 2015-12-10 06:49:39.119 | neutron.tests.functional.agent.test_l3_agent.TestL3AgentRestart.test_restart_l3_agent_on_sighup
2015-12-10 06:49:39.200 | 2015-12-10 06:49:39.122 | -----------------------------------------------------------------------------------------------
2015-12-10 06:49:39.200 | 2015-12-10 06:49:39.124 |
2015-12-10 06:49:39.200 | 2015-12-10 06:49:39.127 | Captured traceback:
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.130 | ~~~~~~~~~~~~~~~~~~~
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.133 | Traceback (most recent call last):
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.135 | File "neutron/tests/functional/agent/test_l3_agent.py", line 37, in test_restart_l3_agent_on_sighup
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.137 | workers=1)
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.140 | File "neutron/tests/functional/test_server.py", line 158, in _test_restart_service_on_sighup
2015-12-10 06:49:39.201 | 2015-12-10 06:49:39.142 | 'size': expected_size}))
2015-12-10 06:49:39.202 | 2015-12-10 06:49:39.143 | File "neutron/agent/linux/utils.py", line 326, in wait_until_true
2015-12-10 06:49:39.202 | 2015-12-10 06:49:39.145 | eventlet.sleep(sleep)
2015-12-10 06:49:39.202 | 2015-12-10 06:49:39.147 | File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 34, in sleep
2015-12-10 06:49:39.202 | 2015-12-10 06:49:39.149 | hub.switch()
2015-12-10 06:49:39.202 | 2015-12-10 06:49:39.150 | File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2015-12-10 06:49:39.203 | 2015-12-10 06:49:39.152 | return self.greenlet.switch()
2015-12-10 06:49:39.209 | 2015-12-10 06:49:39.161 | RuntimeError: Timed out waiting for file /tmp/tmpLx6WjW/tmptxIhtr/test_server.tmp to be created and its size become equal to 10.

https://review.openstack.org/#/c/251940/
May work if rechecked, but the issue still happens...

Changed in neutron:
status: Invalid → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/255920

Changed in neutron:
assignee: Hong Hui Xiao (xiaohhui) → Sreekumar S (sreesiv)
status: Confirmed → In Progress
Revision history for this message
Sreekumar S (sreesiv) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

This is just to test out the runs on the server.
@xiaohhui, have you tried changing this before. Please see it is valid and works well.

This is just based on my doubts, will abandon if it doesn't work.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Sreekumar S (<email address hidden>) on branch: master
Review: https://review.openstack.org/255920
Reason: Nope, that's not it. Still fails with the same issue as can be seen from
http://logs.openstack.org/20/255920/1/check/gate-neutron-dsvm-functional/6bf11c8/console.html

I think there needs to be some check to see when neutron-l3 agent starts up properly, just as it is done with wsgi server. The hard coded sleep scheme won't work, especially when there is load on the server. At least it may be indicative to some trouble with the agent service startup if we can cap it to 25 secs or so, i.e. with some mechanism to check its up state/proper startup.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

Can't see it show up lately.

Changed in neutron:
status: In Progress → Incomplete
assignee: Sreekumar S (sreesiv) → nobody
Revision history for this message
Hong Hui Xiao (xiaohhui) wrote :

All my investigation are recorded at comment #15 and #16, I think it is a oslo bug. If it bothers neutron a lot, I would recommend skip this test.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/257947

Changed in neutron:
assignee: nobody → Sreekumar S (sreesiv)
status: Incomplete → In Progress
Revision history for this message
Sreekumar S (sreesiv) wrote : Re: Functional tests failing inside _test_restart_service_on_sighup with "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size become equal to ZZZ"

I am able to reproducible this locally by setting 'max pending signals' of the shell using ulimit to a low value, like 10.
`ulimit -i 10`
and then running 'test_restart_l3_agent_on_sighup' repeatedly couple of time, like 30 or so. If the delays are low, and the margin between 'ulimit -i' and no of concurrent runs are high, then the test will fail with timeout.
 So I've kept a high value for the check active delay and wait. Also I am increasing the ulimit -i value to 'unlimited' temporarily using 'subprocess.call'.

This patch is just to check the run success rate on the test server. If the test is passing consistently over the coming days then the 'ulimit -i unlimited' has to be set permanently on the server via configuration.

If this one also fails, will abandon the patch & abort the mission. :-)

Revision history for this message
Sreekumar S (sreesiv) wrote :

This patch will be abandoned in any case (success or failure) as the fix will be to configure the server.

May be we need to adjust the check active and wait delays to slightly higher values, but that can be done once the server is configured and runs are checked.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
Sreekumar S (sreesiv) wrote :

The same test for my patch is passing consistently. The setting call I'm making is session specific since this is a shell primitive.

@Armando what's the procedure to request for setting the ulimit of 'max pending signals' of the test server to 'unlimited' permanently in the configuration.
I believe that will resolve the issue and will prevent intermittent failures. It happens with heavy load and missing signals.

Once the configuration is done, I'll remove the call to `ulimit -i unlimited` from my patch, and will check the runs again. Meanwhile we'll keep an eye on whether the same test is failing for other patches. That'll indicate whether we need to tune the check active and wait delays.

Assaf Muller (amuller)
summary: - Functional tests failing inside _test_restart_service_on_sighup with
- "Timed out waiting for file /tmp/XXX/YYY/test_server.tmp to be created
- and its size become equal to ZZZ"
+ Functional tests failing test_restart_l3_agent_on_sighup with "Timed out
+ waiting for file /tmp/XXX/YYY/test_server.tmp to be created and its size
+ become equal to ZZZ"
Revision history for this message
Assaf Muller (amuller) wrote :

After looking at the test and its scope, it seems like 100% of the code it tests is owned by oslo.service. That library has its own tests, is there value in keeping the test in Neutron?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/260236

Changed in neutron:
assignee: Sreekumar S (sreesiv) → Assaf Muller (amuller)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I am in favor of the removal of this test from Neutron. [1] was another, and I also felt there too that we were aggressively testing against oslo.service internals.

[1] https://bugs.launchpad.net/neutron/+bug/1505438

Revision history for this message
Sreekumar S (sreesiv) wrote :

I agree with moving this test. But the server max pending signals configuration (ulimit -i unlimited) has to be done IMO, otherwise with load there are chances that it'll fail there too. It is simply a load issue, and there is no bug with oslo framework.

I think Bug #1523333 is invalid. But more tests may be done there.

Changed in neutron:
assignee: Assaf Muller (amuller) → Sreekumar S (sreesiv)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Sreekumar S (<email address hidden>) on branch: master
Review: https://review.openstack.org/257947
Reason: All agreed to kill the functional test with patch https://review.openstack.org/#/c/260236/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/260236
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5086a0ba7a5b5f08492141f346be93b11112ebad
Submitter: Jenkins
Branch: master

commit 5086a0ba7a5b5f08492141f346be93b11112ebad
Author: Assaf Muller <email address hidden>
Date: Mon Dec 21 17:50:36 2015 -0500

    Delete test_restart_l3_agent_on_sighup

    After looking at the test and its scope, it seems like 100% of the
    code it tests is owned by oslo.service. That library has its own tests,
    is there value in keeping the test in Neutron?

    Change-Id: I0d8a22cc8127e183c1f55790f4b04dea2d30d356
    Closes-Bug: #1518921

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b2

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.