ovs quantum agent appears to stop responding after some time

Bug #1044135 reported by Naveen Joy
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Gary Kotton

Bug Description

Soon after the agent process starts (after about 3-5 minutes), the OVS quantum agent appears to stop responding. The agent was also not logging to the log_file configured in its .ini file, so I ran the agent from the command line and captured its output up to the point where it appeared to hang. VM ports created after this point are not bound to their VLAN. Please find the data attached. Let me know if you need anything else.

Naveen Joy (najoy) wrote :
dan wendlandt (danwent) wrote :

Was noticing something like this while testing last night, but hadn't
gotten around to determining if it was in master or just the branch I
was testing.

Are you seeing this even if rpc = False? I noticed this only when my
settings got reset to make rpc = True, though I haven't investigated
it enough to know whether the rpc flag is just a coincidence.

Another possible correlation I've noticed is that it occurs when I use
a script to spin up multiple VMs in succession, possibly suggesting
some kind of race/timing issue. Does that match your use case?

P.S. Moving this to 'confirmed', as I've seen something like it as well, and so think it's worthy of investigation.

Changed in quantum:
status: New → Confirmed
dan wendlandt (danwent) wrote :

Note: I don't recall seeing log issues, but I was just looking at console in devstack. I did use sudo ovs-vsctl list Port to confirm that the port was not correctly placed on a VLAN.
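The check above can be scripted. The following is a minimal sketch, not part of the agent: it parses the text printed by `ovs-vsctl list Port` and returns the names of ports whose `tag` column is empty (`[]`). The function name `untagged_ports` and the sample records are assumptions made for illustration, and the sample is trimmed to the two columns the function reads.

```python
import re


def untagged_ports(list_port_output):
    """Return names of Port records whose tag column is empty ([])."""
    names = []
    # ovs-vsctl prints one record per port, separated by blank lines.
    for record in re.split(r"\n\s*\n", list_port_output.strip()):
        fields = {}
        for line in record.splitlines():
            # Each line looks like "column : value"; split on the first colon only.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("tag") == "[]":
            names.append(fields.get("name", "").strip('"'))
    return names


# Illustrative sample (hypothetical values), trimmed to the relevant columns.
SAMPLE = """\
name                : "tap6d6f30fc-77"
tag                 : []

name                : "tap010d6c10-8e"
tag                 : 1
"""

print(untagged_ports(SAMPLE))  # ['tap6d6f30fc-77']
```

Some ports may legitimately carry no tag, so in practice the result would probably be filtered down to tap devices before drawing conclusions.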

Note: there's a related minor bug that I've been intending to file. I believe this bug is specific to the RPC code. If there are old tap devices on br-int when the q-agt is restarted, and quantum no longer knows about these ports, those old ports are left there on their original VLANs. It seems much safer to put any port not known to quantum on the dead VLAN.
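A rough sketch of the "dead VLAN" idea follows. It only builds the `ovs-vsctl set Port ... tag=...` commands for ports that quantum does not know about; it does not run them. The helper name `dead_vlan_commands`, the port names, and the choice of 4095 as the dead VLAN tag are assumptions for illustration, not a description of what the agent actually does.

```python
# Assumption: 4095 is used as the tag for isolating unknown ports.
DEAD_VLAN_TAG = "4095"


def dead_vlan_commands(ports_on_bridge, known_to_quantum):
    """Build one ovs-vsctl command per port that quantum does not know about."""
    known = set(known_to_quantum)
    return [
        ["ovs-vsctl", "--timeout=2", "set", "Port", port, "tag=" + DEAD_VLAN_TAG]
        for port in ports_on_bridge
        if port not in known
    ]


# Hypothetical port names: one stale tap device, one still known to quantum.
for cmd in dead_vlan_commands(["tap-stale", "tap-known"], ["tap-known"]):
    print(" ".join(cmd))
```

Returning the commands as lists keeps the decision (which ports are unknown) separate from execution, so the selection logic can be checked without touching a real bridge.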

dan wendlandt (danwent) wrote :

garyk, could you take a look at this sometime friday your time?

Changed in quantum:
importance: Undecided → Critical
milestone: none → folsom-rc1
Naveen Joy (najoy) wrote :

This issue happens only when rpc = True, and it happens consistently. When rpc = False, everything is fine. Also, I am just spinning up a single VM through the horizon interface without using any scripts, so there should not be any race conditions.

Aaron Rosen (arosen) wrote :

It's odd to me that in your log file the last thing it printed was:

2012-08-30 16:49:03 DEBUG [quantum.agent.linux.utils] Running command: sudo ovs-vsctl --timeout=2 get Interface tap6d6f30fc-77 external_ids

(I would expect to see something like this after that:
2012-08-30 16:49:03 DEBUG [quantum.agent.linux.utils]
Command: ['sudo', 'ovs-vsctl', '--timeout=2', 'get', 'Interface', 'tap010d6c10-8e', 'external_ids']
Exit code: 0
Stdout: '{attached-mac="fa:16:3e:c9:98:20", iface-id="010d6c10-8ec9-438a-a02f-fab9eb333f01", iface-status=active}\n')

I'd be surprised if that command is actually hanging. If you can reproduce this, can you run ps -eaf | grep ovs-vsctl when it happens to see if the ovs-vsctl command is hanging?

If the agent stops working, ports will still attach to the bridge but won't be placed on a VLAN.
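The `ps -eaf | grep ovs-vsctl` check can also be done from a small script, which makes it easier to repeat while trying to reproduce the hang. This is a minimal sketch: the function names are hypothetical, and the sample `ps` lines (user, PIDs, terminal) are made up for illustration.

```python
import subprocess


def find_ovs_vsctl_processes(ps_output):
    """Return the lines of `ps -eaf` output that show an ovs-vsctl process."""
    return [
        line
        for line in ps_output.splitlines()
        # Skip the grep process itself if the output came from a shell pipeline.
        if "ovs-vsctl" in line and "grep" not in line
    ]


def running_ovs_vsctl():
    """Run `ps -eaf` and return the lines for any live ovs-vsctl process."""
    result = subprocess.run(
        ["ps", "-eaf"], capture_output=True, text=True, check=True
    )
    return find_ovs_vsctl_processes(result.stdout)


# Hypothetical ps output for illustration only.
SAMPLE_PS = """\
root      4242  4100  0 16:49 pts/1    00:00:00 ovs-vsctl --timeout=2 get Interface tap6d6f30fc-77 external_ids
user      4250  4200  0 16:50 pts/2    00:00:00 grep ovs-vsctl
"""

for line in find_ovs_vsctl_processes(SAMPLE_PS):
    print(line)
```

If the same ovs-vsctl line shows up across several checks a few seconds apart, despite `--timeout=2`, that would point at the command itself hanging rather than the agent.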

dan wendlandt (danwent) wrote :

garyk, assigning to you in case you have a chance to look at this overnight. Arosen will also help look at this. thx.

Changed in quantum:
assignee: nobody → Gary Kotton (garyk)
Gary Kotton (garyk) wrote :

Hi,
I have managed to reproduce something. I deployed 10 VMs. One of the VMs was not assigned a tag.
I am investigating.
Thanks
Gary

Gary Kotton (garyk) wrote :

Hi Naveen,
I am finding it difficult to reproduce. I have a few questions:
1. Are you using devstack or a local installation?
2. The configuration file you are using looks a bit outdated (a number of parameters have changed over the last few days). Can you update it?
3. Which Linux flavor are you using?
4. Are you logging to a file or just the screen?
Thanks
Gary

Gary Kotton (garyk) wrote :

Hi,
Regarding the log file: this is now done via the quantum.conf file. Please look at https://docs.google.com/document/d/1EDLXQbuVWJ4MgTOC93WbyoDPWG4nnSD75PvSsxuuGuo/edit to see the configuration options.
Thanks
Gary

OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/12297

Changed in quantum:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/12297
Committed: http://github.com/openstack/quantum/commit/efa4316f648201c35c2175d6065b57abb8c8a7dc
Submitter: Jenkins
Branch: master

commit efa4316f648201c35c2175d6065b57abb8c8a7dc
Author: Gary Kotton <email address hidden>
Date: Sun Sep 2 11:15:32 2012 -0400

    Fixes agent problem with RPC

    Fixed bug 1044135

    When quantum-openvswitch-agent or quantum-linuxbridge-agent were
    called the eventlet monkey patch was not invoked.

    Change-Id: Iafb7fd02d37415c3466213d28280bcb4573de4a8
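
For readers unfamiliar with the fix, the idea of invoking the eventlet monkey patch at agent start-up can be sketched as below. This is an illustration under stated assumptions, not the contents of the commit: `patch_stdlib` and `main` are hypothetical names, and the guarded import exists only so the sketch still runs where eventlet is not installed (a real agent entry point would presumably import eventlet unconditionally).

```python
try:
    import eventlet
except ImportError:  # keeps the sketch runnable where eventlet is missing
    eventlet = None


def patch_stdlib():
    """Apply eventlet's monkey patch; return True if it was applied."""
    if eventlet is None:
        return False
    # Swaps blocking stdlib modules (socket, time, ...) for cooperative
    # versions, so a blocking call yields to other greenthreads instead
    # of stalling the whole process.
    eventlet.monkey_patch()
    return True


def main():
    # Patch first, before any I/O is started.
    patched = patch_stdlib()
    print("eventlet monkey patch applied:", patched)
    # ... a real agent would start its polling loop and RPC consumers here ...


if __name__ == "__main__":
    main()
```

The ordering is the point of the sketch: patching before the RPC and polling code starts means the code that runs afterwards uses the cooperative versions of the standard library modules.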

Changed in quantum:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in quantum:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in quantum:
milestone: folsom-rc1 → 2012.2