DVR: dvr router ns should not exist in scheduled DHCP agent nodes

Bug #1609217 reported by LIU Yulong
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | High | LIU Yulong |

Bug Description

ENV:
stable/mitaka
hosts:
compute1 (nova-compute, l3-agent (dvr), metadata-agent)
compute2 (nova-compute, l3-agent (dvr), metadata-agent)
network1 (l3-agent (dvr_snat), metadata-agent, dhcp-agent)
network2 (l3-agent (dvr_snat), metadata-agent, dhcp-agent)

How to reproduce? (scenario 1)
set: dhcp_agents_per_network = 2

1. Create a DVR router:
neutron router-create --ha False --distributed True test1

2. Create a network & subnet with dhcp enabled.
neutron net-create test1
neutron subnet-create --enable-dhcp test1 --name test1 192.168.190.0/24

3. Attach the router and subnet
neutron router-interface-add test1 subnet=test1

Router test1 then exists on both network1 and network2, but the routerl3agentbindings table in the DB has only one record, which binds the DVR router to a single l3 agent.

http://paste.openstack.org/show/547695/

Scenario 2:
Change the network2 node deployment to run only metadata-agent and dhcp-agent.
The qdhcp namespace and the VM can still ping each other.
So the qrouter namespace on the unbound network node is not used and should not exist.

Code:
The root cause may be that the DHCP port is considered in the DVR host query, which it should not be.
https://github.com/openstack/neutron/blob/master/neutron/common/utils.py#L258
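
To make the suspected cause concrete, here is a minimal Python sketch of the kind of device_owner check the linked line belongs to. It is not the neutron source: the constant values and the function name `is_dvr_serviced_sketch` are assumptions made for illustration. The point is only that, once the DHCP owner is in the serviced set, any host with a bound DHCP port is treated as needing the DVR router.

```python
# Illustrative sketch only. The names and values below are assumptions,
# not copied from neutron/common/utils.py.
COMPUTE_PREFIX = "compute:"
DHCP_OWNER = "network:dhcp"
LOADBALANCER_OWNER = "neutron:LOADBALANCER"

# Non-compute owners that are treated as "serviced by DVR".
# Having DHCP_OWNER in this tuple is what this report questions.
OTHER_SERVICED_OWNERS = (LOADBALANCER_OWNER, DHCP_OWNER)


def is_dvr_serviced_sketch(device_owner):
    """Return True if a port with this owner pulls a DVR router onto its host."""
    return (device_owner.startswith(COMPUTE_PREFIX)
            or device_owner in OTHER_SERVICED_OWNERS)


print(is_dvr_serviced_sketch("compute:nova"))
print(is_dvr_serviced_sketch("network:dhcp"))
print(is_dvr_serviced_sketch("network:router_gateway"))
```

In this sketch, dropping `DHCP_OWNER` from the tuple is the equivalent of the one-line removal being proposed.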

LIU Yulong (dragon889)
description: updated
description: updated
LIU Yulong (dragon889)
description: updated
description: updated
LIU Yulong (dragon889)
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
LIU Yulong (dragon889)
Changed in neutron:
assignee: LIU Yulong (dragon889) → nobody
description: updated
LIU Yulong (dragon889)
tags: added: l3-dvr-backlog
tags: removed: l3-dvr-backlog
Revision history for this message
Brian Haley (brian-haley) wrote :

I do not think this is a bug, it is expected behavior.

On the first network node, the dvr_snat router was scheduled and created - so you should see both qrouter- and snat- namespaces. It will handle all routing and snat for instances.

On the other network node, the DHCP agent was scheduled. In that case, a DVR router is required for the dhcp server to be able to route, so a qrouter- was created for that purpose.

If you moved the dhcp agent to the other network node you should see that qrouter- namespace get deleted. Can you try that and confirm it works?

Revision history for this message
LIU Yulong (dragon889) wrote :

@Brian, hi.
So if the qrouter-ns is needed,
why does the scenario 2 test show that the qdhcp-ns and the VM can reach each other?

Revision history for this message
Brian Haley (brian-haley) wrote :

In scenario 2 I'm assuming the VM and dhcp are in the same subnet? If you created a VM in another subnet then you need the qrouter-ns in order to ping it - East/West routing.

Revision history for this message
LIU Yulong (dragon889) wrote :

I don't think this case needs East/West routing.
In the qdhcp-ns, the tap device has an IP address in every subnet, so it is directly connected to both subnets. For instance:

# Network1
[root@network1 neutron]# ip netns exec qdhcp-9826c8f0-3269-4546-a333-49d31046edcd ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
99: tapb41b71a9-d8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:ae:0b:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.199.2/24 brd 192.168.199.255 scope global tapb41b71a9-d8
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tapb41b71a9-d8
       valid_lft forever preferred_lft forever
    inet 192.168.99.3/24 brd 192.168.99.255 scope global tapb41b71a9-d8
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feae:bda/64 scope link
       valid_lft forever preferred_lft forever

# Network2
[root@network2 neutron]# ip netns exec qdhcp-9826c8f0-3269-4546-a333-49d31046edcd ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
149: tap0e943ce1-d8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:de:96:e7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.199.7/24 brd 192.168.199.255 scope global tap0e943ce1-d8
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tap0e943ce1-d8
       valid_lft forever preferred_lft forever
    inet 192.168.99.2/24 brd 192.168.99.255 scope global tap0e943ce1-d8
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fede:96e7/64 scope link
       valid_lft forever preferred_lft forever
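
The "directly connected" argument can be checked with a small Python sketch using only the standard `ipaddress` module. The two interface addresses are the ones shown on the network2 tap device above; the VM addresses and the helper name `is_on_link` are hypothetical and chosen only for illustration.

```python
import ipaddress

# Addresses configured on the qdhcp tap device on network2 (from the output above).
tap_interfaces = [
    ipaddress.ip_interface("192.168.199.7/24"),
    ipaddress.ip_interface("192.168.99.2/24"),
]


def is_on_link(target, interfaces):
    """Return True if target falls inside a subnet of any given interface."""
    addr = ipaddress.ip_address(target)
    return any(addr in iface.network for iface in interfaces)


# Hypothetical VM addresses, one per subnet, plus one unrelated address.
print(is_on_link("192.168.199.10", tap_interfaces))
print(is_on_link("192.168.99.20", tap_interfaces))
print(is_on_link("10.0.0.5", tap_interfaces))
```

An address in either subnet is on-link for the tap device, so reaching it from the qdhcp namespace needs no router hop; only an address outside both subnets would.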

[yulong@controller ~]$ neutron port-show b41b71a9-d88d-42b5-bd35-554baf916a40
+-----------------------+--------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:host_id | network1 |
| binding:profile | {} |
| binding:vif_details | {"port_filter": true, "ovs_hybrid_plug": true} |
| binding:vif_type | ovs ...


Revision history for this message
LIU Yulong (dragon889) wrote :

Let's dig into this more specifically.
In comment 4 here, https://bugs.launchpad.net/neutron/+bug/1609217/comments/4,
I tested a DVR router connected to ONE network with TWO different subnets. The traffic is directly reachable.

Today I tested the scenario with TWO networks connected to ONE DVR router.
I also don't think the DVR router should exist on every DHCP agent node of a network,
because when a VM boots, its DHCP request should and must only be sent to the DHCP agents of its own network.
That does not need East/West routing either.

As an experiment, I removed the following line:
https://github.com/openstack/neutron/blob/master/neutron/common/utils.py#L258
Then the DVR router no longer exists on every DHCP agent node of the network,
and the VM can still get an IP from the network's DHCP agent.
With all the subnets connected to the DVR router, the VM can reach all subnets.
But I don't know whether such a change has other side effects.

Does anyone have any ideas? Please correct me if I got something wrong.

tags: added: l3-dvr-backlog
Changed in neutron:
status: New → Incomplete
Revision history for this message
LIU Yulong (dragon889) wrote :

@Armando Migliaccio (armando-migliaccio),
Hi, could you please explain why this bug was marked as incomplete?
Thank you very much.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Based on Brian's comment #1 and the follow-up comments, I assumed we don't have enough information to reproduce and confirm this is an issue.

LIU Yulong (dragon889)
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: Incomplete → In Progress
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I agree with Brian; this is as per design. The database will show the l3agentbinding for the router as a single entry. The corresponding routers are then created by the agent based on the functionality.
Since you have the dhcp agent there, the router is kept on that node: the dhcp port is considered a DVR service port, and that is the reason you see the router namespace there.

Revision history for this message
LIU Yulong (dragon889) wrote :

The DHCP port does not need East/West routing, and DHCP request traffic must stay within its own broadcast domain. It is already reachable without the dvr namespace, so the dvr namespace created because of that DHCP port is totally redundant.

And after some tests, I've noticed this bug may be the fundamental cause of some DVR+HA issues.
Such as:
HA router went to the wrong host: https://bugs.launchpad.net/neutron/+bug/1597461.

Changed in neutron:
status: In Progress → Invalid
LIU Yulong (dragon889)
Changed in neutron:
status: Invalid → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/364793

Revision history for this message
Brian Haley (brian-haley) wrote : Re: DVR: dvr router should not exist in not-binded network node

The bug you mentioned in #9 was fixed by another change, so I don't think it's relevant here any more.

I will look at the patch but am still on the fence about changing the design.

LIU Yulong (dragon889)
description: updated
summary: - DVR: dvr router should not exist in not-binded network node
+ DVR: dvr router should not exist in not-binded node
LIU Yulong (dragon889)
summary: - DVR: dvr router should not exist in not-binded node
+ DVR: dvr router ns should not exist in scheduled DHCP agent nodes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/364793
Reason: Yeah, DNS blocks the way. This change should also work with DHCP agent dnsmasq_dns_servers, aka Carl's comments.

Needs more testing. Currently it only works with `--no-resolv` and user-set DNS servers.

Changed in neutron:
importance: Undecided → Wishlist
Revision history for this message
Brian Haley (brian-haley) wrote :

Is this being worked on or can we close the bug?

LIU Yulong (dragon889)
Changed in neutron:
status: In Progress → Invalid
assignee: LIU Yulong (dragon889) → nobody
Revision history for this message
LIU Yulong (dragon889) wrote :

We recently hit this issue again in a large cloud deployment, where the resource consumption on the network nodes is very serious. After a brief conversation with Brian, I decided to reopen this to gather more opinions.
My idea is to add a config option for this scale issue: if the DNS function is not needed for the subnets, the cloud deployment can disable it to reduce unnecessary resource consumption on the network nodes.

Changed in neutron:
status: Invalid → Opinion
importance: Wishlist → Medium
Revision history for this message
LIU Yulong (dragon889) wrote :

We have the following config settings now:
dnsmasq_dns_servers
https://github.com/openstack/neutron/blob/master/neutron/conf/agent/dhcp.py#L85
dnsmasq_local_resolv
https://github.com/openstack/neutron/blob/master/neutron/conf/agent/dhcp.py#L94

So a corresponding config option for the l3-plugin makes sense: if DHCP does not do DNS work, we can stop scheduling the router to its host.

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: Opinion → In Progress
Revision history for this message
Brian Haley (brian-haley) wrote :

This second one, dnsmasq_local_resolv is not an equivalent - it's deciding whether dnsmasq will look at the local /etc/resolv.conf file, not whether it provides DNS service, as it can still look up local IPs on the private subnets. In other words, by default this is False, but on a booted instance I can still:

$ nslookup 10.0.0.49
Server: 10.0.0.2
Address 1: 10.0.0.2 host-10-0-0-2.openstacklocal

Name: 10.0.0.49
Address 1: 10.0.0.49 host-10-0-0-49.openstacklocal

Revision history for this message
LIU Yulong (dragon889) wrote :

@Brian,
Maybe I'm missing something. If this is not related, we can add a new config option for dhcp to disable DNS entirely. That means users should always set a DNS nameserver for the VM instead of getting the dhcp port IP by default. Otherwise, the VM's dns config will be empty.

Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)
Revision history for this message
LIU Yulong (dragon889) wrote :

Raising the bug importance, because this issue has been open for a long time.

Changed in neutron:
importance: Medium → High
Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/364793
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8f057fb49ac637bd0dbf60ca07b89f0e4a59c7b7
Submitter: Zuul
Branch: master

commit 8f057fb49ac637bd0dbf60ca07b89f0e4a59c7b7
Author: LIU Yulong <email address hidden>
Date: Fri Aug 19 10:16:44 2016 +0800

    DVR: Ignore DHCP port during DVR host query

    For large scale deployment, the dvr router will be installed to
    the scheduled DHCP host. This will definitely increase the l3
    agent service pressure, especially in large number of concurrent
    updates, creation, or agent restart.

    This patch adds a config ``host_dvr_for_dhcp`` for the DHCP port
    device_owner filter during DVR host query. Then if we set
    ``host_dvr_for_dhcp = False``, L3-agent will not host the DVR router
    namespace in its connected networks' DHCP agent hosts.

    Closes-Bug: #1609217
    Change-Id: I53e20be9b306bf9d3b34ec6a31e3afabd5a0fd6f
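
The effect of the option described in this commit message can be sketched in Python as follows. This is a simplified model, not the merged code: the function `dvr_hosts_for_ports`, the port dictionaries, and the constant values are assumptions for illustration, and only the option name `host_dvr_for_dhcp` comes from the commit message. The sketch also ignores the dvr_snat binding, which would still place the router on one network node.

```python
# Simplified model of the DVR host query; names other than
# host_dvr_for_dhcp are hypothetical.
DHCP_OWNER = "network:dhcp"
COMPUTE_PREFIX = "compute:"


def dvr_hosts_for_ports(ports, host_dvr_for_dhcp=True):
    """Return the set of hosts that should host the DVR router namespace."""
    hosts = set()
    for port in ports:
        owner = port["device_owner"]
        if owner.startswith(COMPUTE_PREFIX):
            # VM ports always need the local DVR router.
            hosts.add(port["host"])
        elif owner == DHCP_OWNER and host_dvr_for_dhcp:
            # DHCP ports count only while the option is enabled.
            hosts.add(port["host"])
    return hosts


ports = [
    {"host": "compute1", "device_owner": "compute:nova"},
    {"host": "network1", "device_owner": "network:dhcp"},
    {"host": "network2", "device_owner": "network:dhcp"},
]

print(sorted(dvr_hosts_for_ports(ports, host_dvr_for_dhcp=True)))
print(sorted(dvr_hosts_for_ports(ports, host_dvr_for_dhcp=False)))
```

With the flag enabled, both DHCP agent hosts are included; with it disabled, only the compute host remains in this model.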

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

The fix for this has a config variable defined. I am not sure if this is a backport potential.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/700726

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/700955

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/701068

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/stein)

Change abandoned by norman shen (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/700955

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/700955
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=50fdc6505f22758e6d0d4ccf01b3c13ca2a93153
Submitter: Zuul
Branch: stable/stein

commit 50fdc6505f22758e6d0d4ccf01b3c13ca2a93153
Author: LIU Yulong <email address hidden>
Date: Fri Aug 19 10:16:44 2016 +0800

    DVR: Ignore DHCP port during DVR host query

    For large scale deployment, the dvr router will be installed to
    the scheduled DHCP host. This will definitely increase the l3
    agent service pressure, especially in large number of concurrent
    updates, creation, or agent restart.

    This patch adds a config ``host_dvr_for_dhcp`` for the DHCP port
    device_owner filter during DVR host query. Then if we set
    ``host_dvr_for_dhcp = False``, L3-agent will not host the DVR router
    namespace in its connected networks' DHCP agent hosts.

    Closes-Bug: #1609217
    Change-Id: I53e20be9b306bf9d3b34ec6a31e3afabd5a0fd6f
    (cherry picked from commit 8f057fb49ac637bd0dbf60ca07b89f0e4a59c7b7)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/701068
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cac0e385ee7c1dfef2687c1465bf07a8db757104
Submitter: Zuul
Branch: stable/queens

commit cac0e385ee7c1dfef2687c1465bf07a8db757104
Author: LIU Yulong <email address hidden>
Date: Fri Aug 19 10:16:44 2016 +0800

    DVR: Ignore DHCP port during DVR host query

    For large scale deployment, the dvr router will be installed to
    the scheduled DHCP host. This will definitely increase the l3
    agent service pressure, especially in large number of concurrent
    updates, creation, or agent restart.

    This patch adds a config ``host_dvr_for_dhcp`` for the DHCP port
    device_owner filter during DVR host query. Then if we set
    ``host_dvr_for_dhcp = False``, L3-agent will not host the DVR router
    namespace in its connected networks' DHCP agent hosts.

    Conflicts:
     neutron/db/dvr_mac_db.py

    Closes-Bug: #1609217
    Change-Id: I53e20be9b306bf9d3b34ec6a31e3afabd5a0fd6f
    (cherry picked from commit 8f057fb49ac637bd0dbf60ca07b89f0e4a59c7b7)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.1.0

This issue was fixed in the openstack/neutron 14.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/700726
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4f31acd5658fd7fef97950d534e57e9c3c5e9558
Submitter: Zuul
Branch: stable/rocky

commit 4f31acd5658fd7fef97950d534e57e9c3c5e9558
Author: LIU Yulong <email address hidden>
Date: Fri Aug 19 10:16:44 2016 +0800

    DVR: Ignore DHCP port during DVR host query

    For large scale deployment, the dvr router will be installed to
    the scheduled DHCP host. This will definitely increase the l3
    agent service pressure, especially in large number of concurrent
    updates, creation, or agent restart.

    This patch adds a config ``host_dvr_for_dhcp`` for the DHCP port
    device_owner filter during DVR host query. Then if we set
    ``host_dvr_for_dhcp = False``, L3-agent will not host the DVR router
    namespace in its connected networks' DHCP agent hosts.

    Closes-Bug: #1609217
    Change-Id: I53e20be9b306bf9d3b34ec6a31e3afabd5a0fd6f
    (cherry picked from commit 8f057fb49ac637bd0dbf60ca07b89f0e4a59c7b7)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.7

This issue was fixed in the openstack/neutron 13.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron queens-eol

This issue was fixed in the openstack/neutron queens-eol release.
