Created subnets on ONE network that duplicated CIDR in case of neutron server active-active

Bug #1532695 reported by Nam
This bug affects 4 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Nam
Milestone: (none)

Bug Description

I have three controllers running neutron-server active-active. I found a bug: when requests arrive at the same time, I can create subnets on one network with duplicate (overlapping) CIDR ranges.

How to reproduce:

Topology: http://codepad.org/ff0debPB

Step 1: Create a network
$ neutron net-create test-net

Step 2: Create multiple subnets with overlapping CIDRs
Run the following commands at the same time:
 - On controller1:
$ neutron subnet-create --name test-subnet1 test-net 192.168.100.0/24
 - On controller2:
$ neutron subnet-create --name test-subnet2 test-net 192.168.100.0/24
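
For illustration, the race can also be triggered from a single client by firing both create calls concurrently, e.g. with a small script along these lines (a rough sketch only; the credentials, auth URL and VIP are placeholders for your environment):

import threading
from neutronclient.v2_0 import client

def make_client():
    # Placeholder credentials/endpoint; adjust for your deployment.
    return client.Client(username='admin', password='secret',
                         tenant_name='admin',
                         auth_url='http://VIP:5000/v2.0')

net = make_client().create_network({'network': {'name': 'test-net'}})['network']

def make_subnet(name):
    # Separate client per thread so the two requests don't share one HTTP session.
    make_client().create_subnet({'subnet': {'network_id': net['id'],
                                            'name': name,
                                            'ip_version': 4,
                                            'cidr': '192.168.100.0/24'}})

threads = [threading.Thread(target=make_subnet, args=('test-subnet%d' % i,))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()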

Then check the subnet list:

Running command: $ neutron subnet-list
This is the result: http://codepad.org/306Vi90d

Running command: $ neutron net-list
This is the result: http://codepad.org/AcYRLP4b

Then check the database:
This is the result: http://codepad.org/4qRC229P

I expected one of the two commands to fail with a message such as: "Invalid input for operation: Requested subnet with cidr: 192.168.100.0/24 for network: 39cc0850-1eeb-4c85-bcdc-338a3f1461aa overlaps with another subnet." But currently both commands succeed.

Nam (namnh)
Changed in neutron:
assignee: nobody → Nam (namnh)
description: updated
Nam (namnh)
summary: - Created subnets on one network that duplicated CIDR in case of neutron
+ Created subnets on ONE network that duplicated CIDR in case of neutron
server active-active
Revision history for this message
Assaf Muller (amuller) wrote :

Confirmed on devstack all-in-one setup with one neutron-server and api_workers > 0, which is the default devstack setup.

http://paste.openstack.org/show/483424/

It failed once (As expected) with a duplicate CIDR error, but the second time I ran it I didn't get an error and had the same CIDR twice on the same network.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
importance: High → Medium
tags: added: l3-ipam-dhcp
Nam (namnh)
Changed in neutron:
assignee: Nam (namnh) → nobody
lane (lane-l)
Changed in neutron:
assignee: nobody → lane (lane-l)
Revision history for this message
Nam (namnh) wrote :

Hi Assaf Muller. Thank you for checking. I would like to explain my setup in more detail:

- First: I built three controller nodes with Devstack, each running nova, neutron, glance, keystone, and horizon. Once that was done, I stopped all of the services.

- Second: I configured a MariaDB Galera cluster for the database and set up time synchronization.

- Third: I configured Pacemaker to create a VIP (virtual IP), set up HAProxy to load-balance requests, and configured a RabbitMQ cluster.

- Finally: I edited the OpenStack service configuration files so that they point to the VIP, and changed the service endpoints in the Keystone database to the VIP.

Then I ran the steps described above and got the result I reported.

Do you have a suggestion for how to fix it?

Nam (namnh)
Changed in neutron:
assignee: lane (lane-l) → Nam (namnh)
assignee: Nam (namnh) → nobody
Revision history for this message
Nam (namnh) wrote :

Oh, sorry lane. Because I'm a newbie I didn't understand what the "Assigned to" field means. Now I understand.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

One work-around to consider is to use subnetpools exclusively. They are pretty robust in preventing overlapping subnets.
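
For reference, a rough sketch of that workaround with python-neutronclient (the pool name, prefixes and client arguments are only examples, not taken from this bug):

from neutronclient.v2_0 import client

# Placeholder credentials/endpoint; adjust for your deployment.
neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://VIP:5000/v2.0')

# Allocate subnets from a subnet pool instead of passing raw CIDRs;
# the pool rejects overlapping allocations.
pool = neutron.create_subnetpool(
    {'subnetpool': {'name': 'test-pool',
                    'prefixes': ['192.168.100.0/24'],
                    'default_prefixlen': 26}})['subnetpool']

net = neutron.create_network({'network': {'name': 'test-net'}})['network']

# Ask the pool for a /26 instead of specifying the CIDR directly.
subnet = neutron.create_subnet(
    {'subnet': {'network_id': net['id'],
                'ip_version': 4,
                'subnetpool_id': pool['id'],
                'prefixlen': 26}})['subnet']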

Normally, we rely on database constraints to avoid duplicate entries. However, this is difficult to do with subnet allocation because they can vary in size and overlap without being exactly equal.

For example, you might try to add 192.168.0.0/24 and 192.168.0.192/27. These overlap and we try not to let this happen. But how do you get the DB to handle this with a constraint? I don't know how to do this all within the DB where I'd ideally like it to be constrained [2]. If only SQL had some sort of unique range constraint where a row could allocate as much of an available range as it wanted as long as no other row's range overlapped [3].

We had to think about this in the subnet allocation code in subnet_alloc.py [1]. Take a look at the _lock_subnetpool method. We iterated a fair amount on this. I'm fairly confident that we ended up with something that would disallow overlapping subnets within a subnetpool.

However, this strategy wasn't applied to the legacy subnet code because we are a lot more loose about how we allow overlap in this IP space. It is really only within the context of a Network or a Router that we check for overlap in the legacy code.

That said, I think the same kind of strategy could be applied to fix this bug. But it must be applied a bit differently, taking into consideration the different scope where address overlap should be prevented.

[1] https://github.com/openstack/neutron/blob/68276dc9614d47d028e46c35bf62668e253af18b/neutron/ipam/subnet_alloc.py#L44
[2] Actually, there is a way, but it could be expensive and ugly. First, you define a minimum subnet size, say /30 or /32. Then you break each subnet into all the fragments of that size that make up the larger subnet (sketched below).

For example, 192.168.0.192/27 could break up into 192.168.0.192/30, 192.168.0.196/30, ..., 192.168.0.220/30 and you insert a row into the DB for each. That table would have a unique constraint on (network_id, subnet_fragment).

The problem is that a small fragment size ends up creating very many DB entries (especially with IPv6), and people complain that a large fragment size is too restrictive.
[3] This isn't the first time I've thought of this. I've even wanted multi-dimensional range queries before, but I don't know of any such feature available in SQL.
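
To make the fragmentation idea in [2] concrete, here is a small sketch with Python's ipaddress module (the fragment size and the implied table/column names are hypothetical):

import ipaddress

# Hypothetical minimum fragment size; every subnet would be stored as rows
# of this size under a unique constraint on (network_id, fragment).
FRAGMENT_PREFIXLEN = 30

def subnet_fragments(cidr):
    """Yield the fixed-size fragments that cover `cidr`."""
    net = ipaddress.ip_network(cidr)
    if net.prefixlen >= FRAGMENT_PREFIXLEN:
        # Round small subnets up to the fragment size so they still collide.
        yield net.supernet(new_prefix=FRAGMENT_PREFIXLEN)
    else:
        yield from net.subnets(new_prefix=FRAGMENT_PREFIXLEN)

# 192.168.0.192/27 expands to eight /30 rows; any overlapping subnet would
# produce at least one identical fragment and violate the unique constraint.
for fragment in subnet_fragments('192.168.0.192/27'):
    print(fragment)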

Revision history for this message
Assaf Muller (amuller) wrote :

@Carl, if solving this solely via the DB is difficult (and I understand why it is), it sounds like a good candidate for application-level distributed locking. I was there for the Tokyo design session where it was decided to use a library called Tooz [1], which as I understand it is an abstraction layer over distributed locking mechanisms. The default chosen in that session is Zookeeper. We could lock the method that writes a CIDR to the DB; the lock name would be composed of the method name + the network_id to solve this particular bug.

[1] https://pypi.python.org/pypi/tooz
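
A minimal sketch of that idea with Tooz might look like the following (the backend URL, member id and lock-name scheme are illustrations only, not what Neutron actually ended up doing):

from tooz import coordination

# Illustrative backend and member id only.
coordinator = coordination.get_coordinator(
    'zookeeper://127.0.0.1:2181', b'neutron-server-1')
coordinator.start()

def create_subnet_locked(network_id, create_subnet, subnet_data):
    # Lock name = method name + network id, so only subnet creations on
    # the same network serialize against each other.
    lock = coordinator.get_lock(('create_subnet-%s' % network_id).encode())
    with lock:
        # Inside the lock it is safe to re-check for overlapping CIDRs
        # and then insert the new subnet.
        return create_subnet(subnet_data)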

Nam (namnh)
Changed in neutron:
assignee: nobody → Nam (namnh)
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

@Assaf, I don't think we need to go that far. Take a look at [1]. I think something similar could be done. Like I said, we solved this within the context of a subnetpool and the solution could be extended to the context of a Network (this bug) or a router.

My point was that we can't just use a traditional DB constraint for this. A little extra work needs to be done to lock and synchronize between workers around the network.

[1] https://github.com/openstack/neutron/blob/68276dc9614d47d028e46c35bf62668e253af18b/neutron/ipam/subnet_alloc.py#L44
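
Roughly, the _lock_subnetpool-style approach is a compare-and-swap on a per-scope counter rather than a lock service. A hedged SQLAlchemy sketch of what it could look like per network (the cas_counter column and helper names are hypothetical, not Neutron's real schema):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Network(Base):
    __tablename__ = 'networks'
    id = Column(String(36), primary_key=True)
    cas_counter = Column(Integer, nullable=False, default=0)

def create_subnet_with_cas(session, network_id, check_overlap, insert_subnet):
    """Run inside one DB transaction; the caller retries on failure."""
    current = session.query(Network.cas_counter).filter_by(
        id=network_id).scalar()
    check_overlap(session, network_id)   # validate against existing subnets
    insert_subnet(session, network_id)   # add the new subnet row
    # Bump the counter only if nobody else changed it in the meantime;
    # rowcount == 0 means another worker won the race and we must retry.
    updated = session.query(Network).filter_by(
        id=network_id, cas_counter=current).update(
        {'cas_counter': current + 1})
    if not updated:
        raise RuntimeError('concurrent subnet creation on network %s, retry'
                           % network_id)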

Revision history for this message
Nam (namnh) wrote :

Hi Assaf and Carl Baldwin. Thank you so much for your comments. Could you please review my idea for fixing this bug?
Link: https://review.openstack.org/#/c/267470/

tags: added: needs-attention
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/314054

Changed in neutron:
assignee: Nam (namnh) → Mike Bayer (zzzeek)
Changed in neutron:
assignee: Mike Bayer (zzzeek) → Nam (namnh)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/350953

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/350953
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5264ab966d3db7b1cb698190872cbe6feaf48464
Submitter: Jenkins
Branch: master

commit 5264ab966d3db7b1cb698190872cbe6feaf48464
Author: Nam Nguyen Hoai <email address hidden>
Date: Fri Aug 5 09:46:43 2016 +0700

    Using revision_number to ensure no overlap in *one* network

    This patch uses the revision_number column in the database. When
    creating a subnet on a network, the revision_number of the network
    will be increased. That prevents overlapping CIDRs (i.e. subnets
    whose CIDRs overlap) on *one* network.

    Basically, when concurrent requests create subnets on *one*
    network, only one request succeeds; the other requests need to
    retry.

    Change-Id: Id6548535075bed87a4b36e1462db546ab9163f29
    Closes-Bug: #1532695
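
In effect the merged change turns subnet creation into optimistic concurrency control: the revision_number bump fails for all but one concurrent request, and the losers simply retry. A rough sketch of the caller-side retry loop (names and backoff are illustrative, not the exact code in the patch):

import random
import time

class ConcurrentModification(Exception):
    """Raised when the network's revision_number changed under us."""

def create_subnet_with_retry(create_once, max_retries=10):
    # create_once() bumps the network's revision_number and raises
    # ConcurrentModification if another request bumped it first.
    for attempt in range(max_retries):
        try:
            return create_once()
        except ConcurrentModification:
            # Back off briefly so competing workers interleave.
            time.sleep(random.uniform(0, 0.1 * (attempt + 1)))
    raise ConcurrentModification('could not create subnet after %d retries'
                                 % max_retries)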

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Nam Nguyen Hoai (<email address hidden>) on branch: master
Review: https://review.openstack.org/267470

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Nam Nguyen Hoai (<email address hidden>) on branch: master
Review: https://review.openstack.org/314054
Reason: This bug can be fixed by this patch set [1]

[1] https://review.openstack.org/#/c/350953/
