[2.0a1] Rack controller unable to register with regiond

Bug #1553617 reported by Mark Shuttleworth
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 2.0.0

Bug Description

I have a non-functional MAAS 2.0a1 and I believe the reason is that the rack controller is failing to register with the region controller. I see a lot of this in the regiond.log:

2016-03-05 23:16:50 [RegionServer,4387,127.0.0.1] Rack controller 'None' disconnected.
2016-03-05 23:16:50 [RegionServer,4387,127.0.0.1] RegionServer connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 5250) PEER:IPv4Address(TCP, '127.0.0.1', 36914))
2016-03-05 23:16:50 [-] Failed to register rack controller 'None' into the database. Connection has been dropped.
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
     _inlineCallbacks(r, g, deferred)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/maasserver/rpc/regionservice.py", line 523, in register
     log.err(exc, msg)
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/maasserver/rpc/regionservice.py", line 481, in register
     nodegroup_uuid=nodegroup_uuid)
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
     result = inContext.theWork()
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
     inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 197, in wrapper
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 448, in call_within_transaction
     return func_outside_txn(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/maasserver/utils/orm.py", line 275, in retrier
     return func(*args, **kwargs)
   File "/usr/lib/python3.5/contextlib.py", line 30, in inner
     return func(*args, **kwds)
   File "/usr/lib/python3/dist-packages/maasserver/rpc/rackcontrollers.py", line 56, in register_rackcontroller
     rackcontroller.update_interfaces(interfaces) # Calls save.
   File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 2932, in update_interfaces
     interface = self._update_interface(name, interfaces[name])
   File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 2964, in _update_interface
     return self._update_bond_interface(name, config)
   File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 3156, in _update_bond_interface
     interface.vlan = parent_nics[0].vlan
 builtins.IndexError: list index out of range

Revision history for this message
Mike Pontillo (mpontillo) wrote :

From the traceback, it looks like MAAS was unable to determine which interfaces make up your bond interface.

To confirm, it would be helpful if you could let us know the output of the following commands, when run from the rack controller:

find -H /sys/class/net/* | grep bond

python3 -c "from provisioningserver.networks import get_interfaces_definition; from pprint import pprint; pprint(get_interfaces_definition()[0])"

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.0.0
Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

OK, attaching the bond sysfs output.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

And the output from the Python command

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

The network on my MAAS server is a little unusual in that it has two bonds, one of which does not have an IP address allocated.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

It seems that on your system the currently active configuration has *zero* interfaces in the bond (although "bond-backup" may be ready to configure if needed). Perhaps ifupdown does not configure them as bond slaves until you run "ifup" on the interface.

MAAS did not account for this edge case, and has made the assumption that all bonds that exist require at least one backing interface. It's possible that if we had parsed the contents of /etc/network/interfaces, we could have determined which parents were in the bond, which would have avoided this issue. (We have code to do this, but it was excluded from the alpha for reliability reasons.)

Proposal: MAAS should ignore bond interfaces which do not have parents. (After MAAS once again parses /e/n/i, you'll at least see "bond-backup" in the UI, assuming it is configured with at least one parent.)
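
The proposal above could be sketched roughly as follows. This is only an illustration, not the actual MAAS patch: the layout of the `interfaces` dictionary (the "type" and "parents" keys) and the helper name `drop_parentless_bonds` are assumptions made for the example.

```python
def drop_parentless_bonds(interfaces):
    """Return a copy of `interfaces` without bonds that have no parents.

    `interfaces` maps an interface name to a dict with the (hypothetical)
    keys "type" and "parents".
    """
    return {
        name: config
        for name, config in interfaces.items()
        # Skip only entries that are bonds AND have an empty parent list.
        if not (config.get("type") == "bond" and not config.get("parents"))
    }


# Made-up example data resembling the situation in this bug.
interfaces = {
    "p1p1": {"type": "physical", "parents": []},
    "bond-lan": {"type": "bond", "parents": ["p1p1"]},
    "bond-backup": {"type": "bond", "parents": []},
}
usable = drop_parentless_bonds(interfaces)
print(sorted(usable))
```

With a filter like this in place, code that later reads the first parent of a bond would never see an empty parent list.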

Revision history for this message
Mike Pontillo (mpontillo) wrote :

If you want to get up and running quickly, you could try this (completely untested) patch:

https://paste.ubuntu.com/15304415/

Assuming you place it in /tmp/patch, it could be applied as follows:

sudo patch -p1 -d /usr/lib/python3/dist-packages -i /tmp/patch

Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1553617] Re: [2.0a1] Rack controller unable to register with regiond

That's right - I had left an "auto bond-backup" but commented out the
auto stanza for each of the slaves in bond-backup. This is not a normal
situation :) I don't think MAAS should CRASH but I also don't think it
should present to the user an interface that is so dysfunctional.

So let's leave this bug to address the crash, but not add any complexity
where we guess the intended outcome of a bond with no members.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Actually, don't use that patch; I missed the "else" statement below which will cause it to fail.

I'll get you something tested. =)

Revision history for this message
Mike Pontillo (mpontillo) wrote :

FYI, this patch passes unit testing, but I haven't tested it end-to-end on a system with oddly-configured bonds yet.

https://paste.ubuntu.com/15304645/

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

I've patched and rebooted to ensure that the new code is running, and still see constant errors in the logs. In the rackd.log I see:

2016-03-06 18:19:12+0000 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '127.0.0.1', 59280) PEER:IPv4Address(TCP, '127.0.0.1', 5252))
2016-03-06 18:19:12+0000 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '192.168.9.2', 60698) PEER:IPv4Address(TCP, '192.168.9.2', 5250))
2016-03-06 18:19:12+0000 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '192.168.9.2', 50788) PEER:IPv4Address(TCP, '192.168.9.2', 5253))
2016-03-06 18:19:12+0000 [Uninitialized] ClusterClient connection established (HOST:IPv4Address(TCP, '127.0.0.1', 36334) PEER:IPv4Address(TCP, '127.0.0.1', 5251))
2016-03-06 18:19:12+0000 [ClusterClient,client] Event-loop 'maas:pid=2784' authenticated.
2016-03-06 18:19:12+0000 [ClusterClient,client] Event-loop 'maas:pid=2792' authenticated.
2016-03-06 18:19:12+0000 [ClusterClient,client] Event-loop 'maas:pid=2796' authenticated.
2016-03-06 18:19:12+0000 [ClusterClient,client] Event-loop 'maas:pid=2780' authenticated.
2016-03-06 18:19:12+0000 [ClusterClient,client] Rack controller REJECTED by the region (via maas:pid=2784).
2016-03-06 18:19:12+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 59280) PEER:IPv4Address(TCP, '127.0.0.1', 5252))
2016-03-06 18:19:12+0000 [ClusterClient,client] Rack controller REJECTED by the region (via maas:pid=2780).
2016-03-06 18:19:12+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 36334) PEER:IPv4Address(TCP, '127.0.0.1', 5251))
2016-03-06 18:19:13+0000 [ClusterClient,client] Rack controller REJECTED by the region (via maas:pid=2792).
2016-03-06 18:19:13+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '192.168.9.2', 60698) PEER:IPv4Address(TCP, '192.168.9.2', 5250))
2016-03-06 18:19:13+0000 [ClusterClient,client] Rack controller REJECTED by the region (via maas:pid=2796).
2016-03-06 18:19:13+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '192.168.9.2', 50788) PEER:IPv4Address(TCP, '192.168.9.2', 5253))

And in the regiond.log I see:

2016-03-06 18:21:07 [RegionServer,358,192.168.9.2] Rack controller 'None' disconnected.
2016-03-06 18:21:07 [RegionServer,358,192.168.9.2] RegionServer connection lost (HOST:IPv4Address(TCP, '192.168.9.2', 5250) PEER:IPv4Address(TCP, '192.168.9.2', 34594))
2016-03-06 18:21:08 [-] 127.0.0.1 - - [06/Mar/2016:18:21:07 +0000] "GET /MAAS/rpc/ HTTP/1.0" 200 268 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
2016-03-06 18:21:08 [twisted.internet.protocol.Factory] RegionServer connection established (HOST:IPv4Address(TCP, '127.0.0.1', 5252) PEER:IPv4Address(TCP, '127.0.0.1', 33214))
2016-03-06 18:21:08 [twisted.internet.protocol.Factory] RegionServer connection established (HOST:IPv4Address(TCP, '192.168.9.2', 5250) PEER:IPv4Address(TCP, '192.168.9.2', 34632))
2016-03-06 18:21:08 [twisted.internet.protocol.Factory] RegionServer connection established (HOST:IPv4Address(TCP, '192.168.9.2', 5253) PEER...

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

It seems the process of registering the rack controller with the region controller is a little fragile. I would have thought it would be a simple two-phase process, starting on the region controller:

  $ maas-region add-rackd
  Now run "maas-rack register t0k1n-one-thyme http://192.168.3.34/" on the rack controller.

Then on the rack controller:

  $ maas-rack register t0k1n... http://192.../
  Rack controller registered to "garage" region.

At that stage, I would think the rack controller could update the region controller about its interfaces.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I have configured a local test bed similarly, with dual bonds (one of them without any backing interfaces) and I've reproduced the issue.

My assessment: the initial fix to ignore invalid bond interfaces worked. However, it uncovered a surprising issue. It turns out that when bond interfaces are configured, each of the bond's parents (in your case, p1p1, p1p2, p2p1, and p2p2) is updated so that its MAC address matches what is configured on bond0. MAAS did not account for this, and was using the MAC address as a unique identifying characteristic of the interface. When the rack controller registers, each bond member overwrites the previous bond member. Then when MAAS goes to register the bond, only one out of its N parent interfaces is found in the database.

We didn't catch this in our unit testing because our tests didn't account for this "MAC replacement" behavior.
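
The overwrite described above can be reproduced with a toy model. This is a sketch under stated assumptions, not MAAS code: `register_by_mac` is a hypothetical helper that stands in for "store each interface keyed by its MAC address", and the MAC address is made up.

```python
def register_by_mac(interfaces):
    """Store interfaces keyed by MAC address only (the flawed assumption)."""
    by_mac = {}
    for name, mac in interfaces:
        # A later interface with the same MAC replaces the earlier one.
        by_mac[mac] = name
    return by_mac


bond_mac = "00:11:22:33:44:55"  # made-up address shared by all bond members
members = [
    ("p1p1", bond_mac),
    ("p1p2", bond_mac),
    ("p2p1", bond_mac),
    ("p2p2", bond_mac),
]
stored = register_by_mac(members)
print(len(members), "reported,", len(stored), "stored")
```

Four interfaces are reported but only the last one survives, which matches the symptom of the bond finding just one of its parents.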

I like your suggestion regarding the simplicity of rack registration, but let's discuss that separately so we can keep this bug focused on the issues you've uncovered with registering racks containing bond interfaces.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

It looks like in order to solve this, we'll have to cross-reference the interface data we gather with what we find in /proc/net/bonding/*.

For example, on your system, we'll need to parse /proc/net/bonding/bond-lan to gather the *original* MAC addresses for your p?p? interfaces.
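
A minimal sketch of that parsing step follows. It assumes the usual layout of /proc/net/bonding/<bond>, where each "Slave Interface:" line is followed by a "Permanent HW addr:" line; the function name `parse_permanent_macs` and the sample text (including its MAC addresses) are made up for illustration and are not taken from the actual fix.

```python
import re

SLAVE_RE = re.compile(r"^Slave Interface:\s*(\S+)$")
HWADDR_RE = re.compile(r"^Permanent HW addr:\s*(\S+)$")


def parse_permanent_macs(text):
    """Map each slave interface name to its permanent (original) MAC."""
    macs = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = SLAVE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        match = HWADDR_RE.match(line)
        if match and current is not None:
            macs[current] = match.group(1).lower()
    return macs


# Made-up sample in the style of /proc/net/bonding/bond-lan.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
MII Status: up

Slave Interface: p1p1
MII Status: up
Permanent HW addr: 00:11:22:33:44:01

Slave Interface: p1p2
MII Status: up
Permanent HW addr: 00:11:22:33:44:02
"""

print(parse_permanent_macs(SAMPLE))
```

On a real system the text would come from reading the file itself, for example `open("/proc/net/bonding/bond-lan").read()`.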

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Thanks Mike. In general we're going to run into some awkwardness if we
use mac addresses as a key. Mac addresses can move about. There's the
bonding thing, and I think quite a few things futz with mac addresses as
well, like macvlan's etc. And of course people can move network cards
around just for fun, too :) I don't have any off-the-cuff idea for a
better identifier of the hardware, but it bears thinking about.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

It's interesting that you mention MACs moving about; we actually use this assumption to determine when someone has moved a NIC between machines. (We'll notice this, and log it, if we see a MAC move during commissioning.)

In general, I don't think we assume that MACs are globally unique for all interface types, but we *do* make the assumption for physical interfaces, which is why we saw this second issue. You're right that we should watch out for other cases like this which may lead to duplicate MACs. (For example, we may currently treat bridges as "physical" interfaces, in which case we might see a similar issue if someone configures a bridge with the same MAC as their physical interface.)

Here's a preview of a change that parses /proc/net/bonding/* and adjusts the physical interface MACs back to their original values:

https://paste.ubuntu.com/15317298/

I tested it on the local test bed I used to reproduce the issue, and it allowed the rack to register successfully! If you want to try it, just replace the contents of /usr/lib/python3/dist-packages/provisioningserver/utils/ipaddr.py with the contents of that pastebin. (I figured it's easier that way, since I already had you patch that file once.)

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Yes, confirmed fixed, thanks Mike :)

Will keep testing now. Really appreciate the digging and being unblocked!

Mark

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
FanFan (fkpwolf) wrote :

I have the same issue and just came across this topic. In my case, I found the root cause is that MAAS uses the interface name + MAC address to bind/create a "fabric". If I change to another MAC address (I use KVM for the MAAS machine) but keep the interface name, MAAS can't add a rack controller.
