machine unit connects to apiserver but stays in agent-state: pending

Bug #1393444 reported by JuanJo Ciarlante
This bug affects 1 person
Affects             Status   Importance  Assigned to  Milestone
juju-core           Invalid  Medium      Unassigned
juju-core (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

FYI this is the same environment from lp#1392810 (1.18->1.19->1.20),
juju version: 1.20.11-trusty-amd64

New units deployed (to LXC over maas) stay at "agent-state: pending":
http://paste.ubuntu.com/9057045/

#1 TCP connects ok to node0:17070
- at the unit:
ubuntu@juju-machine-18-lxc-5:~$ netstat -tn
tcp 0 0 x.x.x.167:57937 x.x.x.8:17070 ESTABLISHED

- at node0:
ubuntu@node0:~$ sudo netstat -tnp | grep 167
tcp6 0 3807 x.x.x.8:17070 x.x.x.167:57937 ESTABLISHED 1993/jujud

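Interesting there: node0's socket has 3807 bytes queued. Per netstat's
column order (Recv-Q, then Send-Q) this is the send queue, i.e. data sent
by jujud that the unit has not yet acknowledged, rather than a receive
queue left unread by jujud.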
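
For reference when reading the netstat lines above: on Linux, `netstat -tn` prints Recv-Q and Send-Q as the second and third columns. The following is a minimal sketch (not part of the original report) that labels those fields for the node0 line quoted above; the `TcpSocketLine` and `parse_netstat_line` names are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple


class TcpSocketLine(NamedTuple):
    proto: str
    recv_q: int  # bytes received by the kernel but not yet read by the local process
    send_q: int  # bytes sent but not yet acknowledged by the remote peer
    local: str
    remote: str
    state: str


def parse_netstat_line(line: str) -> TcpSocketLine:
    """Parse one socket line of Linux `netstat -tn` output (hypothetical helper)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"unexpected netstat line: {line!r}")
    # Any trailing column (e.g. PID/Program name from -p) is ignored.
    proto, recv_q, send_q, local, remote, state = fields[:6]
    return TcpSocketLine(proto, int(recv_q), int(send_q), local, remote, state)


# The node0 line from the report, addresses masked as in the original.
sock = parse_netstat_line(
    "tcp6 0 3807 x.x.x.8:17070 x.x.x.167:57937 ESTABLISHED 1993/jujud"
)
print(f"Recv-Q={sock.recv_q} Send-Q={sock.send_q}")
```

A non-zero Send-Q that does not drain suggests the peer is not acknowledging data, which is one symptom to expect when large segments are dropped on the path.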

#2 machine-0.log:
- nothing shows at unit's connection time
(ie restart jujud-machine-18-lxc-5)

- after 4~5minutes, connection drops, and this is logged:
2014-11-17 14:28:56 ERROR juju.state.apiserver.common resource.go:102 error stopping *apiserver.pingTimeout resource: ping timeout

JuanJo Ciarlante (jjo)
tags: added: canonical-bootstack
tags: added: canonical-is
summary: - machine unit connects to apiserver but doesn't deploy service
+ machine unit connects to apiserver but stays in agent-state: pending
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: upgrade-juju
tags: added: lxc
Changed in juju-core:
milestone: none → 1.22
JuanJo Ciarlante (jjo) wrote :

/var/log/juju/machine-18-lxc-5.log: http://paste.ubuntu.com/9057287/
NOTE: the repeated log stanzas there are due to my manual restarts.

JuanJo Ciarlante (jjo) wrote :

strace at both sides (grepped for specific sockets): http://paste.ubuntu.com/9057691/,
mind the subsecond date diff.

JuanJo Ciarlante (jjo) wrote :

This deployment has 2 metal nodes hosting LXC units (machines 0
and 18). 'juju deploy cs:ubuntu --to lxc:0' works fine, while
'--to lxc:18' was consistently failing as described above.

FYI I've worked around this by removing machine 18 down to
'maas ready' and reacquiring it from juju, now all new LXC
units there behave normally.

IMO it is still worth digging into which leftover state bits for that
machine were triggering this issue. I copied a juju backup
tarball to ~natefinch in case this is feasible.

Changed in juju-core:
milestone: 1.22-alpha1 → 1.23
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → none
importance: High → Medium
JuanJo Ciarlante (jjo) wrote :

@sinzui: closing this as invalid, as I later confirmed this to be an MTU issue.
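
The thread does not say how the MTU problem was confirmed. One common way to check for a path MTU mismatch is to send ICMP echoes with the Don't Fragment bit set and find the largest payload that gets through. The sketch below assumes Linux iputils `ping` (`-M do` sets DF, `-s` sets the payload size) and that any payload up to the path limit succeeds; `ping_df` and `largest_payload` are hypothetical helper names, not juju or MAAS tooling.

```python
import subprocess
from typing import Callable

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_df(host: str, payload: int) -> bool:
    """Send one echo request with Don't Fragment set (assumes Linux iputils ping)."""
    cmd = ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def largest_payload(probe: Callable[[int], bool], low: int = 0, high: int = 8972) -> int:
    """Binary-search the largest payload for which probe() succeeds.

    Returns -1 if even the smallest payload fails. Assumes probe() is
    monotonic: every payload at or below the limit succeeds.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

For example, `largest_payload(lambda n: ping_df("x.x.x.167", n)) + 28` would estimate the path MTU towards the unit (the address is the masked placeholder from the report). A result below the interface MTU configured on either end would be consistent with small packets passing while larger ones are silently dropped.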

Changed in juju-core:
status: Triaged → Invalid
Changed in juju-core (Ubuntu):
status: New → Invalid