OpenStack HA Cluster Charm

Comment 16 for bug 1740892

Nish Aravamudan (nacc) wrote on 2018-01-08: Re: [Bug 1740892] Re: corosync upgrade on 2018-01-02 caused pacemaker to fail

On Mon, Jan 8, 2018 at 10:04 AM, Nish Aravamudan
<email address hidden> wrote:
> On Mon, Jan 8, 2018 at 9:51 AM, Nish Aravamudan
> <email address hidden> wrote:
>> On Mon, Jan 8, 2018 at 8:48 AM, Victor Tapia <email address hidden> wrote:
>>> As mentioned by Mario @ #10, stopping corosync while pacemaker runs
>>> throws the same error as the upgrade. Syslog from Xenial +
>>> corosync=2.3.5-3ubuntu1:
>>>
>>> Jan 8 16:24:37 xenial-corosync systemd[1]: Stopping Pacemaker High Availability Cluster Manager...
>>> Jan 8 16:24:37 xenial-corosync pacemakerd[28747]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync crmd[28753]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync crmd[28753]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
>>> Jan 8 16:24:37 xenial-corosync pengine[28752]: notice: Delaying fencing operations until there are resources to manage
>>> Jan 8 16:24:37 xenial-corosync pengine[28752]: notice: Scheduling Node xenial-corosync for shutdown
>>> Jan 8 16:24:37 xenial-corosync pengine[28752]: notice: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-52.bz2
>>> Jan 8 16:24:37 xenial-corosync crmd[28753]: notice: Transition 1 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-52.bz2): Complete
>>> Jan 8 16:24:37 xenial-corosync crmd[28753]: notice: Disconnecting from Corosync
>>> Jan 8 16:24:37 xenial-corosync cib[28748]: warning: new_event_notification (28748-28753-12): Broken pipe (32)
>>> Jan 8 16:24:37 xenial-corosync pengine[28752]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync attrd[28751]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync lrmd[28750]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync stonith-ng[28749]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync cib[28748]: notice: Invoking handler for signal 15: Terminated
>>> Jan 8 16:24:37 xenial-corosync cib[28748]: notice: Disconnecting from Corosync
>>> Jan 8 16:24:37 xenial-corosync cib[28748]: notice: Disconnecting from Corosync
>>> Jan 8 16:24:37 xenial-corosync systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
>>>
>>>
>>> Pacemakerd shuts down by sending SIGTERM to its components, but after the install, corosync does not start pacemaker. BTW, "systemctl restart corosync" restarts both services perfectly.
>>>
>>> I think that the option A from James Page (#11) is the way to go
>>
>> I took a quick look at a LXD container after seeing Felipe and
>> Victor's posts. It seems like this is a bug in the xenial (at least)
>> systemd unit files:
>>
>> # grep pacemaker /lib/systemd/system/corosync.service
>> # pacemaker.service, and if you want to exert the watchdog when a
>>
>> # grep corosync /lib/systemd/system/pacemaker.service
>> After=corosync.service
>> Requires=corosync.service
>> # ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'
>>
>> So, what I see is that corosync.service has no dependency on
>> pacemaker.service (in the file).
>>
>> pacemaker.service will start after corosync.service. And when
>> pacemaker.service is shut down, it will be stopped before corosync.service.
>> Additionally, if pacemaker.service is started, then corosync.service
>> is started as well.
>>
>> Note, nothing specifies what Felipe said -- there is no guarantee that
>> pacemaker is started, restarted, etc. when corosync is.
>>
>> I think the next step is to look at Bionic's systemd services
>> (probably newer) or upstream's and see if there is a difference, or
>> new dependencies added there.
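
For comparing unit files across releases, a small helper along these lines might be useful. This is only a sketch: `unit_deps` is a hypothetical name, and the demo runs on a scratch file holding just the three lines quoted in the grep output above, not the full shipped unit file.

```shell
#!/bin/sh
# List the dependency and ordering directives declared in a unit file.
# Commented-out lines are skipped because the pattern is anchored at line start.
# Usage on a real node: unit_deps /lib/systemd/system/pacemaker.service
unit_deps() {
    grep -E '^(After|Before|Requires|Wants|PartOf|BindsTo|WantedBy|RequiredBy)=' "$1"
}

# Demo on a scratch file with the lines from the grep output quoted above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
After=corosync.service
Requires=corosync.service
# ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'
EOF
unit_deps "$demo"
rm -f "$demo"
```

On a live node, `systemctl show -p After -p Requires pacemaker.service` and `systemctl list-dependencies --reverse corosync.service` report what systemd has actually loaded, which also covers any drop-ins.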
>
> Or perhaps ask upstream what they think is providing this assurance in
> their systemd files, because I'm not seeing it.
>
> If we have a hard dependency between pacemaker and corosync, then I
> think we might need a PartOf directive, in order to ensure they are
> always following the state transitions together.
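
A rough sketch of what the PartOf idea could look like as a drop-in, assuming it is added on the pacemaker side. The function name `write_partof_dropin` and the file name `partof.conf` are made up for illustration. Note that, per the systemd.unit documentation, PartOf= only propagates stop and restart from the listed unit; it does not propagate start.

```shell
#!/bin/sh
# Write a drop-in that adds PartOf=corosync.service to pacemaker.service.
# $1 is the systemd configuration root: /etc/systemd/system on a real node
# (as root), or a scratch directory to preview the result.
write_partof_dropin() {
    dir="$1/pacemaker.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/partof.conf" <<'EOF'
[Unit]
PartOf=corosync.service
EOF
}

# Preview in a scratch directory; nothing on the system is changed.
root=$(mktemp -d)
write_partof_dropin "$root"
cat "$root/pacemaker.service.d/partof.conf"
rm -rf "$root"
```

If this were written under /etc/systemd/system, a `systemctl daemon-reload` would be needed before systemd picks it up.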

Or if that is bad (because it does feel like a layering violation and
maybe it makes sense to have either pacemaker or corosync installed
without the other), the [Install] section of pacemaker.service should say

WantedBy=corosync.service

That will ensure that when corosync.service starts, pacemaker.service
starts. The Requires line ensures that when corosync.service stops,
pacemaker stops (with the order specified by the After).

I think :)
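
One caveat worth sketching: WantedBy= is an [Install] directive, so it only takes effect when the unit is enabled. At that point `systemctl enable` creates a symlink in corosync.service.wants/, which acts like Wants=pacemaker.service on the corosync side. The snippet below emulates that step in a scratch directory. `link_wanted_by` is a hypothetical helper, not something shipped by either package.

```shell
#!/bin/sh
# Emulate what "systemctl enable" does for WantedBy=corosync.service:
# create <root>/corosync.service.wants/pacemaker.service as a symlink to the
# unit file. $1 = systemd configuration root, $2 = path to pacemaker.service.
link_wanted_by() {
    wants="$1/corosync.service.wants"
    mkdir -p "$wants" || return 1
    ln -sf "$2" "$wants/pacemaker.service"
}

# Preview in a scratch directory; nothing on the system is changed.
root=$(mktemp -d)
link_wanted_by "$root" /lib/systemd/system/pacemaker.service
readlink "$root/corosync.service.wants/pacemaker.service"
rm -rf "$root"
```

On a real node the equivalent would be editing the unit's [Install] section and re-running `systemctl enable pacemaker.service`, rather than creating the link by hand.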
