[bionic] fence_scsi not working properly with Pacemaker 1.1.18-2ubuntu1.1

Bug #1865523 reported by Rafael David Tinoco
This bug affects 1 person
Affects                  Status         Importance   Assigned to            Milestone
fence-agents (Ubuntu)    Fix Released   Undecided    Unassigned
  Bionic                 Fix Released   Medium       Rafael David Tinoco
  Disco                  Won't Fix      Medium       Rafael David Tinoco
  Eoan                   Fix Released   Undecided    Unassigned

Bug Description

Note: I have split this bug into 2 bugs:
     - fence-agents (this one) and pacemaker (LP: #1866119)

#### SRU: fence-agents

[Impact]

 * fence_scsi is currently not working in a shared-disk environment.

 * All clusters relying on fence_scsi and/or fence_scsi + watchdog won't be able to start the fencing agents or, in the worst case, the fence_scsi agent might start but won't make SCSI reservations on the shared SCSI disk.

[Test Case]

 * having a 3-node setup, nodes called "clubionic01, clubionic02, clubionic03", with a shared scsi disk (fully supporting persistent reservations) /dev/sda, one might try the following command:

sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off

from node "clubionic02" or "clubionic03" and check whether the reservation worked (with the broken agent, no keys are registered):

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there is NO reservation held

 * having a 3-node setup, nodes called "clubionic01, clubionic02, clubionic03", with a shared scsi disk (fully supporting persistent reservations) /dev/sda, with corosync and pacemaker operational and running, one might try:

rafaeldtinoco@clubionic01:~$ crm configure
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-policy=stop
crm(live)configure# property have-watchdog=true
crm(live)configure# property symmetric-cluster=true
crm(live)configure# commit
crm(live)configure# end
crm(live)# end

rafaeldtinoco@clubionic01:~$ crm configure primitive fence_clubionic \
    stonith:fence_scsi params \
    pcmk_host_list="clubionic01 clubionic02 clubionic03" \
    devices="/dev/sda" \
    meta provides=unfencing

and see that crm_mon does not show the fence_clubionic resource as operational.
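
The reservation check from the first test case can be automated. The following is a minimal sketch, not part of the fix: it only parses text in the format of the `sg_persist --in --read-keys` output shown in this report, and the function name `parse_registered_keys` is hypothetical.

```python
import re
from typing import List


def parse_registered_keys(output: str) -> List[str]:
    """Return the hex keys listed in `sg_persist --in --read-keys` output.

    Assumption: each registered key is printed alone on its own line,
    as in the console output quoted in this report.
    """
    return re.findall(r"^[ \t]*(0x[0-9a-fA-F]+)[ \t]*$", output, flags=re.MULTILINE)


# Output copied from this report: broken agent, nothing registered.
SAMPLE_EMPTY = """\
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys
"""

# Output copied from this report: working agent, three nodes registered.
SAMPLE_REGISTERED = """\
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x3, 3 registered reservation keys follow:
    0x3abe0000
    0x3abe0001
    0x3abe0002
"""

print(parse_registered_keys(SAMPLE_EMPTY))
print(parse_registered_keys(SAMPLE_REGISTERED))
```

An empty list corresponds to the failing case described above; a list with one key per cluster node corresponds to a working agent.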

[Regression Potential]

 * Fix involves adding new cmdline and stdin arguments to the fencing agents. Both changes in that direction (normalizing "-" with "_" and deprecating some commands in favor of others) keep the existing commands working and allow the new commands to work as well (that part is the fix, because of the integration with pacemaker).

 * Comments #3 and #4 show this new version fully working.

 * This is quite a complex change, and I'd appreciate leaving it in -proposed for a while longer (15 days?) for a higher chance of detecting issues. Furthermore, there has been no update since the Bionic release, so users could, in the worst case (and only then), report a bug and downgrade to the former version.

 * Judging by this issue, it is very likely that any Ubuntu user who has tried using fence_scsi has migrated to a newer version, because the fence_scsi agent has been broken since its release.
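
The argument normalization mentioned in the first item above (treating "-" and "_" in option names as equivalent) can be illustrated with a short sketch. This is not the fence-agents code: the function `parse_stdin_options`, the choice of underscores as the canonical form, and the sample option lines are all assumptions made for illustration.

```python
from typing import Dict


def parse_stdin_options(text: str) -> Dict[str, str]:
    """Parse `name=value` lines and normalize '-' to '_' in option names.

    Hypothetical helper: it only demonstrates that both spellings of an
    option name end up under the same key, so old and new callers work.
    """
    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not name=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name.strip().replace("-", "_")] = value.strip()
    return options


# Illustrative input only; the option names are sample values.
old_style = "action=off\npower-wait=5\n"
new_style = "action=off\npower_wait=5\n"

print(parse_stdin_options(old_style))
print(parse_stdin_options(old_style) == parse_stdin_options(new_style))
```

Because both spellings map to the same key, a caller that sends the older form and a caller that sends the newer form are handled identically, which is the compatibility property the regression analysis relies on.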

[Other Info]

 * The way I fixed fence_scsi was this:

I packaged the latest 1.1.X version of Pacemaker and kept it "vanilla" so I could bisect fence-agents. At that point I realized that bisecting was going to be hard because there were multiple issues, not just one. I backported the latest fence-agents together with Pacemaker 1.1.19-0 and saw that it worked.

From then on, I bisected the following intervals:

4.3.0 .. 4.4.0 (eoan - working)
4.2.0 .. 4.3.0
4.1.0 .. 4.2.0
4.0.25 .. 4.1.0 (bionic - not working)

In each of those intervals I discovered issues. For example, using 4.3.0 I faced problems, so I had to backport fixes from between 4.3.0 and 4.4.0. Then, backporting to 4.2.0, I faced issues, so I had to backport fixes from the 4.2.0 <-> 4.3.0 interval. I did this until I reached 4.0.25, the current Bionic fence-agents version.

 * Original Description:

Trying to set up a cluster with an iSCSI shared disk, using fence_scsi as the fencing mechanism, I realized that fence_scsi is not working in Ubuntu Bionic. I first thought it was related to the Azure environment (LP: #1864419), where I was testing this setup, but then, trying locally, I figured out that somehow pacemaker 1.1.18 is not fencing the shared SCSI disk properly.

Note: I was able to "backport" vanilla 1.1.19 from upstream and fence_scsi worked. I then tried 1.1.18 without any of the quilt patches and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19 might tell us which commit fixed the behaviour needed by the fence_scsi agent.

(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
        meta provides=unfencing
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop \
        symmetric-cluster=true

----

(k)rafaeldtinoco@clubionic02:~$ sudo crm_mon -1
Stack: corosync
Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Mar 2 15:55:30 2020
Last change: Mon Mar 2 15:45:33 2020 by root via cibadmin on clubionic01.private

3 nodes configured
1 resource configured

Online: [ clubionic01.private clubionic02.private clubionic03.private ]

Active resources:

 fence_clubionic (stonith:fence_scsi): Started clubionic01.private

----

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there is NO reservation held


Changed in pacemaker (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in pacemaker (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
importance: Medium → Undecided
status: Confirmed → Triaged
no longer affects: pacemaker (Ubuntu Focal)
no longer affects: pacemaker (Ubuntu Eoan)
no longer affects: pacemaker (Ubuntu Disco)
Changed in pacemaker (Ubuntu):
status: Triaged → Fix Released
Changed in pacemaker (Ubuntu Bionic):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Eoan):
status: New → Fix Released
Changed in fence-agents (Ubuntu Focal):
status: New → Fix Released
Changed in fence-agents (Ubuntu Disco):
status: New → Confirmed
Changed in fence-agents (Ubuntu Bionic):
status: New → Confirmed
importance: Undecided → Medium
Changed in fence-agents (Ubuntu Disco):
importance: Undecided → Medium
Changed in pacemaker (Ubuntu Bionic):
importance: High → Medium
Changed in fence-agents (Ubuntu Bionic):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Disco):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Rafael David Tinoco (rafaeldtinoco) wrote :

To make this work I had to use pacemaker from upstream (Vanilla) version: 1.1.19-0

$ dpkg -l | grep 1.1.19 | awk '{print $2" "$3}'
libcib4:amd64 1.1.19-0ubuntu1
libcrmcluster4:amd64 1.1.19-0ubuntu1
libcrmcommon3:amd64 1.1.19-0ubuntu1
libcrmservice3:amd64 1.1.19-0ubuntu1
liblrmd1:amd64 1.1.19-0ubuntu1
libpe-rules2:amd64 1.1.19-0ubuntu1
libpe-status10:amd64 1.1.19-0ubuntu1
libpengine10:amd64 1.1.19-0ubuntu1
libstonithd2:amd64 1.1.19-0ubuntu1
libtransitioner2:amd64 1.1.19-0ubuntu1
pacemaker 1.1.19-0ubuntu1
pacemaker-cli-utils 1.1.19-0ubuntu1
pacemaker-common 1.1.19-0ubuntu1
pacemaker-doc 1.1.19-0ubuntu1
pacemaker-resource-agents 1.1.19-0ubuntu1

AND fence-agents from Ubuntu Eoan:

fence-agents 4.2.1-1

Only with that combination was I able to make the fence_scsi agent work:

(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="clubionic01.private clubionic02.private clubionic03.private" devices="/dev/sda" \
        meta provides=unfencing
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.19-1.1.19 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=true \
        stonith-action=off \
        no-quorum-policy=stop

with proper reservations being made:

(k)rafaeldtinoco@clubionic03:~$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x4, 3 registered reservation keys follow:
    0x3abe0002
    0x3abe0000
    0x3abe0001

Rafael David Tinoco (rafaeldtinoco) wrote :

# For Ubuntu Bionic:

Okay, after bisecting fence_scsi and monitoring all its functions, I was able to isolate the patches that I need to bring to Bionic to make the agent operational while keeping it compatible with the existing version:

Note: all tests were conducted with Pacemaker v1.1.19-0ubuntu1, which is not the default in Ubuntu Bionic. I have kept Pacemaker v1.1.19 "vanilla" in order to better isolate all fixes for fence-agents. Now I'm able to create a fixed fence-agents package for Ubuntu Bionic AND fix Pacemaker.

# Ubuntu Bionic SRU: Fence Agents v4.0.25 PLUS the following fixes/commits ordered by date:

commit 81b8370844f5aecaee5e7178d82670c70399d824
Author: Oyvind Albrigtsen <email address hidden>
Date: Mon Jul 24 14:12:15 2017

    fence_scsi: add FIPS support

commit eae9d029b7073e7eb8c7ba4df9ec19b755a8f603
Author: Oyvind Albrigtsen <email address hidden>
Date: Wed Sep 27 12:26:38 2017

    fix for ignored options

commit c6f29a653114523e9ac3644aed958b4bb43f3b41
Author: Oyvind Albrigtsen <email address hidden>
Date: Wed Sep 27 12:42:39 2017

    Maintain ABI compatibility for external agents

commit 746fd55b061aa28b27aac5a1bb38714a95812592
Author: Reid Wahl <email address hidden>
Date: Fri Apr 6 18:31:30 2018

    Low: fence_scsi: Remove period from cmd string

commit bec154345d2291c9051c16277de9054387dc9707
Author: Oyvind Albrigtsen <email address hidden>
Date: Thu Apr 19 11:30:53 2018

    fence_scsi: fix plug-parameter and keep support for nodename to avoid regressions

commit 335aca4e54e4ec46b9b5d86ef30a7d9348e6a216
Author: Valentin Vidic <email address hidden>
Date: Wed May 23 12:51:23 2018

    fence_scsi: fix python3 encoding error #206

commit f77297b654586bf539e78957f26cae1d22c6f081
Author: Oyvind Albrigtsen <email address hidden>
Date: Fri Nov 2 08:24:56 2018

    fence_scsi: fix incorrect SCSI key when node ID is 10 or higher

      The last four digits of the SCSI key will be zero padded digit between 0000-0009.

commit 1c4a64ca803831b44c96c75022abe5bb8713cd1a
Author: Oyvind Albrigtsen <email address hidden>
Date: Wed May 22 08:13:34 2019

    fence_scsi: detect node ID using new format, and fallback to old format
    before failing
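
The key layout touched by the node-ID fixes above can be illustrated with a short sketch. It is inferred only from the keys shown in this report (0x3abe0000, 0x3abe0001, 0x3abe0002): a four-character cluster prefix followed by a zero-padded four-digit node index. The function name `make_key`, the decimal formatting of the index, and the range check are assumptions, not the agent's actual code.

```python
def make_key(cluster_prefix: str, node_index: int) -> str:
    """Build a reservation key as <4-char prefix><4-digit node index>.

    Assumption: the layout is inferred from the keys printed by
    sg_persist in this report; how the prefix is derived from the
    cluster is deliberately left out of this sketch.
    """
    if not 0 <= node_index <= 9999:
        raise ValueError("node index must fit in four digits")
    # "{:.4}" truncates the prefix to four characters,
    # "{:04d}" zero-pads the index to four digits.
    return "{:.4}{:04d}".format(cluster_prefix, node_index)


# "3abe" is the prefix visible in this report's sg_persist output.
print(make_key("3abe", 0))
print(make_key("3abe", 2))
print(make_key("3abe", 10))
```

With this layout a node index of 10 or higher still produces an eight-character key, which is the property the "node ID is 10 or higher" commit is concerned with.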

Rafael David Tinoco (rafaeldtinoco) wrote :

# Demonstration of fence_scsi agent working in Bionic:

(k)rafaeldtinoco@clubionic01:~/.../upstream$ sudo dpkg -i ./*.deb
dpkg: warning: downgrading fence-agents from 4.4.0-2 to 4.0.25-2ubuntu1
(Reading database ... 85434 files and directories currently installed.)
Preparing to unpack .../fence-agents_4.0.25-2ubuntu1_amd64.deb ...
Unpacking fence-agents (4.0.25-2ubuntu1) over (4.4.0-2) ...
Preparing to unpack .../fence-agents_4.4.0-2_amd64.deb ...
Unpacking fence-agents (4.4.0-2) over (4.0.25-2ubuntu1) ...
Setting up fence-agents (4.4.0-2) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...

(k)rafaeldtinoco@clubionic01:~/.../upstream$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys

(k)rafaeldtinoco@clubionic01:~/.../upstream$ systemctl restart pacemaker

(k)rafaeldtinoco@clubionic02:~/.../upstream$ crm_mon -1
Stack: corosync
Current DC: clubionic03.private (version 1.1.19-1.1.19) - partition with quorum
Last updated: Tue Mar 3 21:16:04 2020
Last change: Tue Mar 3 20:25:56 2020 by root via cibadmin on clubionic01.private

3 nodes configured
1 resource configured

Online: [ clubionic01.private clubionic02.private clubionic03.private ]

Active resources:

 fence_clubionic (stonith:fence_scsi): Started clubionic01.private

(k)rafaeldtinoco@clubionic01:~/.../upstream$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x3, 3 registered reservation keys follow:
    0x3abe0000
    0x3abe0001
    0x3abe0002

(k)rafaeldtinoco@clubionic01:~/.../upstream$ sudo sg_persist -r /dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0x3abe0001
    scope: LU_SCOPE, type: Write Exclusive, registrants only
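
The reservation holder can also be extracted programmatically. The sketch below parses text in the format of the `sg_persist -r` output shown in this report; the function name `parse_reservation` and the returned dictionary layout are hypothetical.

```python
import re
from typing import Dict, Optional


def parse_reservation(output: str) -> Optional[Dict[str, str]]:
    """Return key, scope and type of the held reservation, or None.

    Assumption: the output looks like the `sg_persist -r` samples in
    this report ("Key=0x..." followed by a "scope: ..., type: ..." line).
    """
    key = re.search(r"Key=(0x[0-9a-fA-F]+)", output)
    # Anchor on "scope:" so "Peripheral device type: disk" is not matched.
    detail = re.search(r"scope:[ \t]*(\S+),[ \t]*type:[ \t]*(.+)", output)
    if key is None or detail is None:
        return None
    return {
        "key": key.group(1),
        "scope": detail.group(1),
        "type": detail.group(2).strip(),
    }


# Output copied from this report: reservation held by one registrant.
SAMPLE_HELD = """\
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0x3abe0001
    scope: LU_SCOPE, type: Write Exclusive, registrants only
"""

# Output copied from this report: no reservation held.
SAMPLE_NONE = """\
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there is NO reservation held
"""

print(parse_reservation(SAMPLE_HELD))
print(parse_reservation(SAMPLE_NONE))
```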

Rafael David Tinoco (rafaeldtinoco) wrote :

# Demonstration of fence_scsi fencing a node:

(k)rafaeldtinoco@clubionic03:~/.../upstream$ cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

# iscsi
auto eth1
iface eth1 inet static
    address 10.250.1.12/24

# private
auto eth2
iface eth2 inet static
    address 10.250.3.12/24

# public
auto eth3
iface eth3 inet static
    address 10.250.98.12/24

(k)rafaeldtinoco@clubionic03:~/.../upstream$ sudo iptables -A INPUT -i eth2 -j DROP

(k)rafaeldtinoco@clubionic01:~$ crm_mon -1
Stack: corosync
Current DC: clubionic01.private (version 1.1.19-1.1.19) - partition with quorum
Last updated: Tue Mar 3 21:24:55 2020
Last change: Tue Mar 3 20:25:56 2020 by root via cibadmin on clubionic01.private

3 nodes configured
1 resource configured

Online: [ clubionic01.private clubionic02.private ]
OFFLINE: [ clubionic03.private ]

Active resources:

 fence_clubionic (stonith:fence_scsi): Started clubionic01.private

(k)rafaeldtinoco@clubionic03:~/.../upstream$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x4, 2 registered reservation keys follow:
    0x3abe0000
    0x3abe0001

(k)rafaeldtinoco@clubionic03:~/.../upstream$ sudo dd if=/dev/zero of=/dev/sda bs=1M count=1
[ 3301.867294] print_req_error: critical nexus error, dev sda, sector 0
[ 3301.868543] Buffer I/O error on dev sda, logical block 0, lost async page write
[ 3301.869956] Buffer I/O error on dev sda, logical block 1, lost async page write
[ 3301.871430] Buffer I/O error on dev sda, logical block 2, lost async page write
[ 3301.872929] Buffer I/O error on dev sda, logical block 3, lost async page write
[ 3301.874448] Buffer I/O error on dev sda, logical block 4, lost async page write
[ 3301.875963] Buffer I/O error on dev sda, logical block 5, lost async page write
[ 3301.877486] Buffer I/O error on dev sda, logical block 6, lost async page write
[ 3301.879000] Buffer I/O error on dev sda, logical block 7, lost async page write
[ 3301.880481] Buffer I/O error on dev sda, logical block 8, lost async page write
[ 3301.882014] Buffer I/O error on dev sda, logical block 9, lost async page write
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227557 s, 46.1 MB/s
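
Comparing the registered keys before and after the fencing action shows which registration was removed. A minimal sketch using the key lists printed in this report; the function name `removed_keys` is hypothetical.

```python
from typing import Iterable, List


def removed_keys(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the keys registered before fencing but missing afterwards."""
    return sorted(set(before) - set(after))


# Keys copied from the sg_persist output before the fence ...
keys_before = ["0x3abe0000", "0x3abe0001", "0x3abe0002"]
# ... and after clubionic03 was isolated and fenced.
keys_after = ["0x3abe0000", "0x3abe0001"]

print(removed_keys(keys_before, keys_after))
```

The only key missing afterwards is 0x3abe0002, which is consistent with clubionic03 being the fenced node and with its write to /dev/sda producing I/O errors.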

Rafael David Tinoco (rafaeldtinoco) wrote :

Now I'm going to work with this package and check needed Pacemaker fixes. After that I'm going to propose both merges together.

description: updated
description: updated
no longer affects: fence-agents (Ubuntu Focal)
Changed in fence-agents (Ubuntu Bionic):
status: Confirmed → In Progress
no longer affects: pacemaker (Ubuntu)
no longer affects: pacemaker (Ubuntu Bionic)
description: updated
Christian Ehrhardt (paelzer) wrote :

Disco has been EOL since January 23rd; marking this as Won't Fix.

Changed in fence-agents (Ubuntu Disco):
status: Confirmed → Won't Fix
Rafael David Tinoco (rafaeldtinoco) wrote :

A PPA can currently be found at: https://launchpad.net/~ubuntu-server-ha/+archive/ubuntu/staging

I'm adjusting the SRU but, meanwhile, that PPA provides a working version for Ubuntu Bionic.

Rafael David Tinoco (rafaeldtinoco) wrote :
summary: - [bionic] fence_scsi not working properly with 1.1.18-2ubuntu1.1
+ [bionic] fence_scsi not working properly with Pacemaker
+ 1.1.18-2ubuntu1.1
description: updated
description: updated
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted fence-agents into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/fence-agents/4.0.25-2ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in fence-agents (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
tags: added: block-proposed-bionic
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, I have been verifying this since the day it landed in -proposed. It is working as expected (https://discourse.ubuntu.com/t/ubuntu-high-availability-corosync-pacemaker-shared-disk-environments/). I'm marking this as verification-done, as it has stayed in -proposed for some time now and no negative feedback was given by those who were asked to test it.

tags: added: verification-done verification-done-bionic
removed: block-proposed-bionic verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package fence-agents - 4.0.25-2ubuntu1.1

---------------
fence-agents (4.0.25-2ubuntu1.1) bionic; urgency=medium

  * Fix fence_scsi agent arguments and node detection (LP: #1865523):
    + d/p/lp1865523-01-fencing-add-consistency-cmdline-stdin.patch AND
    + d/p/lp1865523-02-fix-for-ignored-options.patch
      - "fenced" sends commands through stdin
      - makes sure cmdline and stdin accept same args
      - normalize agents argument names (for pacemaker to work)
    + d/p/lp1865523-03-Maintain-ABI-compatibility.patch:
      - deal with old and new arguments from stdin
    + d/p/lp1865523-04-fence_scsi-Remove-period.patch:
      - minor string correction fixes fencing behavior
    + d/p/lp1865523-05-fence_scsi-fix-python3-encoding.patch:
      - encodes before calculating hash
    + d/p/lp1865523-06-fence_scsi-fixes-around-node-id.patch
      - use "plug" argument instead of "nodename" (external callers need)
      - keep previous "nodename" if no "plug" is given
      - issue calculating SCSI key on > 10 nodes
      - detect node ID using new format (plug + nodename)
    + d/p/lp1865523-07-fence-metadata-update.xml:
      - updates all agents XML definitions for build tests

 -- Rafael David Tinoco <email address hidden> Mon, 16 Mar 2020 18:55:29 +0000

Changed in fence-agents (Ubuntu Bionic):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for fence-agents has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
