[OSSA-2023-003] Unauthorized volume access through deleted volume attachments (CVE-2023-2088)

Bug #2004555 reported by Jan Wasilewski
Affects                       Status        Importance   Assigned to
Cinder                        Fix Released  Undecided    Unassigned
OpenStack Compute (nova)      Fix Released  Undecided    Unassigned
  Antelope                    Fix Released  Undecided    Unassigned
  Wallaby                     Fix Released  Undecided    Unassigned
  Xena                        Fix Released  Undecided    Unassigned
  Yoga                        Fix Released  Undecided    Unassigned
  Zed                         Fix Released  Undecided    Unassigned
OpenStack Security Advisory   Fix Released  High         Jeremy Stanley
OpenStack Security Notes      Fix Released  High         Jeremy Stanley
glance_store                  Fix Released  Undecided    Unassigned
kolla-ansible                 Fix Released  Undecided    Unassigned
  Zed                         Fix Released  Undecided    Unassigned
os-brick                      In Progress   Undecided    Unassigned

Bug Description

Hello OpenStack Security Team,

I'm writing to you because we faced a serious security breach in OpenStack functionality (related in part to libvirt, iSCSI, and the Huawei driver). I went through the OSSA documents and related libvirt notes, but I couldn't find anything similar. It is not related to https://security.openstack.org/ossa/OSSA-2020-006.html

In short: we observed that a newly created Cinder volume (1GB in size) was attached to an instance on a compute node, but the instance saw it as a 115GB volume, and that 115GB volume was in fact connected to another instance on the same compute node.

[1. Test environment]
Compute node: OpenStack Ussuri configured with Huawei Dorado as the storage backend (the driver configuration is documented here: https://docs.openstack.org/cinder/rocky/configuration/block-storage/drivers/huawei-storage-driver.html)
Packages:
# dpkg -l | grep libvirt
ii libvirt-clients 6.0.0-0ubuntu8.16 amd64 Programs for the libvirt library
ii libvirt-daemon 6.0.0-0ubuntu8.16 amd64 Virtualization daemon
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.16 amd64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-driver-storage-rbd 6.0.0-0ubuntu8.16 amd64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 6.0.0-0ubuntu8.16 amd64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 6.0.0-0ubuntu8.16 amd64 Libvirt daemon configuration files (systemd)
ii libvirt0:amd64 6.0.0-0ubuntu8.16 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-libvirt 6.1.0-1 amd64 libvirt Python 3 bindings

# dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20190109.133f4c4-0ubuntu3.2 all PXE boot firmware - ROM images for qemu
ii ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu4 all PXE boot firmware - Compat EFI ROM images for qemu
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.16 amd64 Virtualization daemon QEMU connection driver
ii qemu 1:4.2-3ubuntu6.23 amd64 fast processor emulator, dummy package
ii qemu-block-extra:amd64 1:4.2-3ubuntu6.23 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:4.2-3ubuntu6.23 amd64 QEMU Full virtualization on x86 hardware
ii qemu-system-common 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-data 1:4.2-3ubuntu6.23 all QEMU full system emulation (data files)
ii qemu-system-gui:amd64 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (user interface and audio support)
ii qemu-system-x86 1:4.2-3ubuntu6.23 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:4.2-3ubuntu6.23 amd64 QEMU utilities

# dpkg -l | grep nova
ii nova-common 2:21.2.4-0ubuntu1 all OpenStack Compute - common files
ii nova-compute 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.2.4-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.2.4-0ubuntu1 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.0.0-0ubuntu1 all client library for OpenStack Compute API - 3.x

# dpkg -l | grep multipath
ii multipath-tools 0.8.3-1ubuntu2 amd64 maintain multipath block device access

# dpkg -l | grep iscsi
ii libiscsi7:amd64 1.18.0-2 amd64 iSCSI client shared library
ii open-iscsi 2.0.874-7.1ubuntu6.2 amd64 iSCSI initiator tools

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"

Instance OS: Debian-11-amd64

[2. Test scenario]
An instance already exists with two volumes attached: the first, 10GB, for the root filesystem; the second, 115GB, used as vdb. The compute node maps them as vda -> dm-11 and vdb -> dm-9:

# virsh domblklist 90fas439-fc0e-4e22-8d0b-6f2a18eee5c1
 Target Source
----------------------
 vda /dev/dm-11
 vdb /dev/dm-9

# multipath -ll
(...)
36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:4 sdl 8:176 active ready running
  `- 17:0:0:4 sdn 8:208 active ready running
(...)
36e00084100ee7e7ed6acaa2900002f6a dm-11 HUAWEI,XSG1
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:3 sdq 65:0 active ready running
  |- 15:0:0:3 sdr 65:16 active ready running
  |- 16:0:0:3 sdp 8:240 active ready running
  `- 17:0:0:3 sds 65:32 active ready running

We create a new instance with the same guest OS and a 10GB root volume. After successful deployment, we create a new 1GB volume and attach it to the newly created instance. After that we can see:

# multipath -ll
(...)
36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:10 sdao 66:128 failed faulty running
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:10 sdap 66:144 failed faulty running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:10 sdan 66:112 failed faulty running
  |- 16:0:0:4 sdl 8:176 active ready running
  |- 17:0:0:10 sdaq 66:160 failed faulty running
  `- 17:0:0:4 sdn 8:208 active ready running

Inside the instance we could then see the new drive - not 1GB, but 115GB - so it was incorrectly attached, and through it we were able to destroy data on that volume.

Additionally, we saw many errors like the following in the compute node logs:

# dmesg -T | grep dm-9
[Fri Jan 27 13:37:42 2023] blk_update_request: critical target error, dev dm-9, sector 62918760 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
[Fri Jan 27 13:37:42 2023] blk_update_request: critical target error, dev dm-9, sector 33625152 op 0x1:(WRITE) flags 0x8800 phys_seg 6 prio class 0
[Fri Jan 27 13:37:46 2023] blk_update_request: critical target error, dev dm-9, sector 66663000 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[Fri Jan 27 13:37:46 2023] blk_update_request: critical target error, dev dm-9, sector 66598120 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[Fri Jan 27 13:37:51 2023] blk_update_request: critical target error, dev dm-9, sector 66638680 op 0x1:(WRITE) flags 0x8800 phys_seg 12 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66614344 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66469296 op 0x1:(WRITE) flags 0x8800 phys_seg 24 prio class 0
[Fri Jan 27 13:37:56 2023] blk_update_request: critical target error, dev dm-9, sector 66586472 op 0x1:(WRITE) flags 0x8800 phys_seg 3 prio class 0
(...)

Unfortunately we do not know a reliable reproduction scenario, as we hit the issue in fewer than 2% of our attempts, but it looks like a serious security breach.

Additionally, we observed that the Linux kernel does not fully clear device allocations after a volume detach, so some drive names remain visible in output such as the lsblk command. Then, after a new volume attachment, those names (e.g. sdao, sdap, sdan, and so on) are reused by the new drive and wrongly mapped by multipath/iSCSI to another drive, and this is how we hit the issue.
Our question is: why does the compute node's Linux kernel not remove the device allocations, leading to a scenario like this? Maybe that is where a solution lies.
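
For illustration, the leftover state can be inspected by hand on the compute node. A hedged sketch (sdao is an example stale path taken from the listings above, not a fixed name):

    # list SCSI block devices the kernel still knows about; stale entries
    # survive even when the array no longer exports the LUN
    lsblk
    ls /sys/class/scsi_device/

    # show multipath maps and their member paths; "failed faulty" members
    # that outlive a detach are the suspicious ones
    multipath -ll

    # re-read the WWID a path currently reports and compare it with the
    # map it sits in (example stale path: sdao)
    /lib/udev/scsi_id --page=0x83 --whitelisted --device=/dev/sdao

    # SCSI devices are never removed automatically by the kernel; a stale
    # path has to be deleted explicitly
    echo 1 > /sys/block/sdao/device/delete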

Thanks in advance for your help and understanding. If more details are needed, do not hesitate to contact me.

CVE References

CVE-2023-2088
Revision history for this message
Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.

description: updated
Changed in ossa:
status: New → Incomplete
Revision history for this message
Dan Smith (danms) wrote :

I feel like this is almost certainly something that will require involvement from the cinder people. Nova's part in the volume attachment is pretty minimal, in that we get stuff from cinder, pass it to brick, and then configure the guest with the block device we're told (AFAIK). Unless we're messing up the last step, I think it's likely this is not just a Nova thing. Should we add cinder or brick as an affected project or just add some cinder people to the bug here?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

> Should we add cinder or brick as an affected project or just add some cinder people to the bug here?

I'd be in favor of adding the cinder project which would pull the cinder coresec team, right?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

In the meantime, could you please provide us the block device mapping information that's stored in the DB, and ideally the cinder-side attachment information?

Putting the bug report to Incomplete, please mark its status back to New when you reply.

Changed in nova:
status: New → Incomplete
Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

Hi,

below you can find the requested information from the OpenStack DB. There is no issue right now, but maybe historical tracking could lead to some hint? Anyway, the issue was related to the /dev/vdb drive of instance 128f1398-a7c5-48f8-8bbc-a132e3e2d556 -> in the DB output you can observe that the volume size is 15GB, while the instance reported it as 115GB (i.e. the vdb of the second instance shown in this output)

mysql> select * from block_device_mapping where instance_uuid = '90fda439-fc0e-4e22-8d0b-6f2a18eeb9c1';
+---------------------+---------------------+------------+--------+-------------+-----------------------+--------------------------------------+--------------------------------------+-------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+----------+------+--------------------------------------+--------------------------------------+-------------+
| created_at | updated_at | deleted_at | id | device_name | delete_on_termination | snapshot_id | volume_id | volume_size | no_device | connection_info | instance_uuid | deleted | source_type | destination_type | guest_format | device_type | disk_bus | boot_index | image_id | ta...

Changed in nova:
status: Incomplete → New
Revision history for this message
Jeremy Stanley (fungi) wrote :

I've added Cinder as an affected project (though maybe it should be os-brick?) and subscribed the Cinder security reviewers for additional input.

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Based on the given information, the strange part is that the same multipath device is used for both the old and the new volume: 36e00084100ee7e7ed6ad25d900002f6b

36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:4 sdl 8:176 active ready running
  `- 17:0:0:4 sdn 8:208 active ready running

36e00084100ee7e7ed6ad25d900002f6b dm-9 HUAWEI,XSG1
size=115G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 14:0:0:10 sdao 66:128 failed faulty running
  |- 14:0:0:4 sdm 8:192 active ready running
  |- 15:0:0:10 sdap 66:144 failed faulty running
  |- 15:0:0:4 sdo 8:224 active ready running
  |- 16:0:0:10 sdan 66:112 failed faulty running
  |- 16:0:0:4 sdl 8:176 active ready running
  |- 17:0:0:10 sdaq 66:160 failed faulty running
  `- 17:0:0:4 sdn 8:208 active ready running

Also, it's interesting to note that the paths under the multipath device (sdm, sdo, sdl, sdn) with LUN ID 4 are also used by the second multipath device, whereas it should use the LUN 10 paths (which are currently in failed faulty status).

This looks multipath related, but it would be helpful to get the os-brick logs for this 1GB volume attachment to understand whether os-brick is doing something that results in this.

I would also recommend cleaning up any leftover devices from past failed detachments (i.e. flush and remove mpath devices not belonging to any instance) that might be interfering with this. Although I'm not certain that's the case, it's still worth cleaning up those devices.
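
A minimal sketch of that cleanup, assuming the stale map's WWID and member paths have already been identified from multipath -ll (the WWID and device names below are examples taken from this report):

    # flush and remove the stale multipath map
    multipath -f 36e00084100ee7e7ed6ad25d900002f6b

    # then flush and delete each leftover member path so the kernel and
    # multipathd forget them completely
    for dev in sdao sdap sdan sdaq; do
        blockdev --flushbufs /dev/$dev
        echo 1 > /sys/block/$dev/device/delete
    done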

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi,

I think I know what happened, but there are some things that don't match unless
somebody has manually changed some things in the host (like cleaning up
multipaths).

Bit of context:

- SCSI volumes (iSCSI and FC) on Linux are NEVER removed automatically by the
  kernel and must always be removed explicitly. This means that they will
  remain in the system even if the remote connection is severed, unless
  something in OpenStack removes it.

- The os-brick library has a strong policy of not removing devices from the
  system if flushing fails during detach, to prevent data loss.

  The `disconnect_volume` method in the os-brick library has an additional
  parameter called `force` that allows callers to ignore flushing errors and
  ensure that the devices are removed. This is useful when, after a failed
  detach, the volume is either going to be deleted or moved into error status.

I don't have the logs, but from what you said my guess is that this is what has
happened:

- Volume with SCSI ID 36e00084100ee7e7ed6ad25d900002f6b was attached to that
  host on LUN 10 at some point since the last reboot (sdao, sdap, sdan, sdaq).

- When detaching the volume from the host using os-brick, the operation failed
  and the devices weren't removed, yet Nova still called Cinder to unexport and
  unmap the volume. At this point LUN 10 is free on the Huawei array and the
  volume is no longer attachable, but /dev/sda[o-q] are still present, and
  their SCSI IDs are still known to multipathd.

- Nova asked Cinder to attach the volume again, and the volume is mapped to LUN
  4 (which must have been available as well) and it successfully attaches (sdm,
  sdo, sdl, sdn), appears as a multipath, and is used by the VM.

- Nova asks Cinder to export and map the new 1GB volume, and Huawei maps it to
  LUN 10. At this point iSCSI detects that the remote LUNs are back and
  reconnects to them, which makes the multipathd path checker detect that sdao,
  sdap, sdan, and sdaq are alive on the compute host, and they are added to the
  existing multipath device mapper using their known SCSI ID.

You should find out why the detach actually failed, but I think I see multiple
issues:

- Nova:

  - Should not call Cinder to unmap a volume if the os-brick call to disconnect
    the volume has failed, as we know this will leave leftover devices that can
    cause issues like this.

  - If it's not already doing it, Nova should call the disconnect_volume method
    from os-brick passing force=True when the volume is going to be deleted.

- os-brick:

  - Should try to detect when the newly added devices are being added to a
    multipath device mapper that has live paths to other LUNs and fail if that
    is the case.

  - As an improvement over the previous check, os-brick could forcefully remove
    those devices that are in the wrong device mapper, force a refresh of their
    SCSI IDs, and add them back to multipathd to form a new device mapper.
    Personally, though, I think this is a non-trivial and potentially
    problematic feature.

In other words, the source of the problem is probably Nova, but os-brick should
try to prevent these possible data leaks.

Cheers,
Gorka.

[1]: https://github.com/opens...
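
To make the Nova-side suggestions concrete, here is a minimal sketch of the proposed calling pattern (hypothetical helper and variable names, not actual Nova code; os-brick connectors do expose disconnect_volume() with force and ignore_errors parameters):

    # Sketch only: illustrates the two rules proposed above.
    def detach_and_unmap(connector, cinder, connection_info, device_info,
                         attachment_id, volume_will_be_deleted):
        # Rule 2: if the volume is about to be deleted anyway, force the
        # local disconnect so no leftover devices survive a flush failure.
        connector.disconnect_volume(connection_info, device_info,
                                    force=volume_will_be_deleted,
                                    ignore_errors=volume_will_be_deleted)
        # Rule 1: only ask Cinder to unexport/unmap once the local
        # disconnect has succeeded. If disconnect_volume raised, the
        # exception propagates and the LUN is never freed for reuse while
        # stale devices remain on the host. (attachment_delete here is a
        # hypothetical wrapper around the Cinder attachments API.)
        cinder.attachment_delete(attachment_id)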


Revision history for this message
Dan Smith (danms) wrote :

I don't see in the test scenario description that any instances had to be deleted or volumes disconnected for this to happen. Maybe the reporter can confirm with logs if this is the case?

I'm still chasing down the nova calls, but we don't ignore anything in the actual disconnect other than "volume not found". I need to follow that up to where we call cinder to see if we're ignoring a failure.

When you say "nova should call disconnect_volume with force=true if the volume is going to be deleted", I'm not sure what you mean by this. Do you mean if we're disconnecting because of *instance* delete and are sure that we don't want to let a failure hold us up? I would think this would be dangerous, because just deleting an instance doesn't mean you don't care about the data in the volume.

It seems to me that if brick *has* the information available to it to avoid connecting a volume to the wrong location, that it's the thing that needs to guard against this. Nova has no knowledge of the things underneath brick, so we don't know that wires are going to get crossed. Obviously if we can do stuff to avoid even getting there, then we should.

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

Hi,

I'm wondering whether I should try to reproduce the issue again with all debug flags enabled. Should I enable debugging on the controllers (cinder, nova), or would compute node logs (with debug enabled) be enough for further troubleshooting? If so, please let me know which flags are needed, just to speed up further troubleshooting. As I said, this case is not easy to reproduce - I can't even say what the trigger is - but we have faced it 3 or 4 times already.

Thanks in advance for your reply and your help so far.

Best regards,
Jan

Revision history for this message
Gorka Eguileor (gorka) wrote :

Apologies if I wasn't clear enough.

The disconnect call that I say is probably being ignored/swallowed is the one to os-brick, not Cinder. In other words, Nova first calls os-brick to disconnect the volume from the compute host and then always considers this successful (at least in some scenarios, probably instance destruction). Since in those scenarios it always considers the local disconnect successful, it calls Cinder to unmap/unexport the volume.

The force=True parameter to os-brick's disconnect_volume should only be added when the BDM for the volume has the delete_on_termination flag set.

os-brick has the information; the problem is that multipathd is the one adding the reused leftover devices to the multipath device mapper.

Revision history for this message
Gorka Eguileor (gorka) wrote :

A solution/workaround would be to change /etc/multipath.conf and set "recheck_wwid" to yes.

I haven't actually tested it myself, but the documentation explicitly calls out that it's used to solve this specific issue: "If set to yes, when a failed path is restored, the multipathd daemon rechecks the path WWID. If there is a change in the WWID, the path is removed from the current multipath device, and added again as a new path. The multipathd daemon also checks the path WWID again if it is manually re-added."

I believe this is probably something that is best fixed at the deployment tool level. For example, extending the multipathing THT template code [1] to support "recheck_wwid" and defaulting it to yes instead of no, as multipath.conf does.

[1]: https://opendev.org/openstack/tripleo-heat-templates/commit/906d03ea19a4446ed198c321f68791b7fa6e0c47
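
For reference, the workaround amounts to a one-line addition to the defaults section of /etc/multipath.conf (assuming a multipath-tools version that knows the option), then reloading the daemon's configuration, e.g. with `multipathd reconfigure`:

    defaults {
        # when a failed path is restored, re-check its WWID; if it has
        # changed, remove the path from the current map and re-add it
        # as a new path
        recheck_wwid yes
    }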

Revision history for this message
Dan Smith (danms) wrote :

Okay, thanks for the clarification.

Yeah, recheck_wwid seems like it should *always* be on to prevent potentially reconnecting to the wrong thing!

Revision history for this message
Jeremy Stanley (fungi) wrote :

If that configuration ends up being the recommended solution, we might want to consider drafting a brief security note with guidance for deployers and maintainers of deployment tooling.

Unless I misunderstand the conditions necessary, it sounds like it would be challenging for a malicious user to force this problem to occur. Is that the current thinking? If so, we could probably safely work on the actual text of the note in public.

Revision history for this message
melanie witt (melwitt) wrote :

> The disconnect call I say it's probably being ignored/swallowed is the one to os-brick, not Cinder. In other words, Nova first calls os-brick to disconnect the volume from the compute host and then always considers this as successful (at least in some scenarios, probably instance destruction). Since it always considers in those scenarios that local disconnect was successful it calls Cinder to unmap/unexport the volume.

I just checked and indeed Nova will ignore a volume disconnect error in the case of an instance being deleted [1]:

    try:
        self._disconnect_volume(context, connection_info, instance)
    except Exception as exc:
        with excutils.save_and_reraise_exception() as ctxt:
            if cleanup_instance_disks:
                # Don't block on Volume errors if we're trying to
                # delete the instance as we may be partially created
                # or deleted
                ctxt.reraise = False
                LOG.warning(
                    "Ignoring Volume Error on vol %(vol_id)s "
                    "during delete %(exc)s",
                    {'vol_id': vol.get('volume_id'),
                     'exc': encodeutils.exception_to_unicode(exc)},
                    instance=instance)

In all other scenarios, Nova will not proceed further if the disconnect was not successful.

If Nova does proceed past _disconnect_volume(), it will later call Cinder API to delete the attachment [2]. I assume that is what does the unmap/unexport.

[1] https://github.com/openstack/nova/blob/1bf98f128710c374a0141720a7ccc21f5d1afae0/nova/virt/libvirt/driver.py#L1445-L1459 (ussuri)
[2] https://github.com/openstack/nova/blob/1bf98f128710c374a0141720a7ccc21f5d1afae0/nova/compute/manager.py#L2922 (ussuri)

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

I believe it can be a bit challenging for Ubuntu users to introduce the recheck_wwid parameter. From what I checked, the parameter is supported by multipath-tools, but the package version that provides it ships with Ubuntu 22.04 LTS. Older Ubuntu releases do not support it and give an error:
/etc/multipath.conf line XX, invalid keyword: recheck_wwid

I based this on the release documentation:
- for ubuntu 20.04: https://manpages.ubuntu.com/manpages/focal/en/man5/multipath.conf.5.html
- for ubuntu 22.04: https://manpages.ubuntu.com/manpages/jammy/en/man5/multipath.conf.5.html

So it seems that Yoga partially, and Zed fully, can take this parameter directly, while older OpenStack releases would have to handle the change differently.

I know that OpenStack code is independent of Linux distros, but I wanted to add this info here as worth considering.
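
One quick way to check whether the installed multipath-tools understands the option, without editing any files (a sketch; `multipath -t` dumps the compiled-in configuration, so no output means the option is not supported by that build):

    multipath -t | grep recheck_wwid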

Revision history for this message
Gorka Eguileor (gorka) wrote :

I don't know if my assumption is correct or not, because I can't reproduce the multipath device mapper situation from the report (some paths failed, some active) no matter how much I force things to break in different ways.

Since each iSCSI storage backend behaves differently, I don't know whether I can't reproduce it because of the difference in behavior or because the way I'm trying to reproduce it is different. It may even be that multipathd is different on my system.

Unfortunately I don't know if the host where that happened had leftover devices before the leak happened, or what the SCSI IDs of the 2 volumes involved really are.

From os-brick's connect_volume perspective what it did is the right thing, because when it looked at the multipath device containing the newly connected devices it was dm-9, so that's the one that it should return.

How multipath ended up with 2 different volumes in the same device mapper, I don't know.

I don't think "recheck_wwid" would solve the issue because os-brick would be too fast in finding the multipath and it wouldn't give enough time for multipathd to activate the paths and form a new device mapper.

In any case I strongly believe that Nova should never proceed to delete the Cinder attachment if detaching with os-brick fails, because that usually implies data loss.

The exception would be when the Cinder volume is going to be deleted after disconnecting it; in that case the disconnect call to os-brick should always be forced, since data loss is irrelevant.

That would ensure that compute nodes are not left with leftover devices that could cause problems.

I'll see if I can find a reasonable improvement in os-brick that would detect these issues and fail the connection, although it's probably going to be a bit of a mess.

Revision history for this message
Jan Wasilewski (janwasilewski) wrote :

@Gorka Eguileor: I can try to reproduce this case with the recheck_wwid option set to yes once a multipath-tools package that supports it is available for Ubuntu 20.04.

What I can add is that it happened only on one compute node, but I've seen similar warnings in the dmesg -T output of other compute nodes, which looks dangerous, though so far I haven't faced a similar issue there:

[Thu Feb 9 14:28:16 2023] scsi_io_completion: 42 callbacks suppressed
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 CDB: Read(10) 28 00 03 bf ff 00 00 00 08 00
[Thu Feb 9 14:28:16 2023] print_req_error: 42 callbacks suppressed
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914304
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#2 CDB: Read(10) 28 00 03 bf ff 00 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914304
[Thu Feb 9 14:28:16 2023] buffer_io_error: 30 callbacks suppressed
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686976, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#3 CDB: Read(10) 28 00 03 bf ff 01 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914305
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686977, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#4 CDB: Read(10) 28 00 03 bf ff 02 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req_error: I/O error, dev sdgr, sector 62914306
[Thu Feb 9 14:28:16 2023] Buffer I/O error on dev sdgr1, logical block 62686978, async page read
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 Sense Key : Illegal Request [current]
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 Add. Sense: Logical unit not supported
[Thu Feb 9 14:28:16 2023] sd 15:0:0:98: [sdgr] tag#5 CDB: Read(10) 28 00 03 bf ff 03 00 00 01 00
[Thu Feb 9 14:28:16 2023] print_req...


Revision history for this message
Gorka Eguileor (gorka) wrote :

Don't bother trying with recheck_wwid, as it won't work due to the speed of os-brick.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I have finally been able to reproduce the issue.

So far I have been able to identify 3 different ways to create similar situations to the reported one, and it was what I thought, leftover devices from a 'nova delete' call.

Took me longer to figure it out because it requires an iSCSI Cinder driver that uses shared targets, and the one I use doesn't.

After I locally modified the cinder driver code to do target sharing and then force a disconnect error on specific Nova calls to os-brick I was able to work it out.

I have a local patch that detects these issues and fixes them the best it can, but I wouldn't like to backport that, because the fixing is a bit scary as a backport.

So I'll split the code into 2 patches:

- The backportable patch, which detects a potential leak and prevents the connection. To fix this, manual intervention will be necessary.

- Another patch that extends the previous code to try to fix things when possible.

Revision history for this message
melanie witt (melwitt) wrote :

> In any case I strongly believe that nova should never proceed to delete the cinder attachment if detaching with os-brick fails because that usually implies data loss.

> The exception would be when the cinder volume is going to be delete after disconnecting it, and in that case the disconnect call to os-brick should be always forced, since data loss is irrelevant.

> That would ensure that compute nodes are not left with leftover devices that could cause problems.

Understood. I guess that must mean that the reported bug scenario is a volume that is *not* delete_on_termination=True attached to an instance that is being deleted.

I think we could probably propose a patch in nova to not delete the attachment if it's instance delete + not delete_on_termination.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Melanie,

In my opinion there should be 2 code changes to prevent leaving devices behind:

- The instance deletion operation should fail, like a normal volume-detach call does, when the disconnect_volume call fails. Even if the instance is left in a "weird" state, manual intervention is usually necessary to fix things.
  This manual intervention does not necessarily mean doing something to the volume; it can be fixing the network.

- Any Cinder volume with delete_on_termination=True should have the os-brick disconnect_volume call made with the "force=True, ignore_errors=True" parameters (see the sketch below).
  The tricky part here is that not all os-brick connectors support the force parameter, so when the call fails we have to decide whether to halt the operation and wait for human intervention, or just log it and continue as we do today.
  We could make an effort in os-brick to increase coverage of the force parameter.
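
A sketch of how a caller could handle that second point, given that not every connector accepts force (plain Python introspection; force_disconnect is a hypothetical helper, not an existing Nova or os-brick function):

    import inspect

    def force_disconnect(connector, connection_info, device_info):
        # Not every os-brick connector's disconnect_volume accepts 'force',
        # so check the signature before passing it.
        params = inspect.signature(connector.disconnect_volume).parameters
        if 'force' in params:
            connector.disconnect_volume(connection_info, device_info,
                                        force=True, ignore_errors=True)
        else:
            # No force support: either halt for manual intervention or
            # log and continue, as discussed above.
            connector.disconnect_volume(connection_info, device_info)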

Thanks,
Gorka.

Revision history for this message
Dan Smith (danms) wrote :

Our policy is that instance delete should never fail, and I think that's the experience the users expect. Perhaps we need to still mark the instance deleted immediately and continue retrying the volume detach in a periodic until it succeeds, but that's the only thing I can see working.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Agree with Dan, we shouldn't raise an exception on instance delete but rather possibly make some status available for knowing whether the volume was eventually detached.

For example, we accept deleting an instance if the compute goes down (as the user may not know that the underlying compute is in a bad state), and we only actually delete the instance when the compute is back.

That being said, I don't really see how we can easily fix this in a patch, as we should discuss it properly. Would a LOG statement warning that the volume connection is still present help?

Revision history for this message
melanie witt (melwitt) wrote :

We definitely should not allow a delete to fail from a user's perspective.

My suggestion of a patch to not delete an attachment when detach fails during instance delete if delete_on_termination=False is intended to be better than what we have today, not necessarily to be perfect.

We could consider doing a periodic like Dan mentions. We already do something similar with our "cleanup running deleted instances" periodic. The volume attachment cleanup could be hooked into that if it doesn't already do it.

From what I can tell, our periodic is already capable of taking care of it, but it's not enabled [1][2]:

    elif action == 'reap':
        LOG.info("Destroying instance with name label "
                 "'%s' which is marked as "
                 "DELETED but still present on host.",
                 instance.name, instance=instance)
        bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
            context, instance.uuid, use_slave=True)
        self.instance_events.clear_events_for_instance(instance)
        try:
            self._shutdown_instance(context, instance, bdms,
                                    notify=False)
            self._cleanup_volumes(context, instance, bdms,
                                  detach=False)

    def _cleanup_volumes(self, context, instance, bdms, raise_exc=True,
                         detach=True):
        original_exception = None
        for bdm in bdms:
            if detach and bdm.volume_id:
                try:
                    LOG.debug("Detaching volume: %s", bdm.volume_id,
                              instance_uuid=instance.uuid)
                    destroy = bdm.delete_on_termination
                    self._detach_volume(context, bdm, instance,
                                        destroy_bdm=destroy)
                except Exception as exc:
                    original_exception = exc
                    LOG.warning('Failed to detach volume: %(volume_id)s '
                                'due to %(exc)s',
                                {'volume_id': bdm.volume_id, 'exc': exc})

            if bdm.volume_id and bdm.delete_on_termination:
                try:
                    LOG.debug("Deleting volume: %s", bdm.volume_id,
                              instance_uuid=instance.uuid)
                    self.volume_api.delete(context, bdm.volume_id)
                except Exception as exc:
                    original_exception = exc
                    LOG.warning('Failed to delete volume: %(volume_id)s '
                                'due to %(exc)s',
                                {'volume_id': bdm.volume_id, 'exc': exc})
        if original_exception is not None and raise_exc:
            raise original_exception

Currently we're calling _cleanup_volumes with detach=False. I'm not sure what the reason for that is, but if we determine there would be no problems with it, we can change it to detach=True in combination with not deleting the attachment on instance delete if delete_on_termination=False (sketch below).

[1] https://github.com/openstack/nova/blob/a2964417822bd1a4a83fa5c27282d2be1e18868a/nova/compute/manager.py#L10579
[2] https://github.com/openstack/nova/blob/a2964417822bd1a4a83f...
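
For clarity, the change suggested above would amount to something like this in the reap path (a sketch against the quoted ussuri-era code, not a reviewed patch):

    # sketch: let the periodic actually detach leftover volumes instead
    # of only deleting delete_on_termination ones
    self._shutdown_instance(context, instance, bdms, notify=False)
    self._cleanup_volumes(context, instance, bdms, detach=True)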


Revision history for this message
Gorka Eguileor (gorka) wrote :

What is the reason why Nova has the policy that deleting the instance should never fail?

I'm talking about the instance record, not the VM itself, because I agree that the VM should always be deleted to free resources.

From my perspective deleting the instance record would result in a very weird user experience and in users manually creating the same situation we are trying to avoid.

- User requests instance deletion
- Calls to disconnect_volume fails
- Nova removes everything it can and at the end even the instance record, while it keeps trying to disconnect the device in the background.
- User wants to use the volume again but sees that it's in-use in Cinder
- Looks for the instance in Nova, thinking something may have gone wrong, but not seeing it there, concludes it's a problem between Cinder and Nova.
- Runs the `cinder delete-attachment` command to return the volume to available state.

We end up in the same situation as we were before, with leftover devices.

Revision history for this message
Dan Smith (danms) wrote :

Because the user wants to delete a thing in our supposed "elastic infrastructure". They want their quota back, they want to stop being billed for it, they want the IP for use somewhere else, or whatever. They don't care that we can't delete it because of some backend failure - that's not their problem. That's why we have the ability to queue the delete even if the compute is down - that's how important it is.

It's also not at all about deleting the VM, it's about the instance going away from the perspective of the user (i.e. marking the instance record as deleted). The instance record is what determines if they're billed for it, if their quota is used, etc. We "charge" the user the same whether the VM is running or not. Further, even if we have stopped the VM, we cannot re-assign the resources committed to that VM until the deletion completes in the backend. Another scenario that infuriates operators is "I've deleted a thing, the compute node should be clear, but the scheduler tells me I can't boot something else there."

Your example workflow is exactly why I feel like the solution to this problem can't (entirely) be one of preventing a delete if we fail to detach. Because the admins will just force-delete/detach/reset-state/whatever until things free up (as I would expect to do myself). Especially if the user is demanding that they get their quota back, stop being billed, and/or attach the volume somewhere else.

It seems to me that there *must* be some way to ensure that we never attach a volume to the wrong place. Regardless of how we get there, there must be some positive affirmation that we're handing precious volume data to the right person.

Revision history for this message
Gorka Eguileor (gorka) wrote :

The quota/billing issue is a matter of Nova code. In cinder we resolve it by having a flag for resources (volume and snapshots) to reflect whether they consume quota or not.

The same thing could be done in Nova to reflect what resources are actually consumed by the instance (IPs, VMs, GPUs, etc) and therefore billable.

Users not caring about backend errors would be, in my opinion, naive thinking on their part, since they DO CARE about their persistent data being properly written and they want to avoid data loss, data corruption, and data leakage above all else.

I assume users would also want to have a consistent view of their resources, so if a volume says it's attached to an instance the instance should still exist, otherwise there is an invalid reference.

Data leak/corruption may be prevented in some cases with the code I'm working on for os-brick (although some drivers are missing the feature required), but that won't prevent data loss. For that Nova would need to do the sensible thing.

I'm going to do some additional testing today, because this report is about something that happens accidentally, but I believe there is a way to actually exploit this to gain access to other users' data. Though fixing that would require yet another bunch of code.

In other words, there are 3 different things to fix here:

- Nova doing the right thing to prevent data corruption/leak/loss.
- os-brick detection of the right volume to prevent data leak.
- Prevent intentional data leak.

Revision history for this message
Jeremy Stanley (fungi) wrote :

If there is indeed a way for a normal user (not an operator) of the environment to cause this information leak to happen and then take advantage of it, we should find a way to prevent at least that aspect before making this report public.

If it's not a condition that a normal user can intentionally cause to happen, then it's probably fine to fix this in public instead.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, Nova doesn't even really know about the Cinder backends; it just uses os-brick.

So, when Nova asks to attach a volume, only os-brick knows whether it's the right volume. That's why I think it's important to have brick to be able to say 'no'.

Revision history for this message
Dan Smith (danms) wrote :

Right, we have to trust os-brick to give us a block device that is actually the thing we're supposed to attach to the guest.

I'm really concerned about what sounds like a very loose association between what we pass to brick from cinder and what we get back from brick in terms of a block device. Isn't there some way for brick to walk the multipath device and the backing iSCSI/FC devices to check WWNs or something to ensure that it's consistent and points to what we expect?

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

> If there is indeed a way for a normal user (not an operator) of the environment to cause this information leak to happen and then take advantage of it, we should find a way to prevent at least that aspect before making this report public.

Well, I'm trying hard to find a possible attack vector from a malicious user and I don't see any.
I don't disagree with the bug report, as this can potentially leak data to any instance, but I don't know how someone could take advantage of this information.

Here, I'm just one voice and I leave others to chime in, but I'm in favor of making this report public so we can discuss the potential solutions with the stakeholders and any operator having concerns about it.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Let me summarize things:

1. The source of the problem reported in this bug is that Nova has been doing something wrong since forever. I've been bringing this up for the past 7 years, and every single time we end up in the same place, nova giving priority to instance deletion over everything else.

2. There are some things that os-brick can do to try to detect when Nova doesn't do its job right, but this is equivalent to a taxi driver asking passengers to learn to fall because the car is not going to stop when they want to get off. It's a lot harder to do and it doesn't sound all that reasonable.

3. There is an attack vector that can be exploited and it's pretty easy to do (I've done it locally), but it's separate from the issue reported here and hasn't existed for as long as that one. I would resolve this in a different way than the workaround mentioned in #2.

Seeing as we are back to the same conversation of the past 7 years, we'll probably end up in the same place, so I'll just do my best to resolve the attack vector and also introduce code to resolve Nova's mistakes.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Oh, I failed to clarify something. The user exploit case can be made secure (as far as I can tell), but for the scenario in this bug's description the only secure solution is fixing Nova; the os-brick code I'm working on will only reduce the window where the data is leaked or can be corrupted.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, I don't want to debate on projects's responsibility, but I'd rather focus on the data leakage, which is the subject of this security report.

The fact that a volume detach can leave residue if a flush error occurs is certainly not ideal, but this isn't a security problem *UNTIL* the remaining devices are reused.
To me, it appears that the data leak occurs on the attach and not on the detach, and I'd prefer to see os-brick avoid this situation.

That being said, I think Melanie, Dan and I agreed on trying to find a way to asynchronously clean up the devices (see comments #24 #25 and #27) and that can be discussed publicly, but again, this won't help with the data leakage that occurs on the attach command.

Revision history for this message
Dan Smith (danms) wrote :

Okay Gorka and I just had a nice long chat about things and I think we made some progress on understanding the (several) ways we can get into this situation and came up with some action items. I'll try to summarize here and I'll look for Gorka to correct me if I get anything wrong.

I think that we're now on the same page that deleting a running instance is much more of a forceful act than some might think, and that we try to be graceful about it, but with a limited amount of patience before we kill it with fire. That maps to us actually always calling force=True when we do the detachment. Even with force=True, brick *tries* to flush and disconnect gracefully, but if it can't, it will cut things off at the knees. Thus, if we did force=True now, we wouldn't get into the situation the bug describes, because we would *definitely* have cleaned up at that point.

It sounds like there are some robustification steps that can be made in brick to do more validation of the full chain from instance->multipathd->iscsi->volume when we're doing attachments to try to avoid getting into the situation described by this bug, so Gorka is going to work on that.

Gorka also described another way to get into this situation, which is much more exploitable by the user, and I'll let him describe it in more detail. But the short story is that cinder should not let users delete attachments for instances that nova says are running (i.e. not deleted).

Multipathd, while well-intentioned, also has some behavior that is counterproductive when recovering from various situations where paths to a device get disconnected. Enabling the recheck_wwid thing in multipathd should be a recommended flag to have enabled to reduce the likelihood of that happening. Especially in the case where nova has allowed a blind delete due to a downed compute node, we need multipathd to not "help" by reattaching things without extra checks.

So, the action items roughly are:

1. Nova should start passing force=True in our call to brick detach for instance delete
2. Recommend the recheck_wwid flag for multipathd, and get deployment tools to enable it
3. Robustification of brick's attach workflow to do some extra sanity checks
4. Cinder should refuse to allow users to delete an attachment for an active volume

Based on the cinder user-exploitable attack vector, it sounds to me like we should keep this bug private on that basis until we have at least the cinder/nova validation step in place. We could create another one for just that scenario, but publicizing the accidental scenario and discussion we have in this bug now might be enough of a suggestion that more people would figure out the user-oriented attack.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Sylvain, the data leak/corruption presented in this bug report is caused by the detach on the nova side.

It may happen when we do the attach, but it is 100% caused by the detach problem, so just focusing on the attach part is not right considering the RCA is the leftover devices from the detach.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Gorka, I eventually understood all the problems we have, and what Dan wrote in comment #38 looks good to me as action items.

Yeah, we need to keep this bug private for a bit until we figure out a solid plan for fixing those 4 items and yeah, we need to both force-delete the attachment while we also try to solidify the attachment calls.

Revision history for this message
melanie witt (melwitt) wrote :

I'm attaching a potential patch for nova to use force=True when calling os-brick disconnect_volume() when an instance is being deleted.

As far as I found, only the libvirt and hyperv drivers call os-brick disconnect_volume(), and it's part of the driver.destroy() path.

This change ended up being larger than expected ... I aimed to add basic test coverage for passing the force kwarg through and there are a lot of volume drivers.

If anyone wants something changed or otherwise finds issues in the patch, please let me know.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Melanie,

I have tried the patch and it works as expected, resolving the most common case of leftover devices on compute nodes. Thanks!!

Dan mentioned that deleting an instance is more in line with pulling a computer's power than a shutdown, and that's why using `force=True` makes sense: it will try to do it cleanly if possible, but data loss can occur.

I looked at the API docs [1] for the delete operation and I don't see this idea stated there. Should we update the docs to explicitly state that deleting an instance can result in data loss?

Cheers.

[1]: https://docs.openstack.org/api-ref/compute/?expanded=detach-a-volume-from-an-instance-detail,show-a-detail-of-a-volume-attachment-detail,show-server-action-details-detail,delete-server-detail#delete-server

Revision history for this message
melanie witt (melwitt) wrote :

Hi Gorka,

Thank you for trying out the patch!

I agree more detailed docs could be helpful and have proposed a doc update for review:

  https://review.opendev.org/c/openstack/nova/+/874188

Revision history for this message
Gorka Eguileor (gorka) wrote :

This is the patch I've prepared for Cinder to prevent users from exploiting the data leak issue or even to unintentionally leave leftover devices by deleting the cinder attachment record.

With the nova patch and this one we cover most of the scenarios, but not all, since I've been told there are scenarios where an instance is deleted without contact with the actual compute node.

I have to clean up the os-brick code, write the unit tests, and see how the "recheck_wwid" multipath config option interacts with it.

I also have to try and see if the issue also happens in FC, in which case I would need to modify the os-brick patch and also write a new one to add support for the "force" parameter in the "disconnect_volume" method.

Since there are some calls to Nova I would appreciate reviews from the Nova team to confirm that I didn't miss anything.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I can't reproduce the issue using FC with an HPE 3PAR array; debugging it, I found that the compute node receives a signal after the LUN has been remapped (this didn't happen in my iSCSI tests):

 Feb 17 13:05:20 localhost.localdomain kernel: sd 3:0:0:0: Power-on or device reset occurred
 Feb 17 13:05:20 localhost.localdomain kernel: sd 3:0:1:0: Power-on or device reset occurred

This is detected as a "change" in the block device:

  Feb 17 13:05:20 localhost.localdomain systemd-udevd[158430]: 3:0:1:0: /usr/lib/udev/rules.d/60-block.rules:8 ATTR '/sys/devices/pci0000:00/0000:00:05.0/host3/rport-3:0-4/target3:0:1/3:0:1:0/block/sdb/uevent' writing 'change'

Which triggers the code that uses an SCSI command to get the volume's WWID and then updates sysfs to reflect it.

  Feb 17 13:05:20 localhost.localdomain systemd-udevd[158430]: sdb: /usr/lib/udev/rules.d/60-persistent-storage.rules:66 Importing properties from results of 'scsi_id --export --whitelisted -d /dev/sdb'

After that rule another one for multipath is triggered to tell multipathd that it needs to check a device:

  Feb 17 13:05:20 localhost.localdomain systemd-udevd[158430]: sdb: /usr/lib/udev/rules.d/62-multipath.rules:36 Importing properties from results of '/sbin/multipath -u sdb'

Multipathd detects that the WWID has changed (because sysfs has been updated):

  Feb 17 13:05:20 localhost.localdomain multipathd[7007]: sdb: path wwid changed from '360002ac00000000000000b740000741c' to '360002ac00000000000000b750000741c'

And then reconfigures the old multipath device mapper to remove this device:

  Feb 17 13:05:20 localhost.localdomain multipathd[7007]: 360002ac00000000000000b740000741c: reload [0 2097152 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 3 1 8:0 1 8:48 1 8:32 1]
  Feb 17 13:05:20 localhost.localdomain multipathd[7007]: check_removed_paths: sdb: freeing path in removed state
  Feb 17 13:05:20 localhost.localdomain multipathd[7007]: 8:16: path removed from map 360002ac00000000000000b740000741c

And finally the new device mapper is formed:

  Feb 17 13:05:21 localhost.localdomain multipathd[7007]: sda [8:0]: path added to devmap 360002ac00000000000000b750000741c

I don't know if this is standard FCP behavior or if this is storage array specific and other storage arrays may not behave like this. I'm trying to get access to a different FC array to confirm.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

> I can't reproduce the issue using FC with an HPE 3PAR array, debugging it I found that the compute node receives a signal after the LUN has been remapped

This makes sense. On fibre channel fabrics, any time a LUN is added or removed an RSCN (https://en.wikipedia.org/wiki/Registered_state_change_notification) is sent out. That should signal to the HBA that it needs to recheck what it has access to, where in this case it will realize that the device it used to have access to is no longer present and trigger the cleanup of the device.

So in this case we are somewhat protected by the storage protocol itself.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Thanks Sean.

Those RSCNs should be the equivalent of the iSCSI AEN messages that usually trigger the automatic scan of LUNs on the initiator side.

Those scans aren't happening in OpenStack iSCSI because I added a feature to Open-iSCSI, which we use in os-brick, to disable them and only allow manual scans. That way we don't get leftover devices on the compute node when there's a race condition: a volume mapping to that compute node happens on the Cinder side right after a 'volume_disconnect' has happened on that same compute node.

I'll have to check why we haven't seen that situation in FC, because if it's detecting new LUNs and acting on them then we should also get leftover devices.

The only explanation I can think of is that maybe in FC the scan is not for all the LUNs but only the LUNs that are currently present in the host.

Simon from Pure is looking to see if he can give me access to a system to double check it also behaves like that.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I did some additional testing related to my latest comment, and the results are:

- LUN change notifications do not trigger a rescan in FCP, which is good because then we cannot have race conditions between detach and attach. That had been our understanding so far.

- The message that prevents the leak with FCP by triggering the udev rule is the "Power-on Reset" SCSI Sense code that is sent from the array, so I still need to check if this is common practice or not. Tomorrow I'll check it in one of Pure's arrays.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Bad news: as I feared, the "Power-on Reset" that was "saving" us in FCP is not standard, and Pure storage arrays using FCP do not send it.

This means that we are not safe for FC and need to fix these issues in that os-brick connector as well. :-(

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

FWIW, I reviewed melanie's patch in comment #41 and I'm ~+2 with it.

Yeah, it was a larger than expected patch given we need to modify the signature for all the volume drivers :)

Gorka, do you want me to review your patch too, even if you found some issues with FC backends?

Revision history for this message
Gorka Eguileor (gorka) wrote :

Sylvain, yes please, I would appreciate your review, as the Cinder patch is agnostic to the protocol.

The FC issue was relevant for the leak-prevention os-brick patch that I've been working on.

The attached os-brick patch adds support for the "force" parameter to "disconnect_volume" in the FC connector. This is necessary for the Nova patch to also cover the FC cases.

Revision history for this message
Gorka Eguileor (gorka) wrote :

This patch is the os-brick leak prevention code that tries to detect and prevent data leaks and corruption. It applies on top of the previous os-brick FC patch.

As I see it we have multiple situations that can lead to leak/corruption:

- The CVE that any normal user can exploit: Addressed by the Cinder patch.

- Unintended issue caused when deleting an instance if the detach fails: Addressed by the Nova and os-brick FC patches.

- Other scenarios: such as when an instance is destroyed without access to the compute node, and then access to the node is restored and we work with it without manually cleaning things up. This is covered by the large os-brick patch.

I would say that the current 4 patches cover 99% of the problematic cases. We can cover another 0.5% of the cases if we add "recheck_wwid yes" to multipath.conf when using the latest os-brick patch, but that's something we can work on in the open in TripleO.
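
For illustration, that multipath.conf addition would look something like this (defaults section shown; the option requires a sufficiently recent multipath-tools):

  defaults {
      recheck_wwid yes
  }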

This last os-brick patch is kind of a big one, which, together with what it does, makes it a bit risky to backport, so it may be wise not to backport it right away.

In other words, in my opinion we should just backport the cinder, nova, and FC os-brick patch.

Revision history for this message
Nick Tait (nickthetait) wrote :

It is not apparent to me who is waiting on what right now.

Gorka, could you help me better understand what is required for an attacker to exploit this? I made a rough guess at CVSS score: https://www.first.org/cvss/calculator/3.1#CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H
* Could this be executed remotely?
* What is the level of complexity to exploit?
* Could an attacker exploit this multiple times and eventually gain control of all images within the OpenStack deployment?
* Attacker would need at least a basic user account right?

Fungi, what are your thoughts on security classification? Possibly A or B1? Is it too early to pick a disclosure date?

Revision history for this message
Jeremy Stanley (fungi) wrote :

We have attached patches at this point for cinder, nova and (2 for) os-brick. It's not yet clear that there's consensus from the reviewers on this bug that the proposed fixes are sufficient and appropriate for backporting (at least to officially maintained stable branches, so as far back as stable/xena right now). Assuming the chosen fixes are suitable for backport, class A seems like the closest fit based on hints in comments #35 and #38 that there is an easily-exploitable condition for a normal user of the environment (though so far the details have not been explained here, as far as I've seen). Of course, before I can attempt to summarize this set of risks into an appropriate impact description, we'll need more information on that.

Following our current 90-day maximum embargo policy we have at most 8 weeks to figure this out, but of course it would be better to have it over and done with at the soonest opportunity. Basically if we can get consensus on the patches and a clearer explanation for the exploit scenarios and possible mitigations, then I'll apply for a CVE assignment from MITRE with that information. In parallel, we'll need clean patches for all of the above fixes backported at least as far as stable/xena. Once we have all that, we'll pick a disclosure date roughly a week out and send advance copies of the description and patches to downstream stakeholders so they can begin preparing their own packages.

Note that an additional wrinkle is the looming OpenStack 2023.1 coordinated release, which means that stable/2023.1 branches have already been created and we'll need backports from master to those as well (though I expect they'll be identical to the master branch patches in most cases). We'll also need to make sure to list the OpenStack 2023.1 release versions as affected since I highly doubt we'll publish in time to make one of the final RCs.

Revision history for this message
Dan Smith (danms) wrote :

I think this is *network*, not *local*, right? A user can trigger this via the API. They have to be authenticated, so they can't just be some random person, but they can cause the system to give them access to *other* users' data. Doesn't that also mean the "scope" is "changed"? Meaning, my guess is that it should have this scoring:

https://www.first.org/cvss/calculator/3.1#CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H

Gorka, I haven't tested your patch myself, but you and I did discuss it earlier. Looking at it now, I'm wondering: how can cinder redirect to or check with nova for a regular volume detach? If nova is the one doing the volume detach (via cinder), how does cinder know not to just redirect back to nova (creating a loop)? Is there some state cascade that we rely on to know that the detach has gone through nova at some point?

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Nick,

> It is not apparent to me who is waiting on what right now.

I'm waiting on reviews, though Rajat suggested that I do a video session to explain the whole issue to facilitate reviews and assessment.

> * Could this be executed remotely?

Yes, a normal user with normal credentials can exploit it.

> * What is the level of complexity to exploit?

Trivial.

Basically create a VM, attach one of your volumes to it, ask Cinder to delete the attachment record for the volume, then wait for another volume from any user to be attached to the same host and read the data.

This only works for iSCSI drivers that share targets, and some FC drivers.

> * Could an attacker exploit this multiple times and eventually gain control of all images within the OpenStack deployment?

The attacker would have access to volumes as long as they are present on the host.
So if the owner of the volume detaches it, or the instance is migrated to another host, then access to the volume is lost.

> * Attacker would need at least a basic user account right?

Yes

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Jeremy,

There are multiple cases/scenarios captured in this bug:

- User exploitable scenario.
- Unintentional scenarios that can happen after destroying a VM with an
  attached volume fails to cleanly detach the volume.
- Other scenarios.

The summary of the user exploitable vulnerability would be something like:

A normal user can get access to other users'/projects' volumes that are connected to
the same compute host where they are running an instance.

This issue doesn't affect every OpenStack deployment, for the exploit to
work there needs to be the right combination of nova configuration,
storage transport protocol, cinder driver approach to mapping volumes,
and storage array behavior.

I don't have access to all storage types supported by OpenStack, so I've
only looked into: iSCSI, FCP, NVMe-oF, and RBD.

It is my belief that this only affects SCSI-based transport protocols
(iSCSI and FCP) and only under the following conditions:

- For iSCSI the Cinder driver needs to be using what we call shared
  targets: the same iSCSI target and portal tuple is used to present
  multiple volumes on a compute host.

- For FCP it depends on the storage array:
  - Pure: Affected.
  - 3PAR: Unaffected, because it sends the "Power-on" message that
    triggers a udev rule that tells multipathd to make appropriate
    changes.

The way to reproduce the issue is very straightforward, it's all about
telling Cinder to delete an attachment record from a volume attached to
a VM instead of doing the detachment the right way via Nova. Then when
the next volume from that same backend is attached to the host our VM
will have access to it.

I'll give the steps using a devstack deployment, but the same would
happen on a TripleO deployment.

The only pre-requirement is that Cinder is configured to use one of the
storage array and driver combinations that are affected by this; it
happens with both single-path and multipath attachments.

Steps for the demo user to gain access to a volume owned by the admin
user:

  $ . openrc demo demo
  $ nova boot --flavor cirros256 --image cirros-0.5.2-x86_64-disk --nic none myvm
  $ cinder create --name demo 1
  $ openstack server add volume myvm demo

  # The next 2 lines are the exploit which delete the attachment record
  $ attach_id=`openstack --os-volume-api-version=3.33 volume attachment list -c ID -f value`
  $ cinder --os-volume-api-version=3.27 attachment-delete $attach_id

  $ . openrc admin admin
  $ nova boot --flavor cirros256 --image cirros-0.5.2-x86_64-disk --nic none admin_vm
  $ cinder create --name admin 1
  $ openstack server add volume admin_vm admin

  # Both VMs use the same volume, so the demo VM can read the admin volume
  $ sudo virsh domblklist instance-00000001
  $ sudo virsh domblklist instance-00000002

The patches that have been submitted are related to the different
scenarios/cases described before:

- User exploitable scenario ==> Cinder patch
- Unintentional scenarios that can happen after destroying a VM with an
  attached volume fails to cleanly detach the volume ==> Nova and small
  os-brick patch
- Other scenarios ==> Huge os-brick patch

The "recheck_wwid yes...


Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Dan,

> how cinder can redirect or check with nova for a regular volume detach?

The code is using the "service_token" field from the context to detect whether the request is coming from an OpenStack service (nova or glance), and if that's the case it processes the request.

If it's not coming from a service, it does a couple of checks to allow manual cleanup requests. So it allows user attachment delete calls under the following circumstances (sketched in the code below):

- If the attachment record doesn't have an instance id.
- If the attachment record doesn't have connection information.
- If it has an instance, but the instance doesn't exist in Nova.
- If the attachment record that Nova's instance is using has a different ID from the one being deleted.
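
A minimal sketch of those checks (hypothetical names; not the actual cinder patch):

    def attachment_deletion_allowed(ctx, attachment, nova):
        if ctx.service_token:
            # Request relayed by an OpenStack service (nova or glance).
            return True
        # Manual cleanup cases allowed for regular users:
        if attachment.instance_uuid is None:
            return True                  # not bound to any instance
        if not attachment.connection_info:
            return True                  # nothing is exported/mapped
        server = nova.get_server(ctx, attachment.instance_uuid)
        if server is None:
            return True                  # instance no longer exists in Nova
        # Allowed only if the instance is not using this attachment record.
        used_ids = {va.attachment_id for va in server.volume_attachments}
        return attachment.id not in used_ids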

Revision history for this message
melanie witt (melwitt) wrote (last edit ):

I did a little testing of the cinder patch with a local devstack, looking for any way I could delete the cinder attachment without going through nova.

Unfortunately, I found that I can apparently bypass the redirect by sending the X-Service-Token header with my regular token. So it looks like we need to do a little more to validate whether it's nova calling. Not sure if we can maybe pull nova's user_id from keystone and then verify that as well or instead? Or maybe there is some other better way?

(later) Update: I dug around and found out why it's possible to easily fake a service token and it's because [keystone_authtoken] option "service_token_roles_required" defaults to False since Ocata [1] and remains so today:

"""
Upgrade Notes

Set the service_token_roles to a list of roles that services may have. The likely list is service or admin. Any service_token_roles may apply to accept the service token. Ensure service users have one of these roles so interservice communication continues to work correctly. When verified, set the service_token_roles_required flag to True to enforce this behaviour. This will become the default setting in future releases.
"""

By default, any authenticated user can send their valid token as an "X-Service-Token" and keystone will accept it as a valid service token.
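
A request of roughly this shape would then be accepted as if it came from a service (Python for illustration; the endpoint, IDs, and token are placeholders):

    import requests

    CINDER_URL = "http://controller/volume"  # placeholder endpoint
    headers = {
        "X-Auth-Token": USER_TOKEN,              # a regular user's token
        "X-Service-Token": USER_TOKEN,           # same token, forged as a "service" token
        "OpenStack-API-Version": "volume 3.27",  # attachments API microversion
    }
    requests.delete(f"{CINDER_URL}/v3/{PROJECT_ID}/attachments/{ATTACHMENT_ID}",
                    headers=headers)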

If I however set in cinder.conf:

[keystone_authtoken]
service_token_roles_required = True

My below repro attempt will be rejected with:

{"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}

So either way we need a different way to verify whether it is nova calling DELETE /attachments/{attachment_id}.

[1] https://docs.openstack.org/releasenotes/keystonemiddleware/ocata.html#new-features

Repro steps:

Show that user "demo" does not have any service roles:

$ source openrc admin admin
$ openstack user list -f json
[
  {
    "ID": "a34218d9c4774df18a713ee8718eded7",
    "Name": "demo"
  }
]
$ openstack role assignment list --user a34218d9c4774df18a713ee8718eded7 --name -f json
[
  {
    "Role": "member",
    "User": "demo@Default",
    "Group": "",
    "Project": "invisible_to_admin@Default",
    "Domain": "",
    "System": "",
    "Inherited": false
  },
  {
    "Role": "anotherrole",
    "User": "demo@Default",
    "Group": "",
    "Project": "demo@Default",
    "Domain": "",
    "System": "",
    "Inherited": false
  },
  {
    "Role": "creator",
    "User": "demo@Default",
    "Group": "",
    "Project": "demo@Default",
    "Domain": "",
    "System": "",
    "Inherited": false
  },
  {
    "Role": "member",
    "User": "demo@Default",
    "Group": "",
    "Project": "demo@Default",
    "Domain": "",
    "System": "",
    "Inherited": false
  }
]

Begin repro:

$ source openrc demo demo
$ openstack volume create --size 1 test2004555 -f json
{
  "attachments": [],
  "availability_zone": "nova",
  "bootable": "false",
  "consistencygroup_id": null,
  "created_at": "2023-03-16T23:22:16.147130",
  "description": null,
  "encrypted": false,
  "id": "d66c2b17-a1ac-4bc8-a543-c36a829a9b7b",
  "multiattach": false,
  "name": "test2004555",
  "properties": {},
  "replication_status": null,
  "size": 1,...

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Melanie,

Thank you very much for testing the Cinder code, finding the loophole, and providing such detailed instructions.

I incorrectly assumed that keystonemiddleware would not only check that the service token in the header is valid, but also that it actually belonged to a service role.

I have changed the code to check that the roles from the service token (if a valid one is provided) actually include that of a service.

I'll check on Monday whether the new approach also works on older releases (in case we need a different approach for the backports) and also with Glance using Cinder as a backend (in case glance is not sending the service token).

Cheers.

Revision history for this message
Nick Tait (nickthetait) wrote :

Dan, OK I agree with you on network exploitable.

The CVSS user guide gives a relevant example of scope change. See item 1 of section 3.5 on https://www.first.org/cvss/v3.1/user-guide. So in this case, while attackers might gain access to another user's images, they do not gain influence over more components of OpenStack (for example keystone or glance).

Given this I'm leaning towards a score of 8.8 https://www.first.org/cvss/calculator/3.1#CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H

Revision history for this message
melanie witt (melwitt) wrote :

Hi Gorka,

I tried the new version of the cinder patch and it's working well from the nova point of view.

The new check for the service role prevents the X-Service-Token header bypass and there should not be any way to fake the roles because the roles on the RequestContext are extracted only from a validated token response from keystone which will return the roles internally associated with the token. (I tried sending my own X-Service-Roles header and it [correctly] did not work).

Other than that, upon review I noticed there are a few typos in the unit tests in the patch, for example "mock_action_get.assesrt_called_once_with". Because the mocks are MagicMocks this will be a mocked function call that doesn't check anything and will not fail.

I also got one unit test fail when I ran them (test__redirect_detach_to_nova_if_needed*) locally but it's possible that's something unrelated in my environment.

Thank you for fixing up the patch so fast!

Revision history for this message
Nick Tait (nickthetait) wrote :

Thanks Gorka and Melanie for your development & testing efforts!

Quick question: Would it be possible for an administrator to disable deletion via cinder? This might serve as a mitigation.

I took a crack at further condensing the vuln details below.

Impact: An openstack user could gain control of volumes from other users/projects. However, the scope of exposed images is limited to the compute host where the instance is running. Only SCSI based transport protocols are believed to be affected, but not all storage types have been tested.

Affected storage types: iSCSI and FCP
Unaffected storage types: NVMe-oF and RBD

Preconditions:
- For iSCSI the Cinder driver needs to be using "shared targets" where the same iSCSI target and portal tuple is used to present multiple volumes on a compute host.

- For FCP it depends on the storage array:
  - Pure: Affected.
  - 3PAR: Unaffected.

Attack scenario:
Use cinder to delete an attachment record from a volume which has already been attached to a VM

Revision history for this message
Gorka Eguileor (gorka) wrote :

Thanks Melanie for catching those.

I had forgotten to update the tests and there were also some mistakes in the unit tests due to the misspellings.

I have deleted the old cinder patch and attached an updated one fixing the unit test issues.

The code works as expected with Glance using Cinder as a backend as well.
Now I'll see if this approach works with older releases, since I don't know when services started sending the service token to each other.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I just realized that the cinder-patch needs improvements, because the presence of a service token in the request (and by extension the service roles in the context) depends on the deployment options, and some deployments may not have the "send_service_user_token" configured.

I'll give the patch another thought and add code for that scenario.
My initial idea is to check current actions on the instance to determine if the request is coming from the service or not, though I'm not familiar with all the nova actions that can trigger a cinder detach action.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Nick,

I've been looking at possible mitigations without code changes and there is a way with configuration changes and policy changes. Steps would be:

1- Configure cinder and nova to use the "service_user" and to send the token ("send_service_user_token") [1]
2- Get the service uuid for the cinder and nova service users
3- If using Cinder as a glance backend, get the uuid for the "cinder_store_user_name" from the glance configuration and ensure that the user has the service role.
4- Write the /etc/cinder/policy.yaml file

Assuming that the user names for each of the services match the service name we can get their uuid with:
  $ openstack user show nova -f value -c id
  $ openstack user show cinder -f value -c id
  $ openstack user show glance -f value -c id

The policy I would recommend writing is:
  "is_nova_service": "service_user_id:<nova_service_uuid> or user_id:<nova_service_uuid>"
  "is_cinder_service": "service_user_id:<cinder_service_uuid> or user_id:<cinder_service_uuid>"
  "is_glance_service": "service_user_id:<cinder_store_user_name_uuid> or user_id:<cinder_store_user_name_uuid>"
  "is_service": "rule:is_nova_service or rule:is_glance_service or rule:is_cinder_service"
  "volume:attachment_delete": "rule:admin_api or (rule:admin_or_owner and rule:is_service) or role:service"

A much smaller policy is possible, but I like the one above and it is the one that I have tested. This one probably works as well, assuming everything has been configured as mentioned above:
  "volume:attachment_delete": "rule:admin_api or (rule:admin_or_owner and (service_user_id:<nova_service_uuid> or service_user_id:<cinder_service_uuid> or role:service))"

These policies don't prevent:
- Admins shooting themselves in the foot
- Unintentional issues like the one originally reported in this case.

They should prevent the user induced vulnerability.

Cheers,
Gorka.

[1]: https://docs.openstack.org/cinder/latest/configuration/block-storage/service-token.html

Revision history for this message
Gorka Eguileor (gorka) wrote :

Hi Nick,

I like your vulnerability details, though there are a couple of small comments I'd like to make:

- "user could gain control of volumes" ==> It's more like they can gain read/write access to the volumes, but not control, because they cannot delete the volumes, take snapshots, etc.

- "the scope of exposed images" ==> This may be misleading, because when I hear the word "images" in the context of OpenStack I think of Glance images, not Cinder volumes.

- I feel like we are singling out Pure as the only affected FCP driver just because that's the one I could get my hands on. Maybe we can rephrase it:
  - Drivers using FCP will be affected unless the array sends the "Power-on Reset" SCSI Sense code when mapping the volume. In our limited testing only a 3PAR array sent it, but this doesn't mean that all 3PARs will.

Cheers,
Gorka.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I quite like Gorka's policy workarounds using the service_user tokens. That would let our operators just modify their configurations without needing to upgrade to some z-release, and then the exploit wouldn't be possible.

I also looked at https://bugs.launchpad.net/nova/+bug/2004555/+attachment/5656303/+files/cinder-2004555.patch and I'm quite OK with it, but I have a concern: if we want to backport it, then we could only go down to Xena, as microversion 2.89 is only available starting in that release.
https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#microversion-2-89

For this specific reason, we should either change the fix to use other, older Nova APIs (but honestly, I don't really know which ones) or explain in the vulnerability details that you need to use the policy workarounds if you're on something older than Xena.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

@Gorka: nice work finding the policy-based workaround!

The service_* properties have been exposed in oslo.context since 2.12.0 (Ocata) (commit 2eafb0eb6b0898), which, coincidentally, is when the Attachments API that allows the exploit was introduced.

oslo.policy has supported a yaml policy file since 1.10.0 (Newton) (commit 83d209e9ed1a1f7f70), so we'd only need to provide an example yaml file.

One thing we should mention is that for safety, the policy file should be explicitly mentioned in the configuration file for each service as the value of the [oslo_policy] policy_file option. That's because since Queens, if a policy_file isn't found, the policies defined in code are used, and until Wallaby or Xena, the default value for policy_file in most services was policy.json (which would mean that a policy.yaml file would be ignored in the default configuration). Likewise, in recent releases, a policy.json file is ignored in the default configuration, so it's safest to configure this explicitly.
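
For example, in cinder.conf (and analogously for the other services):

  [oslo_policy]
  policy_file = /etc/cinder/policy.yaml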

Revision history for this message
melanie witt (melwitt) wrote :

> I just realized that the cinder-patch needs improvements, because the presence of a service token in the request (and by extension the service roles in the context) depends on the deployment options, and some deployments may not have the "send_service_user_token" configured.

Hm. I wonder if we could instead only check whether the user requesting has the "service" role (if "service" in RequestContext.roles) or is a member of the "service" project? And leave the service_token part out of it.

Technically a deployment could give any project or role to their service users (and omit any) ... so I'm not sure whether it's reasonable to assume any of the project names or role names or user names.

I just can't think of another real way to verify the identity of the caller other than openstack credentials. There has to be a source of truth for verifying the identity of any caller.

> I'll give the patch another thought and add code for that scenario.
> My initial idea is to check current actions on the instance to determine if the request is coming from the service or not, though I'm not familiar with all the nova actions that can trigger a cinder detach action.

I'm not sure how nova actions could be a reliable way to know if nova called the detach API. There isn't a unique identifier sent to cinder that cinder could use to validate a request matches a server action. Each server action contains the request_id that performed it, but that wouldn't get sent to cinder unless it's sent as the global_request_id. Nova will send the request_id as the global_request_id only if there is not a global_request_id already in the RequestContext. So that wouldn't work if anyone sent a global_request_id when they called nova.

Other than that, you could only try to correlate the request based on server action timestamp unless I'm missing something.

Revision history for this message
Dan Smith (danms) wrote :

I definitely think that relying on server actions for something as important as this is a bad idea. We could easily change, break, or reorder code in that path without having any idea of the security implications...

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

following melwitt, comment #70

> Hm. I wonder if we could instead only check whether the user
> requesting has the "service" role (if "service" in
> RequestContext.roles) or is a member of the "service" project? And
> leave the service_token part out of it.

I'm afraid that if we try to figure out the source of the request
ourselves somehow, we'll be subject to some kind of request forgery
exploit. I think sticking with the service_token is the safest course
of action. The upside is that all the concerned services use the
keystone middleware that supports send_service_token, so other than
configuring each service correctly, there's no software upgrade or
anything involved.

> Technically a deployment could give any project or role to their
> service users (and omit any) ... so I'm not sure whether it's
> reasonable to assume any of the project names or role names or user
> names.

I agree with you completely here, and for the reasons you state, we
won't be able to provide a script to do this automatically. We'll
have to provide clear documentation of how to configure this correctly.
But the plus side is that while the send_user_token stuff may not be
configured at a site, there must be some kind of service user configured
for each service (at least I think so?), and we can refer to the config
options by name in explaining what to do to configure send_user_token
and make the policy file changes.

Revision history for this message
melanie witt (melwitt) wrote :

> I'm afraid that if we try to figure out the source of the request
> ourselves somehow, we'll be subject to some kind of request forgery
> exploit. I think sticking with the service_token is the safest course
> of action. The upside is that all the concerned services use the
> keystone middleware that supports send_service_token, so other than
> configuring each service correctly, there's no software upgrade or
> anything involved.

Yeah sorry, I was responding to the idea that we would have to: 1) accommodate the scenario where the deployer has *not* configured send_service_user_token and 2) accommodate it *without* requiring any config change by the deployer. If we can't require a config change, then I was saying maybe we could check for the "service" role (plain role).

I would much rather be able to use the service_token and require deployers to have send_service_user_token configured in order to be protected from this vulnerability. But it was not clear to me how far we can go with what action we can require from deployers when they install the update containing the exploit mitigation. If we require:

  [service_user]
  send_service_user_token = True

what will we do if it's False (the default)? Make nova services exit if it's not set to True with an error logged to say it's now required? If we don't do anything, when nova calls the detach API it would create a loop, as Dan mentioned in an earlier comment.

> I agree with you completely here, and for the reasons you state, we
> won't be able to provide a script to do this automatically. We'll
> have to provide clear documentation of how to configure this correctly.
> But the plus side is that while the send_user_token stuff may not be
> configured at a site, the must be some kind of service user configured
> for each service (at least I think so?), and we can refer to the config
> options by name in explaining what to do to configure send_user_token
> and make the policy file changes.

I would expect that even if a deployment has left [service_user]send_service_user_token = False, they would have some form of service user set up in keystone. But I am not 100% sure whether it's possible to run openstack without any dedicated service users today.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Just a reminder, our embargo policy promises a maximum of 90 days from initial report of a suspected vulnerability, and per the preamble in the bug description, that's... "This embargo shall not extend past 2023-05-03 and will be made public by or on that date even if no fix is identified."

That's four weeks from yesterday, so ideally we'll have fixes and an advisory ready to provide advance copies to downstream stakeholders at least a full week prior to that, which basically gives us only three weeks to wrap up the debate over patches and prepare all relevant backports (at least as far back as stable/yoga since stable/xena will be transitioning to extended maintenance before then, but also backporting to stable/xena if possible would be nicer to our users).

Revision history for this message
melanie witt (melwitt) wrote :

Thanks Jeremy.

IMHO there's not a clearly great solution here that will work for every deployment configuration. So I think we'll have to choose the least bad option, unfortunately.

Dan and I chatted about this bug today and I will try to summarize what we talked about to try and move things forward. We don't have much time ...

Of the options we have:

1) Redirect all non-service user DELETE /attachments requests to Nova

Problems with it:

* Requires non-default deployment configuration [1]

a) There must be a 'service' role in keystone and it must be assigned to the Nova and Glance users

b) The Cinder service must be configured to enforce service token roles:
[keystone_authtoken]
service_token_roles_required = true
service_token_roles = service (this is the default)

c) The Nova service must be configured to send service tokens:
[service_user]
send_service_user_token = true
(plus username, password, project, etc)

* Consequence of not having the non-default configuration:

There would be a forever loop between Nova and Cinder when Nova attempts any DELETE /attachments calls.

2) Reject all non-service user DELETE /attachments requests

Problems with it:

a-c) Same as option 1)

* Consequence of not having the non-default configuration:

All DELETE /attachments requests will be rejected by Cinder until the deployment is configured as required.

3) Do not accept DELETE /attachments requests on the public API endpoint

Problems with it:

a) Nova would need to be configured to call the private API endpoint for DELETE /attachments

* Consequence of not having the non-default configuration:

All DELETE /attachments requests will be rejected/ignored by Cinder until the deployment is configured as required.

4) Change default Cinder API policy to admin-only for DELETE /attachments

a) The Nova and Glance users must be configured as admin users

* Consequence of not having the non-default configuration:

All DELETE /attachments requests will be rejected/ignored by Cinder until the deployment is configured as required.

5) Other ideas?

Please feel free to correct me if I've got anything wrong here.

[1] https://docs.openstack.org/keystone/latest/admin/manage-services.html#service-users

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

As a security workaround, I'd recommend option #4 for operators wanting to be quickly safe until we find a better solution.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I could be wrong, but option #4 shouldn't work, because the requests from Nova come with the user's credentials, not with the nova or glance service users' credentials.

Revision history for this message
Gorka Eguileor (gorka) wrote :

The new Cinder patch changes our approach to reject the dangerous requests with 409 error and also protects the volume action REST API endpoint that has 2 operations that could be used for the attack.

The commit message has more details.

Revision history for this message
melanie witt (melwitt) wrote :

> I could be wrong, but option #4 shouldn't work, because the requests from Nova come with the user credentials, not with the nova or glance users.

No, you are right, sorry. For some reason I had been thinking Nova called the attachment delete API with an elevated RequestContext but it doesn't.

So option #4 (if I've not made another mistake!) would have to be instead:

4) Change default Cinder API policy (in the code) to admin-only for DELETE /attachments and terminate_connection APIs and also change the Nova code to use elevated RequestContext when calling the terminate_connection and attachment_delete APIs.

I'm probably missing something but with this option a configuration change would not be needed. It would however obviously allow admins to delete attachments without going through Nova.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Forgot to update the release notes in my previous Cinder patch. Updated it now with upgrades and critical section notes.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Forgot to remove 2 methods that were no longer being used in the cinder patch.

Revision history for this message
Nick Tait (nickthetait) wrote :

Spoke with Dan Smith today and finally understood just how urgent this issue is. This revised my scoring to a 9.1 https://www.first.org/cvss/calculator/3.1#CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:L/A:L

Tentatively reserved CVE-2023-2088. Jeremy, if you still want to get a CVE directly from MITRE, I'll reject mine, no big deal.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Reviewed the cinder patch (5fe7d14c097260). Code and tests look good. Just a few minor things:

api-ref/source/v3/attachments.inc
nit: s/cinder api/Block Storage API/

cinder/exception.py
nit: s/through nova/using the Compute API/

releasenote:
1. in 'critical': s/token services/service tokens/
2. in 'security': s/other/another/
3. in 'upgrade': s/service it's/service if it's/
4. in 'upgrade': the role, option, and section names should be in double-backticks (they're in single backticks, which will render as italics instead of monospace font)
actually, forget 3 & 4 and maybe rewrite the upgrade section slightly:

upgrade:
  - |
    Nova must be `configured to send service tokens
    <https://docs.openstack.org/cinder/latest/configuration/block-storage/service-token.html>`_
    **and** cinder must be configured to recognize at least one of the roles
    that the nova service user has been assigned in keystone. By default,
    cinder will recognize the ``service`` role, so if the nova service user
    is assigned a differently named role in your cloud, you must adjust your
    cinder configuration file (``service_token_roles`` configuration option
    in the ``keystone_authtoken`` section). If nova and cinder are not
    configured correctly in this regard, detaching volumes will no longer
    work (`Bug #2004555 <https://bugs.launchpad.net/cinder/+bug/2004555>`_).

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Another comment about the cinder patch: I looked through the tempest and cinder-tempest-plugin tests, and the only one I found that could be affected is test_attach_detach_volume_to_instance:
https://opendev.org/openstack/tempest/src/commit/3c7eebaaf35c9e8a3f00c76cd1741457bdec9fab/tempest/api/volume/test_volumes_actions.py#L39-L55

This test should now raise a 409 when detach is called. I'm not sure what the best way to handle this is. Possibly talk to the QA team and merge a test skip referencing bug #2004555 now, and then fix the test as soon as the cinder patch lands?

Revision history for this message
Dan Smith (danms) wrote :

I dunno that referencing an embargoed bug is really the best plan before disclosure. However, I suspect we could convince them to just do it without a strong justification if we explain (privately) what's going on.

However, I think that the race is really to disclosure and getting patches up and not necessarily a race to land them, right? If we had a patch ready to go to do the skip (or just fix the test), we could pre-arrange with them to get it +2+W on the same timeline as everything else. With proper Depends-On linkage, that should be okay right?

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Yeah, yeah, my point was that we need a skip test with some kind of acceptable notation. We can't just fix the test because the cinder patch can't pass tempest with the current test, and tempest with a fixed test will be broken until the cinder patch lands.

So we'll need (I can post these patches):
1. tempest skip patch: cinder patch goes green for tempest with depends-on this patch
2. tempest fix patch: should be green with depends-on(cinder patch)

We post 1, 2, and cinder patch simultaneously to show that everything works, and then the merge order will be 1, cinder patch, 2.

If that sounds OK, I'll attach the patches and then we can add a tempest core to this bug.

Revision history for this message
Dan Smith (danms) wrote :

Yeah, I think that's the best plan.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Latest changes to the cinder patch:

- Updated the exception message
- Rewrote the api-ref section for the delete attachment
- Added missing api-ref text for the terminate connection and the force detach actions
- Added a docstring to the `is_service` method
- Amended the commit message that had a second Change-Id
- Updated the release notes as per comment #85
- Added an issues section to the release notes

The only code change in this patch should be the error message returned with the 409 error.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Brian, as far as I know the mentioned test should not fail, because devstack configures Nova to send the service token.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Resolved conflicts with the latest master code.

Revision history for this message
Dan Smith (danms) wrote :

Gorka, I think the test *will* fail because we're not actually using nova there. We're creating an attachment with a server_id directly and then trying to detach it as a user. It's basically testing the seam and scenario that we're changing here.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Quick update.

The test wouldn't have failed if it were creating the attachment directly in Cinder and then deleting the attachment, even if it had the instance uuid, because Cinder would see that there is no nova instance or that the instance doesn't have the volume attached or that it's not using the attachment record.

Unfortunately we've found, looking at that tempest test, that there is yet another way to detach volumes in Cinder: using the "os-detach" volume action. So I need to update the cinder patch to also protect that endpoint.
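
For reference, that volume action is a POST to the volume's action endpoint (Python for illustration; the endpoint, IDs, and token are placeholders):

    import requests

    # The "os-detach" volume action; attachment_id identifies which
    # attachment to remove when the volume has more than one.
    requests.post(f"{CINDER_URL}/v3/{PROJECT_ID}/volumes/{VOLUME_ID}/action",
                  json={"os-detach": {"attachment_id": ATTACHMENT_ID}},
                  headers={"X-Auth-Token": USER_TOKEN})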

We have also determined that Glance could expose the wrong image contents because it's not passing `force=True` on the os-brick `disconnect_volume`. This should be a small patch, and Brian will be working on it.

Revision history for this message
Dan Smith (danms) wrote (last edit ):

Okay, but it *does* create a server in nova and uses that uuid for the attachment. So if cinder does check nova, it will find an instance. I haven't looked deeply at the cinder patch, but you're saying because nova doesn't think the instance is actually attached to the volume, it will allow the delete? If so, then cool.

Presumably we want to also have a tempest test added to ensure that if we create/attach through nova and try to delete the attachment as a user, we get the expected 409. Not critical before disclosure I suppose, but I think we probably want that eventually.

Revision history for this message
Gorka Eguileor (gorka) wrote :

+1 to adding tempest tests to confirm that dangerous calls are not allowed (failing by getting 409, 401, or 403 errors) depending on the configuration options.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Adding glance_store patch.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Updated Cinder patch that also covers the `detach` volume action.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Adding glance_store patch.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Since the writeup for this is going to be extremely involved, it doesn't make much sense to draft and review it in bug comments. Let's use https://etherpad.opendev.org/p/LqauNAF4pXDKBChwh8fVAFYNTEiYxFLycA8RuvWEztYsumj4 to assemble our thoughts in preparation for future publication.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Reviewed the os-brick FC force disconnect support patch. Release note reads well and generates correctly. Code and tests LGTM.

The only thing I noticed was in os_brick/initiator/linuxscsi.py ... this could be a follow-up, or you could just ignore this comment: in the multipath_del_map() method you added, if mpath isn't deleted until the last iteration of the loop, we won't log the debug message at line 774, which could be misleading when troubleshooting, because sometimes on success we get a log message and sometimes we don't.

Revision history for this message
melanie witt (melwitt) wrote :

Added release note, docs, and upgrade status check to the nova patch.

Not sure if the above should be a separate patch, I can split the patch if so.

Revision history for this message
Dan Smith (danms) wrote :

Melanie, the updated nova patch looks pretty good to me and thanks for adding the nova-status check. I was thinking we'd do that after, but it's definitely nice to have it right away. I agree the patch is pretty massive right now, and under normal circumstances I'd split it up of course. However, I imagine some would argue the backporting will be easier and faster as a monolith.

One other thing, I think we should add the service user stuff to the install guide sections as well. I think that's where people likely get started for the bare minimum required config, so I think they'd probably find it weird to have a required chunk of config tucked at the bottom of the admin guide (which is linked from the top under "maintenance"). What do you think?

Revision history for this message
Jeremy Stanley (fungi) wrote :

It's not just the backporting that's easier with fewer patches. Keep in mind that we're going to be distributing advance copies of these to downstream stakeholders (like cloud operators and the private linux-distros mailing list) as file attachments in E-mail, so the more patches those recipients need to juggle and worry about sequencing, the greater the risk something goes wrong for them.

Revision history for this message
melanie witt (melwitt) wrote :

Ack Dan and Jeremy, that was kind of my thinking too, that normally we would split it up but that keeping the number of patches to a minimum may be the right move for this.

Dan, initially I wasn't sure where required config should go in the docs so I just picked something. I agree the install guide would be better, so I'll move it there.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Additional tempest negative tests to verify that the detach, force detach, terminate connection, and attachment delete operations are protected.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Added doc changes and modified the patch so that tempest runs without changes (at least tox -eintegrated-storage)

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

The tempest test 'test_attach_detach_volume_to_instance' and similar tests in that file do not seem very valid, even though they pass. I tested the behavior when the attachment is done via cinder and nova does not know about it, even though the test creates and passes a valid/existing server id. In that case it is clear that the user cannot use that attachment (list, show, write to the volume, etc.) from the nova perspective.

These are very old tests and should have been written from a cinder standalone-service perspective, where no server creation/passing is needed. Irrespective of this bug, we should modify these tests to be as close as possible to real user operations.

For any other failing test, we can check what operation the test verifies and, based on that, go the skip-test route. Below is the process for Tempest test skips/modifications needed to land a service-side bug fix:
- https://docs.openstack.org/tempest/latest/HACKING.html#bug-fix-on-core-project-needing-tempest-changes

Revision history for this message
melanie witt (melwitt) wrote :

Added service user token configuration instructions to the install guides.

Revision history for this message
melanie witt (melwitt) wrote :

nova-2004555.patch applies cleanly to Bobcat, Antelope, Zed, Yoga
nova-2004555-xena.patch applies cleanly to Xena, Wallaby

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

reviewed cinder patch cc20649efa7383f495

Primary issue is that (at least on my reading) the api-ref and the commit message/release note conflict over whether users are allowed to make the 3 action API calls. The api-ref says that they can, with success dependent on satisfying a safe delete (as for the Attachments API delete call), but the commit message/relnote say the action calls are service-only. The code looks like it's implementing what the api-ref says.

docs: changes are good, read well, render correctly in HTML, links all work (only discuss configuration, so not affected by the above)

api-ref: nit: os-detach, os-force_detach are missing the 409 in the response codes list (only mentioning it because you have it for the os-terminate_connection action)

cinder/volume/api.py
nit: line 2583 (reason=) s/atachment/attachment/

cinder/tests/unit/api/contrib/test_admin_actions.py
nit: line 1040: mock_deletion_allowed returns True, but the real function either raises or returns None; I think it would be better to return None from the mock

cinder/tests/unit/api/v3/test_attachments.py
nit: test_attachment_deletion_allowed_service_call() relies on the default keystone_authtoken.service_token_roles containing 'service' (which is fine), but since there was talk at some point of hard-coding a 'service' check, it would be good to have a test that uses a non-default value for the config opt

cinder/tests/unit/policies/test_volume_actions.py
in test_owner_can_attach_detach_volume, line 973: I think you deleted 'body = {"os-detach":{}}' by mistake and now the test is checking 2 attach calls

release note: if you revise, the single backticks produce italics; you need double backticks for monospace font

The code in volume/api.py looks fine and the tests are thorough

Revision history for this message
Gorka Eguileor (gorka) wrote :

Ghanshyam, the test "test_attach_detach_volume_to_instance" is now a valid test, because it's something we want to confirm will work, as it is contemplated as one of the cases where a user deleting an attachment record is acceptable.

Revision history for this message
Gorka Eguileor (gorka) wrote :

Updated patch to support force disconnect on FC driver.
Changes:
- Always display the log message
- Easier to read (using the retry decorator)
- Exponential backoff between retries
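
A minimal sketch of that shape (illustrative only, using the tenacity library and plain subprocess/print; the actual patch lives in os-brick's linuxscsi code and uses its own helpers):

    import subprocess

    import tenacity

    @tenacity.retry(wait=tenacity.wait_exponential(multiplier=1, max=8),
                    stop=tenacity.stop_after_attempt(3),
                    reraise=True)
    def multipath_del_map(mpath_id: str) -> None:
        # A non-zero exit raises CalledProcessError, which tenacity retries
        # with exponential backoff before finally re-raising.
        subprocess.run(['multipathd', 'del', 'map', mpath_id], check=True)
        # Log on every successful deletion, not only early loop iterations
        # (Brian's earlier review point).
        print(f'Removed multipath device map {mpath_id}')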

Revision history for this message
Gorka Eguileor (gorka) wrote :

Brian thanks for the review, I have updated the patch with your suggestions (comment #111).

I may have changed the phrasing in a later patch than the one you reviewed, but I believe that in the latest one the api-ref, commit message, comments in code, and release note all say the same thing; I just used different wording since the audiences are different.

For example the release note is very brief and it reads: "cinder now rejects user attachment delete requests for attachments that are being used by nova instances".
Being used by a nova instance means that the instance exists, that it has the volume attached, and that the volume attachment in the instance is using that particular attachment.

Good call on the missing 409 response codes in the api-ref, the unintentionally deleted line in the test, etc.

Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi Brian,

One comment regarding the glance store patch, we also have another disconnect_volume call in the attachment_state_manager[1] so we will need to pass the force flag there as well.
This file is for handling multiattach volumes where we only disconnect from os-brick if we are on the last attachment.

I also checked the backport patches by Gorka for Zed, Yoga and Xena and they do handle this case so we shouldn't require revised backports.

[1] https://github.com/openstack/glance_store/blob/6741951591ca7d6144f6089678df8cee4f0a7030/glance_store/common/attachment_state_manager.py#L233
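
For context, the os-brick call in question has this shape (a sketch; connection_properties and device_info are placeholders produced by the attach flow):

    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory('iscsi', None)  # root_helper elided
    # force=True keeps os-brick flushing and removing devices even if an
    # earlier step fails, so a failed detach cannot leave a live mapping.
    conn.disconnect_volume(connection_properties, device_info,
                           force=True, ignore_errors=True)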

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

I should note that the latest cinder-master patch (f6f4b77b21393539d0e45b6d4d8df31ee4f7f0ef) LGTM and passes all the usual tests locally (pep8, docs, releasenotes, api-ref; and unit, functional in py39 and py310).

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

@Rajat: some refactoring by an excellent software engineer (i.e., you) in 2023.1 restructured the code so that the multiattach manager is actually calling the method that now contains the force, so it only needs to be changed in that one place.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Verified that glance_store-2004555.patch applies cleanly to stable/2023.1

Revision history for this message
Gorka Eguileor (gorka) wrote :
Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

Revisions to osbrick-fc patch LGTM. Nice restructuring of multipath_del_map(). Reviewed 570df49db9de30 (master to zed).

Revision history for this message
Jeremy Stanley (fungi) wrote :

We're 5 weekdays away from our self-imposed publication deadline, so unless we've got the text and backported patches ready to distribute downstream tomorrow, we should probably push that out. Our vulnerability management policy[*] states, "Embargoes for privately-submitted reports of suspected vulnerabilities shall not last more than 90 days, except under unusual circumstances." While we didn't start making headway on this report as early as I would have preferred, the cross-project nature of the problem and broad impact does qualify as an "unusual circumstance" in my opinion so I'm proposing we extend the deadline by a week to Wednesday, May 10 in order to complete proper due diligence of review and testing of the proposed solutions. Are there any objections?

As for where we are now... It appears we have consensus and no new concerns raised on patches for the cinder, glance_store, nova, os-brick and tempest repositories, with backports as far as the stable/xena branch (even though we assume stable/yoga will be the oldest non-EM branch by the time we publish, since stable/xena was supposed to reach EM a week ago). For the document, we seem to have most of the details filled in but are still working to finalize the prose for accuracy and clarity. I think once everyone following is happy enough with what's there, we'll be ready to pick a publication date and distribute advance copies of the document and patches to our downstream stakeholders.

Revision history for this message
Jeremy Stanley (fungi) wrote :

(Sorry, I forgot to footnote the relevant policy URL.)

[*] https://security.openstack.org/repos-overseen.html#requirements

Revision history for this message
Jeremy Stanley (fungi) wrote :

And just a reminder to anyone who missed the link in comment #100, we're using https://etherpad.opendev.org/p/LqauNAF4pXDKBChwh8fVAFYNTEiYxFLycA8RuvWEztYsumj4 to brainstorm and refine messaging for the upcoming publication.

Revision history for this message
Nick Tait (nickthetait) wrote :

No complaints from me on delaying disclosure date.

Revision history for this message
Nick Tait (nickthetait) wrote :

FYI, Launchpad seems to display comment numbers that are out of order. I believe the content is correctly ordered and dated, but the numbering is wrong. At one point I saw 68, 69, 70, 16, 17, 18 ... 38, 39, 40, 71, 72, 73 but currently it shows me 13, 14, 41, 42 ... 93, 94, 16, 17 ... 39, 40, 95, 96

Revision history for this message
melanie witt (melwitt) wrote :

Added cherry-pick and conflicts lines to commit message.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

I think the text in https://etherpad.opendev.org/p/LqauNAF4pXDKBChwh8fVAFYNTEiYxFLycA8RuvWEztYsumj4 is ready to go. Everyone might want to take one last read through to make sure we haven't missed anything.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Seems like we have consensus on the draft text in the etherpad sufficient for me to assemble downstream advance notice and publication, and agreement on the approach in the supplied patches and backports. On the assumption that the testing being performed by involved parties has turned up no additional problems, I propose that we schedule publication for 15:00 UTC on Wednesday, May 10 with 5 business day advance notification to downstream stakeholders on Wednesday, May 3. Are there any objections?

description: updated
summary: - [ussuri] Wrong volume attachment - volumes overlapping when connected
- through iscsi on host
+ Unauthorized volume access through deleted volume attachments
+ (CVE-2023-2088)
Changed in ossa:
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Jeremy Stanley (fungi)
Revision history for this message
Dan Smith (danms) wrote : Re: Unauthorized volume access through deleted volume attachments (CVE-2023-2088)

No objection from me.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I was barely able to look at this bug report given the long discussions, but I'm ultimately OK with the etherpad, since it explains the problem, a workaround, and the fix.
Operators can then choose between upgrading their services or modifying their existing environments.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

The description of the short-term mitigation strategy via policy/config change is clear and has been tested, so I think we're ready to go.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Please consider the contents of the draft etherpad we were using effectively frozen as of now, since I'm working on incorporating it into the advance notification for downstream stakeholders in preparation for sending later today. If you notice any significant problems with the information or text, please raise them here in the bug report, since we'll have to consider whether to treat further corrections as errata. The same goes for any adjustments to the patches currently attached to the bug report. Thanks!

Revision history for this message
Jeremy Stanley (fungi) wrote :

Just to confirm: the osbrick-leak-2004555-master.patch attachment in comment #123 is not mentioned in the writeup, so I'm assuming it was supplanted by the osbrick-fc-2004555-master_to_zed.patch attachment from comment #130, but please let me know if that's a mistaken assumption on my part.

Revision history for this message
Alan Bishop (alan-bishop) wrote :

I wish I could offer more details and a definitive answer, but Gorka is on holiday this week, so I have to take a stab at answering this one. I believe the osbrick-fc-2004555-master_to_zed.patch from comment #130 may not supersede the osbrick-leak-2004555-master.patch attachment in comment #123. I think the former (the -fc patch) is an additional patch that adds "force detach" to FC connections; that is, it only applies to FC connections. The osbrick-leak-2004555-master.patch is also part of the solution and is distinct from the FC-only patch. Hopefully Brian and/or Rajat can confirm or deny my response.

Revision history for this message
Brian Rosmaita (brian-rosmaita) wrote :

@Jeremy: the osbrick-leak patch addresses some possible corner cases, but it is too risky to backport as it may cause regressions. Our current thinking is that it should be worked on in public as a patch to master after this issue has been made public, where it can go through the normal review and CI process. (To answer your question: it wasn't supplanted by the osbrick-fc patch; it is a child of that patch.)

Revision history for this message
Jeremy Stanley (fungi) wrote :

Thanks, Alan and Brian, for the clarification on the leak patch. I didn't attach it to the downstream notification, which seems to have been the right call. It makes sense to treat it as a master-branch-only hardening fix after publication.

As for the downstream notification, it took a little longer than I intended to massage it into the shape of our templated communications and map/copy the patches to the branch-specific name format we've standardized, but it was sent to our private embargo-notice ML and the private linux-distros ML a little before 01:00 UTC.

Revision history for this message
Jeremy Stanley (fungi) wrote :

A quick reminder: We're scheduled to make this information public at 15:00 UTC tomorrow (Wednesday, May 10). I'll be switching the bug report to Public Security a few minutes before that, so the devs involved can start pushing patches/backports into Gerrit at that time (I'll also comment on the bug letting everyone know to start). Once everything has been pushed to Gerrit, so that we know what the change URLs are for all of them, I can publish the advisory and accompanying security note to the security.openstack.org site and relevant mailing lists. Timely attention to getting those changes pushed is, therefore, greatly appreciated.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Since we have a lot of patches to get pushed for this, I've gone ahead and opened the bug up about 30 minutes early. Please begin pushing the fixes/backports to Gerrit at your earliest opportunity so I can include the links for them in our advisory publication. Thanks!

description: updated
information type: Private Security → Public Security
Changed in ossn:
assignee: nobody → Jeremy Stanley (fungi)
importance: Undecided → High
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (master)
Changed in glance-store:
status: New → In Progress
Changed in cinder:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/882835

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/cinder/+/882836

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/cinder/+/882837

Jeremy Stanley (fungi)
summary: - Unauthorized volume access through deleted volume attachments
- (CVE-2023-2088)
+ [OSSA-2023-003] Unauthorized volume access through deleted volume
+ attachments (CVE-2023-2088)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/cinder/+/882838

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/cinder/+/882839

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/882840

Changed in os-brick:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/882841

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/os-brick/+/882843

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/os-brick/+/882844

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/os-brick/+/882846

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/882847

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/os-brick/+/882848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/glance_store/+/882851

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/882852

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/glance_store/+/882853

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/glance_store/+/882854

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/glance_store/+/882855

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/882858

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/882859

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/882860

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/882861

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882863

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/882864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882867

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/882868

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882869

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/882870

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ossa (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ossa/+/882879

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ossa (master)

Reviewed: https://review.opendev.org/c/openstack/ossa/+/882879
Committed: https://opendev.org/openstack/ossa/commit/d62fe374e42538e11abc9b34f5c38258e8279f40
Submitter: "Zuul (22348)"
Branch: master

commit d62fe374e42538e11abc9b34f5c38258e8279f40
Author: Jeremy Stanley <email address hidden>
Date: Wed May 10 14:39:22 2023 +0000

    Add OSSA-2023-003 (CVE-2023-2088)

    Change-Id: Iab9cca074c2928dbecbe512f813fe421a744c592
    Closes-Bug: #2004555

Changed in ossa:
status: In Progress → Fix Released
Jeremy Stanley (fungi)
Changed in ossn:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882851
Committed: https://opendev.org/openstack/glance_store/commit/a7eed0263e436f841a3c277e051bdc6d6e07447d
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit a7eed0263e436f841a3c277e051bdc6d6e07447d
Author: Brian Rosmaita <email address hidden>
Date: Tue Apr 18 11:22:27 2023 -0400

    Add force to os-brick disconnect

    In order to be sure that devices are being removed from the host,
    we should be using the 'force' parameter with os-brick's
    disconnect_volume() method.

    Closes-bug: #2004555
    Change-Id: I63d09ad9ef465bc154c85a9ea125449c039d1b90
    (cherry picked from commit 1d8033e54e009bbc4408f6e16aec4f6c01687c91)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (master)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882834
Committed: https://opendev.org/openstack/glance_store/commit/1d8033e54e009bbc4408f6e16aec4f6c01687c91
Submitter: "Zuul (22348)"
Branch: master

commit 1d8033e54e009bbc4408f6e16aec4f6c01687c91
Author: Brian Rosmaita <email address hidden>
Date: Tue Apr 18 11:22:27 2023 -0400

    Add force to os-brick disconnect

    In order to be sure that devices are being removed from the host,
    we should be using the 'force' parameter with os-brick's
    disconnect_volume() method.

    Closes-bug: #2004555
    Change-Id: I63d09ad9ef465bc154c85a9ea125449c039d1b90

Changed in glance-store:
status: In Progress → Fix Released
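
For illustration, the "Add force to os-brick disconnect" change amounts to any os-brick consumer passing force=True (and ignore_errors=True) on disconnect, so that devices are removed from the host even when flushing fails; the companion FC patch below adds the same keyword arguments to the FC connector. A minimal sketch, with placeholder connection values:

    from os_brick.initiator import connector

    # Build a connector for the transport protocol the Cinder backend
    # reports; using plain 'sudo' as the root helper is an assumption
    # made for brevity here.
    conn = connector.InitiatorConnector.factory(
        'iscsi', root_helper='sudo', use_multipath=True)

    # Placeholder values; in real deployments these come from Cinder's
    # initialize_connection/attachment APIs.
    connection_properties = {
        'target_portal': '192.0.2.10:3260',
        'target_iqn': 'iqn.2010-10.org.openstack:volume-0000',
        'target_lun': 1,
    }
    device_info = {'type': 'block', 'path': '/dev/sdb'}

    # With force=True os-brick still attempts a graceful flush first,
    # then removes the device regardless; ignore_errors suppresses the
    # flush failure instead of leaving the device attached to the host.
    conn.disconnect_volume(connection_properties, device_info,
                           force=True, ignore_errors=True)
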
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance_store (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/glance_store/+/882892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882840
Committed: https://opendev.org/openstack/os-brick/commit/570df49db9de3030e658619138588b836c007f8c
Submitter: "Zuul (22348)"
Branch: master

commit 570df49db9de3030e658619138588b836c007f8c
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors on the
    disconnect_volume of the FC connector like we have in the iSCSI
    connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882843
Committed: https://opendev.org/openstack/os-brick/commit/ffb76e10bca1a2b76dd48780e8b4402f02dc1775
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit ffb76e10bca1a2b76dd48780e8b4402f02dc1775
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors on the
    disconnect_volume of the FC connector like we have in the iSCSI
    connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66
    (cherry picked from commit 570df49db9de3030e658619138588b836c007f8c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882853
Committed: https://opendev.org/openstack/glance_store/commit/e9d2509926445fd95c9bba9e1cacacb85a5e58af
Submitter: "Zuul (22348)"
Branch: stable/zed

commit e9d2509926445fd95c9bba9e1cacacb85a5e58af
Author: Brian Rosmaita <email address hidden>
Date: Tue Apr 18 11:22:27 2023 -0400

    Add force to os-brick disconnect

    In order to be sure that devices are being removed from the host,
    we should be using the 'force' parameter with os-brick's
    disconnect_volume() method.

    Closes-bug: #2004555
    Change-Id: I63d09ad9ef465bc154c85a9ea125449c039d1b90
    (cherry picked from commit 1d8033e54e009bbc4408f6e16aec4f6c01687c91)
    (cherry picked from commit a7eed0263e436f841a3c277e051bdc6d6e07447d)
    Conflicts:
            glance_store/_drivers/cinder/base.py
            glance_store/tests/unit/cinder/test_base.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882854
Committed: https://opendev.org/openstack/glance_store/commit/28301829777d4b1d2d7bca59fda108158d2ad6ca
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 28301829777d4b1d2d7bca59fda108158d2ad6ca
Author: Brian Rosmaita <email address hidden>
Date: Tue Apr 18 11:22:27 2023 -0400

    Add force to os-brick disconnect

    In order to be sure that devices are being removed from the host,
    we should be using the 'force' parameter with os-brick's
    disconnect_volume() method.

    Closes-bug: #2004555
    Change-Id: I63d09ad9ef465bc154c85a9ea125449c039d1b90
    (cherry picked from commit 1d8033e54e009bbc4408f6e16aec4f6c01687c91)
    (cherry picked from commit a7eed0263e436f841a3c277e051bdc6d6e07447d)
    Conflicts:
            glance_store/_drivers/cinder/base.py
            glance_store/tests/unit/cinder/test_base.py
    (cherry picked from commit e9d2509926445fd95c9bba9e1cacacb85a5e58af)
    Conflicts:
            glance_store/tests/unit/test_cinder_base.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882855
Committed: https://opendev.org/openstack/glance_store/commit/1f447bc184500e070cbfcada76b0ea51104919b1
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 1f447bc184500e070cbfcada76b0ea51104919b1
Author: Brian Rosmaita <email address hidden>
Date: Tue Apr 18 11:22:27 2023 -0400

    Add force to os-brick disconnect

    In order to be sure that devices are being removed from the host,
    we should be using the 'force' parameter with os-brick's
    disconnect_volume() method.

    Closes-bug: #2004555
    Change-Id: I63d09ad9ef465bc154c85a9ea125449c039d1b90
    (cherry picked from commit 1d8033e54e009bbc4408f6e16aec4f6c01687c91)
    (cherry picked from commit a7eed0263e436f841a3c277e051bdc6d6e07447d)
    Conflicts:
            glance_store/_drivers/cinder/base.py
            glance_store/tests/unit/cinder/test_base.py
    (cherry picked from commit e9d2509926445fd95c9bba9e1cacacb85a5e58af)
    Conflicts:
            glance_store/tests/unit/test_cinder_base.py
    (cherry picked from commit 28301829777d4b1d2d7bca59fda108158d2ad6ca)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882846
Committed: https://opendev.org/openstack/os-brick/commit/111b3931a2db1d5be4ebe704bf26c34fa9408483
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 111b3931a2db1d5be4ebe704bf26c34fa9408483
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors on the
    disconnect_volume of the FC connector like we have in the iSCSI
    connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66
    (cherry picked from commit 570df49db9de3030e658619138588b836c007f8c)
    Conflicts:
            os_brick/initiator/connectors/fibre_channel.py

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882847
Committed: https://opendev.org/openstack/nova/commit/db455548a12beac1153ce04eca5e728d7b773901
Submitter: "Zuul (22348)"
Branch: master

commit db455548a12beac1153ce04eca5e728d7b773901
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882852
Committed: https://opendev.org/openstack/nova/commit/41c64b94b0af333845e998f6cc195e72ca5ab6bc
Submitter: "Zuul (22348)"
Branch: master

commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, nova does not however have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
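
In operator terms, the change above pairs with a [service_user] section in nova.conf like the following hedged sketch (the endpoint and credentials are placeholders); with it configured, nova sends an X-Service-Token that other services can validate:

    [service_user]
    send_service_user_token = true
    auth_type = password
    auth_url = https://keystone.example.com/v3
    username = nova
    user_domain_name = Default
    project_name = service
    project_domain_name = Default
    password = REDACTED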

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882844
Committed: https://opendev.org/openstack/os-brick/commit/e00d3ca753db6f60d58a5c0d4b6675b5bea8fc72
Submitter: "Zuul (22348)"
Branch: stable/zed

commit e00d3ca753db6f60d58a5c0d4b6675b5bea8fc72
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors on the
    disconnect_volume of the FC connector like we have in the iSCSI
    connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66
    (cherry picked from commit 570df49db9de3030e658619138588b836c007f8c)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance_store (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/glance_store/+/882907

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance_store (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/glance_store/+/882908

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882858
Committed: https://opendev.org/openstack/nova/commit/efb01985db88d6333897018174649b425feaa1b4
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit efb01985db88d6333897018174649b425feaa1b4
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882835
Committed: https://opendev.org/openstack/cinder/commit/6df1839bdf288107c600b3e53dff7593a6d4c161
Submitter: "Zuul (22348)"
Branch: master

commit 6df1839bdf288107c600b3e53dff7593a6d4c161
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we just won't allow most requests not coming from a
    service. The rules we apply are the same as for attachment delete
    explained earlier, but in this case we may not have the attachment id
    and be more restrictive. This should not be a problem for normal
    operations because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface anymore

    Checking whether it's a service or not is done at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides non-service valid credentials the service will
    return a 401 error, otherwise it'll return 409 as if a normal user had
    made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a

Changed in cinder:
status: In Progress → Fix Released
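
The service check described in the commit is driven by keystonemiddleware's service-token options; a hedged cinder.conf sketch follows (using "service" as the role name is the common convention, not something the patch mandates):

    [keystone_authtoken]
    service_token_roles = service
    service_token_roles_required = true
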
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/882893

Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/882941

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Sven Kieske <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/882941
Reason: duplicate of https://review.opendev.org/c/openstack/kolla-ansible/+/882893

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882836
Committed: https://opendev.org/openstack/cinder/commit/dd6010a9f7bf8cbe0189992f0848515321781747
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit dd6010a9f7bf8cbe0189992f0848515321781747
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we just won't allow most requests not coming from a
    service. The rules we apply are the same as for attachment delete
    explained earlier, but in this case we may not have the attachment id
    and be more restrictive. This should not be a problem for normal
    operations because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface

    Checking whether it's a service or not is done at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides non-service valid credentials the service will
    return a 401 error, otherwise it'll return 409 as if a normal user had
    made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a
    (cherry picked from commit 6df1839bdf288107c600b3e53dff7593a6d4c161)
    Conflicts:
            cinder/exception.py

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882859
Committed: https://opendev.org/openstack/nova/commit/1f781423ee4224c0871ab4aafec191bb2f7ef0e4
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, nova does not however have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance_store (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/glance_store/+/882980

Revision history for this message
Jeremy Stanley (fungi) wrote :

I was contacted privately by an operator who read the advisory and was unable to reproduce the failure in their iSCSI-based deployment. They suspect that not relying on multipathd is protecting them from the vulnerability. Is anyone able to confirm whether this affects iSCSI environments without multipathd? If it doesn't, I'll look into issuing an errata update further clarifying the scope of the vulnerability.

Revision history for this message
Dan Smith (danms) wrote :

I think it does *not* depend on multipathd, but as noted in the text, it doesn't apply to *all* iSCSI deployments for various reasons.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I just realized that for iSCSI-based systems only those using "shared targets" are affected, as I mentioned in comment #57. We forgot to mention that in the final errata.

Regarding multipathing: when multipathing is in use there are additional issues that can lead to leaks, and we expect all production environments to use multipathing.

It could also be that they haven't properly checked, or that their storage system issues the "Power-on or device reset" Unit Attention event that prevents the issue from happening, as I observed on an HPE 3PAR FC system.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882837
Committed: https://opendev.org/openstack/cinder/commit/cb4682fb836912225c5da1536108a0d05fd5c46e
Submitter: "Zuul (22348)"
Branch: stable/zed

commit cb4682fb836912225c5da1536108a0d05fd5c46e
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we just won't allow anything that is not coming from a
    service. This should not be a problem because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface

    Checking whether it's a service or not is done at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides non-service valid credentials the service will
    return a 401 error, otherwise it'll return 409 as if a normal user had
    made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a
    (cherry picked from commit 6df1839bdf288107c600b3e53dff7593a6d4c161)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit dd6010a9f7bf8cbe0189992f0848515321781747)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882838
Committed: https://opendev.org/openstack/cinder/commit/a66f4afa22fc5a0a85d5224a6b63dd766fef47b1
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit a66f4afa22fc5a0a85d5224a6b63dd766fef47b1
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we just won't allow most requests not coming from a
    service. The rules we apply are the same as for attachment delete
    explained earlier, but in this case we may not have the attachment id
    and be more restrictive. This should not be a problem for normal
    operations because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface

    Checking whether it's a service or not is done at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides non-service valid credentials the service will
    return a 401 error, otherwise it'll return 409 as if a normal user had
    made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a
    (cherry picked from commit 6df1839bdf288107c600b3e53dff7593a6d4c161)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit dd6010a9f7bf8cbe0189992f0848515321781747)
    (cherry picked from commit cb4682fb836912225c5da1536108a0d05fd5c46e)
    Conflicts:
            cinder/exception.py

Revision history for this message
Jeremy Stanley (fungi) wrote :

Can someone in Red Hat Security please switch the assigned CVE to published status? A downside to the VMT not getting CVE assignments directly through MITRE is that MITRE apparently also refuses to process requests to switch them to public once we publish our advisories. It would be very nice for this not to still be in "reserved" state, as we're now 48 hours past the original publication.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 4.4.0

This issue was fixed in the openstack/glance_store 4.4.0 release.

Revision history for this message
Nick Tait (nickthetait) wrote :

I did submit the record to MITRE yesterday; it's now waiting on them to review and post it.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Thanks Nick. I notified MITRE about the publication on Wednesday when we posted it (per our process, this normally works when we were the ones to originally request the assignment from them), but they responded today telling me to talk to you, so I suppose it's in limbo for the time being.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/883017

Revision history for this message
Zakhar Kirpichenko (kzakhar) wrote :

The following packages were updated on Wallaby compute nodes:

python3-nova:amd64 (3:23.2.2-0ubuntu1~cloud1, 3:23.2.2-0ubuntu1~cloud2),
python3-os-brick:amd64 (4.3.3-0ubuntu1~cloud0, 4.3.3-0ubuntu1~cloud1),
nova-compute-libvirt:amd64 (3:23.2.2-0ubuntu1~cloud1, 3:23.2.2-0ubuntu1~cloud2),
nova-common:amd64 (3:23.2.2-0ubuntu1~cloud1, 3:23.2.2-0ubuntu1~cloud2),
os-brick-common:amd64 (4.3.3-0ubuntu1~cloud0, 4.3.3-0ubuntu1~cloud1),
nova-compute-kvm:amd64 (3:23.2.2-0ubuntu1~cloud1, 3:23.2.2-0ubuntu1~cloud2),
nova-compute:amd64 (3:23.2.2-0ubuntu1~cloud1, 3:23.2.2-0ubuntu1~cloud2)

nova-compute is now unable to detach volumes from instances:

2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server [req-470d3e0e-e59c-40c5-9597-6649c08add16 046191f8ebfd4695b3387a5ead3a9a55 85945271df8b4a6f9d37c37e4e52958d - default default] Exception during message handling: TypeError: disconnect_volume() got an unexpected keyword argument 'force'
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 71, in wrapped
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server _emit_versioned_exception_notification(
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server raise self.value
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/utils.py", line 1434, in decorated_function
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-05-13 05:53:00.128 3219193 ERROR oslo_messaging.rpc.server...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882860
Committed: https://opendev.org/openstack/nova/commit/8b4b99149a35663fc11d7d163082747b1b210b4d
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 8b4b99149a35663fc11d7d163082747b1b210b4d
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)

Revision history for this message
Jeremy Stanley (fungi) wrote :

Zakhar: Make sure your packages include the nova patch for OSSA-2023-003 errata #1: https://review.opendev.org/q/I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882861
Committed: https://opendev.org/openstack/nova/commit/0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, nova does not however have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/883110

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/883017
Committed: https://opendev.org/openstack/kolla-ansible/commit/a77ea13ef1991543df29b7eea14b1f91ef26f858
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit a77ea13ef1991543df29b7eea14b1f91ef26f858
Author: Sean Mooney <email address hidden>
Date: Wed May 10 20:58:47 2023 +0100

    always add service_user section to nova.conf

    As of I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8 nova
    now requires the service_user section to be configured
    to address CVE-2023-2088. This change adds
    the service user section to the nova.conf template in
    the nova and nova-cell roles.

    Related-Bug: #2004555
    Signed-off-by: Sven Kieske <email address hidden>
    Change-Id: I2189dafca070accfd8efcd4b8cc4221c6decdc9f

tags: added: in-stable-wallaby
Revision history for this message
Dan Smith (danms) wrote :

Zakhar, which volume driver are you using?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/883110
Committed: https://opendev.org/openstack/kolla-ansible/commit/03c12abbcc107bfec451f4558bc97d14facae01c
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 03c12abbcc107bfec451f4558bc97d14facae01c
Author: Sean Mooney <email address hidden>
Date: Wed May 10 20:58:47 2023 +0100

    always add service_user section to nova.conf

    As of I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8 nova
    now requires the service_user section to be configured
    to address CVE-2023-2088. This change adds
    the service user section to the nova.conf template in
    the nova and nova-cell roles.

    Related-Bug: #2004555
    Signed-off-by: Sven Kieske <email address hidden>
    Change-Id: I2189dafca070accfd8efcd4b8cc4221c6decdc9f
    (cherry picked from commit a77ea13ef1991543df29b7eea14b1f91ef26f858)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/883113

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882863
Committed: https://opendev.org/openstack/nova/commit/4d8efa2d196f72fdde33136a0b50c4ee8da3c941
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 4d8efa2d196f72fdde33136a0b50c4ee8da3c941
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host which
    could then potentially be reused to map to volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)
    (cherry picked from commit 8b4b99149a35663fc11d7d163082747b1b210b4d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882864
Committed: https://opendev.org/openstack/nova/commit/98c3e3707c08a07f7ca5996086b165512f604ad6
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 98c3e3707c08a07f7ca5996086b165512f604ad6
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    Currently, nova does not however have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)
    (cherry picked from commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a)

Revision history for this message
melanie witt (melwitt) wrote :

> Looks like it doesn't know about the "force" keyword that's being passed.

Hi Zakhar,

I checked through and found one missing kwarg for the LibvirtNetVolumeDriver -- I assume that is the driver you are using.

I had incorrectly thought the Xena and Wallaby patches were identical but there is a slight difference. Apologies for that.

I have updated the gerrit patch review with the change:

  https://review.opendev.org/c/openstack/nova/+/882869/2
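
A minimal, self-contained sketch (illustrative class names, not nova's actual code) of the failure above and the kind of fix melanie describes, where the driver's disconnect_volume() signature must accept the new 'force' keyword:

    # A driver whose signature lacks the 'force' kwarg raises TypeError
    # when the compute manager calls disconnect_volume(..., force=True).
    class OldDriver:
        def disconnect_volume(self, connection_info, instance):
            pass

    # The fix adds the kwarg; a driver with no host block device to
    # flush can simply ignore it.
    class FixedDriver:
        def disconnect_volume(self, connection_info, instance, force=False):
            pass

    try:
        OldDriver().disconnect_volume({}, None, force=True)
    except TypeError as exc:
        print(exc)  # ... got an unexpected keyword argument 'force'

    FixedDriver().disconnect_volume({}, None, force=True)  # accepted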

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/883113
Committed: https://opendev.org/openstack/kolla-ansible/commit/cb105dc293ff1cdb11ab63fa3e3bf39fd17e0ee0
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit cb105dc293ff1cdb11ab63fa3e3bf39fd17e0ee0
Author: Sean Mooney <email address hidden>
Date: Wed May 10 20:58:47 2023 +0100

    always add service_user section to nova.conf

    As of I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8 nova
    now requires the service_user section to be configured
    to address CVE-2023-2088. This change adds
    the service user section to the nova.conf template in
    the nova and nova-cell roles.

    Related-Bug: #2004555
    Signed-off-by: Sven Kieske <email address hidden>
    Change-Id: I2189dafca070accfd8efcd4b8cc4221c6decdc9f
    (cherry picked from commit a77ea13ef1991543df29b7eea14b1f91ef26f858)
    (cherry picked from commit 03c12abbcc107bfec451f4558bc97d14facae01c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/883114

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/883114
Committed: https://opendev.org/openstack/kolla-ansible/commit/efe6650d09441b02cf93738a94a59723d84c5b19
Submitter: "Zuul (22348)"
Branch: stable/zed

commit efe6650d09441b02cf93738a94a59723d84c5b19
Author: Sean Mooney <email address hidden>
Date: Wed May 10 20:58:47 2023 +0100

    always add service_user section to nova.conf

    As of I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8 nova
    now requires the service_user section to be configured
    to address CVE-2023-2088. This change adds
    the service user section to the nova.conf template in
    the nova and nova-cell roles.

    Related-Bug: #2004555
    Signed-off-by: Sven Kieske <email address hidden>
    Change-Id: I2189dafca070accfd8efcd4b8cc4221c6decdc9f
    (cherry picked from commit a77ea13ef1991543df29b7eea14b1f91ef26f858)
    (cherry picked from commit 03c12abbcc107bfec451f4558bc97d14facae01c)
    (cherry picked from commit cb105dc293ff1cdb11ab63fa3e3bf39fd17e0ee0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/882893
Committed: https://opendev.org/openstack/kolla-ansible/commit/ddadaa282e72cc437470859766ac963ac757a26a
Submitter: "Zuul (22348)"
Branch: master

commit ddadaa282e72cc437470859766ac963ac757a26a
Author: Sean Mooney <email address hidden>
Date: Wed May 10 20:58:47 2023 +0100

    always add service_user section to nova.conf

    As of I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8 nova
    now requires the service_user section to be configured
    to address CVE-2023-2088. This change adds
    the service user section to the nova.conf template in
    the nova and nova-cell roles.

    Related-Bug: #2004555
    Signed-off-by: Sven Kieske <email address hidden>
    Change-Id: I2189dafca070accfd8efcd4b8cc4221c6decdc9f
    (cherry picked from commit a77ea13ef1991543df29b7eea14b1f91ef26f858)
    (cherry picked from commit 03c12abbcc107bfec451f4558bc97d14facae01c)
    (cherry picked from commit cb105dc293ff1cdb11ab63fa3e3bf39fd17e0ee0)
    (cherry picked from commit efe6650d09441b02cf93738a94a59723d84c5b19)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance_store (master)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882980
Committed: https://opendev.org/openstack/glance_store/commit/ce86bf38239e3962db880bc9bfbaa9f6364a2d14
Submitter: "Zuul (22348)"
Branch: master

commit ce86bf38239e3962db880bc9bfbaa9f6364a2d14
Author: Brian Rosmaita <email address hidden>
Date: Thu May 11 12:12:51 2023 -0400

    Update 'extras' for cinder driver

    Raise the min version of os-brick to include the fix for
    CVE-2023-2088.

    Change-Id: If3dba01d5cbb3a3deacdf23ab5290d7bcab4b5c7
    Related-bug: #2004555
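
The 'extras' in question are the optional dependency groups in glance_store's setup.cfg; the change raises the os-brick floor in the cinder extra, along these lines (the version numbers here are illustrative, see the merged patch for the real pins):

    [extras]
    cinder =
      python-cinderclient>=4.1.0
      os-brick>=6.2.2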

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance_store (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882892
Committed: https://opendev.org/openstack/glance_store/commit/4f4de2348f38a623523c37c31c91fdcf18bbcbf6
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 4f4de2348f38a623523c37c31c91fdcf18bbcbf6
Author: Brian Rosmaita <email address hidden>
Date: Wed May 10 15:49:52 2023 -0400

    Update 'extras' for cinder driver

    Raise the min version of os-brick to include the fix for
    CVE-2023-2088.

    Change-Id: I4433df9414129ab2acec772791e05a17e3bf78ed
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance_store (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882907
Committed: https://opendev.org/openstack/glance_store/commit/02ab740fbf2a2fb12d2459b4e52e0200aa5e8f20
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 02ab740fbf2a2fb12d2459b4e52e0200aa5e8f20
Author: Brian Rosmaita <email address hidden>
Date: Wed May 10 20:13:57 2023 -0400

    Update 'extras' for cinder driver

    Raise the min version of os-brick to include the fix for
    CVE-2023-2088.

    Change-Id: I6c55fc943d26a8a0fdffc028b123ef4e6ff68cb2
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance_store (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/882908
Committed: https://opendev.org/openstack/glance_store/commit/712eb6df3b79009b49c0cf075675d75f14281914
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 712eb6df3b79009b49c0cf075675d75f14281914
Author: Brian Rosmaita <email address hidden>
Date: Wed May 10 20:17:36 2023 -0400

    Update 'extras' for cinder driver

    Raise the min version of os-brick to include the fix for
    CVE-2023-2088.

    Change-Id: Ic8bc4d7ae7e38eca65be01184add7ae1ca377a22
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882867
Committed: https://opendev.org/openstack/nova/commit/b574901500d936488cdedf9fda90c4d36eeddd97
Submitter: "Zuul (22348)"
Branch: stable/xena

commit b574901500d936488cdedf9fda90c4d36eeddd97
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host, which
    could then potentially be reused to map volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Conflicts:
        nova/tests/unit/virt/libvirt/volume/test_lightos.py
        nova/virt/libvirt/volume/lightos.py

    NOTE(melwitt): The conflicts are because change
    Ic314b26695d9681d31a18adcec0794c2ff41fe71 (Lightbits LightOS driver) is
    not in Xena.

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)
    (cherry picked from commit 8b4b99149a35663fc11d7d163082747b1b210b4d)
    (cherry picked from commit 4d8efa2d196f72fdde33136a0b50c4ee8da3c941)
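
In os-brick terms, the flag is passed straight through to the connector's disconnect_volume(). A minimal sketch of a forced disconnect follows (the connection_properties and device_info dicts are placeholders; in practice they come from cinder's os-initialize_connection API and the prior connect_volume() call):

    from os_brick.initiator import connector

    # Placeholder values; real ones come from cinder and connect_volume().
    connection_properties = {'target_portal': '192.0.2.10:3260',
                             'target_iqn': 'iqn.2004-04.com.example:vol1',
                             'target_lun': 1}
    device_info = {'type': 'block', 'path': '/dev/sdb'}

    conn = connector.InitiatorConnector.factory(
        'iSCSI', root_helper='sudo', use_multipath=True)
    # force=True ignores flush errors so the device is always removed
    # from the host; os-brick still attempts a graceful flush first.
    conn.disconnect_volume(connection_properties, device_info,
                           force=True, ignore_errors=False)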

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882868
Committed: https://opendev.org/openstack/nova/commit/6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    However, nova does not currently have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)
    (cherry picked from commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a)
    (cherry picked from commit 98c3e3707c08a07f7ca5996086b165512f604ad6)

Revision history for this message
Zakhar Kirpichenko (kzakhar) wrote :

I apologize for the late response. My volumes are Ceph RBD; I'm not sure which driver Nova uses internally.

Thanks for your feedback and fixes, everyone!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 4.3.1

This issue was fixed in the openstack/glance_store 4.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 3.0.1

This issue was fixed in the openstack/glance_store 3.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store 4.1.1

This issue was fixed in the openstack/glance_store 4.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.2.0

This issue was fixed in the openstack/nova 25.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.2.0

This issue was fixed in the openstack/nova 26.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 22.1.0

This issue was fixed in the openstack/cinder 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.1.0

This issue was fixed in the openstack/nova 27.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 20.3.0

This issue was fixed in the openstack/cinder 20.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 21.3.0

This issue was fixed in the openstack/cinder 21.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/883360

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882869
Committed: https://opendev.org/openstack/nova/commit/5b4cb92aa8adab2bd3d7905e0b76eceab680ab28
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5b4cb92aa8adab2bd3d7905e0b76eceab680ab28
Author: melanie witt <email address hidden>
Date: Wed Feb 15 22:37:40 2023 +0000

    Use force=True for os-brick disconnect during delete

    The 'force' parameter of os-brick's disconnect_volume() method allows
    callers to ignore flushing errors and ensure that devices are being
    removed from the host.

    We should use force=True when we are going to delete an instance to
    avoid leaving leftover devices connected to the compute host, which
    could then potentially be reused to map volumes to an instance that
    should not have access to those volumes.

    We can use force=True even when disconnecting a volume that will not be
    deleted on termination because os-brick will always attempt to flush
    and disconnect gracefully before forcefully removing devices.

    Conflicts:
        nova/tests/unit/virt/libvirt/volume/test_lightos.py
        nova/virt/libvirt/volume/lightos.py

    NOTE(melwitt): The conflicts are because change
    Ic314b26695d9681d31a18adcec0794c2ff41fe71 (Lightbits LightOS driver) is
    not in Xena.

    NOTE(melwitt): The difference from the cherry picked change is because
    of the following additional affected volume driver in Wallaby:
        * nova/virt/libvirt/volume/net.py

    Closes-Bug: #2004555

    Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
    (cherry picked from commit db455548a12beac1153ce04eca5e728d7b773901)
    (cherry picked from commit efb01985db88d6333897018174649b425feaa1b4)
    (cherry picked from commit 8b4b99149a35663fc11d7d163082747b1b210b4d)
    (cherry picked from commit 4d8efa2d196f72fdde33136a0b50c4ee8da3c941)
    (cherry picked from commit b574901500d936488cdedf9fda90c4d36eeddd97)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ossa (master)

Reviewed: https://review.opendev.org/c/openstack/ossa/+/883202
Committed: https://opendev.org/openstack/ossa/commit/136b24c5ddfaff6f4957af9bc9b84fa1b7deb6e3
Submitter: "Zuul (22348)"
Branch: master

commit 136b24c5ddfaff6f4957af9bc9b84fa1b7deb6e3
Author: Jeremy Stanley <email address hidden>
Date: Mon May 15 18:52:55 2023 +0000

    Add errata 3 for OSSA-2023-003

    Since this only impacts the fix for stable/wallaby, which is not
    under normal maintenance, we'll dispense with the usual errata
    announcements.

    Change-Id: Ibd0d1d796012fb5d34d48925ce34f6f1c300b54e
    Related-Bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/882870
Committed: https://opendev.org/openstack/nova/commit/48150a6fbab7e2a7b9fbeaa39110d0e6f7f37aaf
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 48150a6fbab7e2a7b9fbeaa39110d0e6f7f37aaf
Author: melanie witt <email address hidden>
Date: Tue May 9 03:11:25 2023 +0000

    Enable use of service user token with admin context

    When the [service_user] section is configured in nova.conf, nova will
    have the ability to send a service user token alongside the user's
    token. The service user token is sent when nova calls other services'
    REST APIs to authenticate as a service, and service calls can sometimes
    have elevated privileges.

    However, nova does not currently have the ability to send a service user
    token with an admin context. This means that when nova makes REST API
    calls to other services with an anonymous admin RequestContext (such as
    in nova-manage or periodic tasks), it will not be authenticated as a
    service.

    This adds a keyword argument to service_auth.get_auth_plugin() to
    enable callers to provide a user_auth object instead of attempting to
    extract the user_auth from the RequestContext.

    The cinder and neutron client modules are also adjusted to make use of
    the new user_auth keyword argument so that nova calls made with
    anonymous admin request contexts can authenticate as a service when
    configured.

    Related-Bug: #2004555

    Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
    (cherry picked from commit 41c64b94b0af333845e998f6cc195e72ca5ab6bc)
    (cherry picked from commit 1f781423ee4224c0871ab4aafec191bb2f7ef0e4)
    (cherry picked from commit 0d6dd6c67f56c9d4ed36246d14f119da6bca0a5a)
    (cherry picked from commit 98c3e3707c08a07f7ca5996086b165512f604ad6)
    (cherry picked from commit 6cc4e7fb9ac49606c598e72fcd3d6cf02efac4f1)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/883951

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/882839
Committed: https://opendev.org/openstack/cinder/commit/68fdc323369943f494541a3510e71290b091359f
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 68fdc323369943f494541a3510e71290b091359f
Author: Gorka Eguileor <email address hidden>
Date: Thu Feb 16 15:57:15 2023 +0100

    Reject unsafe delete attachment calls

    Due to how the Linux SCSI kernel driver works, there are some storage
    systems, such as iSCSI with shared targets, where a normal user can
    access other projects' volume data connected to the same compute host
    using the attachments REST API.

    This affects both single and multi-pathed connections.

    To prevent users from doing this, unintentionally or maliciously,
    cinder-api will now reject some delete attachment requests that are
    deemed unsafe.

    Cinder will process the delete attachment request normally in the
    following cases:

    - The request comes from an OpenStack service that is sending the
      service token that has one of the roles in `service_token_roles`.
    - Attachment doesn't have an instance_uuid value
    - The instance for the attachment doesn't exist in Nova
    - According to Nova the volume is not connected to the instance
    - Nova is not using this attachment record

    There are 3 operations in the actions REST API endpoint that can be used
    for an attack:

    - `os-terminate_connection`: Terminate volume attachment
    - `os-detach`: Detach a volume
    - `os-force_detach`: Force detach a volume

    In this endpoint we simply reject most requests that do not come from a
    service. The rules we apply are the same as for attachment delete,
    explained earlier, but in this case we may not have the attachment id,
    so we are more restrictive. This should not be a problem for normal
    operations because:

    - Cinder backup doesn't use the REST API but RPC calls via RabbitMQ
    - Glance doesn't use this interface

    Checking whether it's a service or not is done at the cinder-api level
    by checking that the service user that made the call has at least one of
    the roles in the `service_token_roles` configuration. These roles are
    retrieved from keystone by the keystone middleware using the value of
    the "X-Service-Token" header.

    If Cinder is configured with `service_token_roles_required = true` and
    an attacker provides valid non-service credentials, the service will
    return a 401 error; otherwise it will return a 409, as if a normal user
    had made the call without the service token.

    Closes-Bug: #2004555
    Change-Id: I612905a1bf4a1706cce913c0d8a6df7a240d599a
    (cherry picked from commit 6df1839bdf288107c600b3e53dff7593a6d4c161)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit dd6010a9f7bf8cbe0189992f0848515321781747)
    (cherry picked from commit cb4682fb836912225c5da1536108a0d05fd5c46e)
    Conflicts:
            cinder/exception.py
    (cherry picked from commit a66f4afa22fc5a0a85d5224a6b63dd766fef47b1)
    Conflicts:
            cinder/compute/nova.py
            cinder/tests/unit/attach...

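Taken together, the rules above amount to a check along these lines (a hedged paraphrase of the commit message; the helper and attribute names are illustrative, not cinder's actual code):

    def is_safe_attachment_delete(context, attachment, nova, service_roles):
        # Service-to-service calls carry a service token whose roles
        # keystonemiddleware validated from the X-Service-Token header.
        if set(getattr(context, 'service_roles', []) or []) & set(service_roles):
            return True
        if not attachment.instance_uuid:
            return True
        server = nova.get_server(context, attachment.instance_uuid)
        if server is None:
            return True  # the instance no longer exists in Nova
        if not nova.volume_is_attached(server, attachment.volume_id):
            return True  # Nova reports the volume as not connected
        if not nova.uses_attachment(server, attachment.id):
            return True  # Nova is not using this attachment record
        return False  # unsafe: reject the delete request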

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/883360
Committed: https://opendev.org/openstack/cinder/commit/1101402b8fda7423b41b2f2e078f8f5a1d2bb4bd
Submitter: "Zuul (22348)"
Branch: master

commit 1101402b8fda7423b41b2f2e078f8f5a1d2bb4bd
Author: Gorka Eguileor <email address hidden>
Date: Wed May 17 13:42:41 2023 +0200

    Doc: Improve service token

    This patch slightly extends the documentation for the service token
    configuration, since there have been complaints about its clarity and
    completeness.

    Related-Bug: #2004555
    Change-Id: Id89497d068c1644e4615fc0fb85c4d1a139ecc19
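
The settings being documented are the keystonemiddleware service-token options on the receiving side, for example (an illustrative cinder.conf snippet; 'service' is the default role name):

    [keystone_authtoken]
    service_token_roles = service
    service_token_roles_required = true

This is paired with a [service_user] section (as shown earlier for nova.conf) on the services that originate the calls.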

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/884571

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/882848
Committed: https://opendev.org/openstack/os-brick/commit/70493735d2f99523c4a23ecbeed15969b2e81f6b
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 70493735d2f99523c4a23ecbeed15969b2e81f6b
Author: Gorka Eguileor <email address hidden>
Date: Wed Mar 1 13:08:16 2023 +0100

    Support force disconnect for FC

    This patch adds support for the force and ignore_errors parameters on
    the FC connector's disconnect_volume(), matching the iSCSI connector.

    Related-Bug: #2004555
    Change-Id: Ia74ecfba03ba23de9d30eb33706245a7f85e1d66
    (cherry picked from commit 570df49db9de3030e658619138588b836c007f8c)
    Conflicts:
            os_brick/initiator/connectors/fibre_channel.py
    (cherry picked from commit 111b3931a2db1d5be4ebe704bf26c34fa9408483)
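
With this backport the FC connector honors the same keywords as the iSCSI connector shown earlier; a sketch under the same placeholder assumptions:

    from os_brick.initiator import connector

    # Placeholder values; real ones come from cinder and connect_volume().
    connection_properties = {'target_wwn': ['50060e801049cfd1'],
                             'target_lun': 1}
    device_info = {'type': 'block', 'path': '/dev/sdc'}

    fc = connector.InitiatorConnector.factory(
        'FIBRE_CHANNEL', root_helper='sudo', use_multipath=True)
    # force/ignore_errors now behave as in the iSCSI connector.
    fc.disconnect_volume(connection_properties, device_info,
                         force=True, ignore_errors=True)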

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/cinder/+/885553

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/cinder/+/885554

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/cinder/+/885555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/cinder/+/885556

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-brick/+/885558

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/os-brick/+/885559

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/os-brick/+/885560

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-brick (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/os-brick/+/885561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885558
Committed: https://opendev.org/openstack/os-brick/commit/5dcda6b961fa765c817f94a782a6fff48295c89a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5dcda6b961fa765c817f94a782a6fff48295c89a
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/wallaby, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I6345a5a3a7c08c88233b47806c28284fa2dd87d3
    Related-bug: #2004555

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885560
Committed: https://opendev.org/openstack/os-brick/commit/2845871c87fc4e6384bd16d81832cc71e2fb0d61
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 2845871c87fc4e6384bd16d81832cc71e2fb0d61
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/ussuri, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: Ie54cfc6697b4e54d37fd66dbad2ff20971399c00
    Related-bug: #2004555

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885559
Committed: https://opendev.org/openstack/os-brick/commit/78a0ea24a586139343c98821f9914901f1b5ec5b
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 78a0ea24a586139343c98821f9914901f1b5ec5b
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/victoria, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I37da3be26c7099307b46ae6b6320a3de7658e106
    Related-bug: #2004555

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-brick (stable/train)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/885561
Committed: https://opendev.org/openstack/os-brick/commit/0cc7019eec2b58f507905d52370a74eb80613b99
Submitter: "Zuul (22348)"
Branch: stable/train

commit 0cc7019eec2b58f507905d52370a74eb80613b99
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:29:20 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/train, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the os-brick code.

    Change-Id: I6d04c164521b72538665f53ab62250b14b2710fe
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/train)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885556
Committed: https://opendev.org/openstack/cinder/commit/299553a4fe281cde9b14da34a470dcdb3ed17cc0
Submitter: "Zuul (22348)"
Branch: stable/train

commit 299553a4fe281cde9b14da34a470dcdb3ed17cc0
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/train, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I1621e3d3d9272a7a25b2d9d9e6710efb6b637a89
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885554
Committed: https://opendev.org/openstack/cinder/commit/63d7848a9548180d283a833beb7c5718e0ad0bdb
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 63d7848a9548180d283a833beb7c5718e0ad0bdb
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/victoria, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I2866b0ca1511a53b096b73bbe51a74588cdd8947
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885555
Committed: https://opendev.org/openstack/cinder/commit/60f705d722fc6b7c434194a9f3b11595294d6aa0
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 60f705d722fc6b7c434194a9f3b11595294d6aa0
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/ussuri, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I5c55ab7ca6c85d23c5ab7d2d383a18226735aaf2
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/885553
Committed: https://opendev.org/openstack/cinder/commit/2fef6c41fa8c5ea772cde227a119dcf22ce7a07d
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 2fef6c41fa8c5ea772cde227a119dcf22ce7a07d
Author: Brian Rosmaita <email address hidden>
Date: Wed Jun 7 18:01:12 2023 -0400

    [stable-em-only] Add CVE-2023-2088 warning

    The Cinder project team does not intend to backport a fix for
    CVE-2023-2088 to stable/wallaby, so add a warning to the README
    so that consumers are aware of the vulnerability of this branch
    of the cinder code.

    Change-Id: I83b5232076250553650b8b97409cbf72e90c15b9
    Related-bug: #2004555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 23.0.0.0rc1

This issue was fixed in the openstack/cinder 23.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.0.0.0rc1

This issue was fixed in the openstack/nova 28.0.0.0rc1 release candidate.

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova wallaby-eom

This issue was fixed in the openstack/nova wallaby-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova xena-eom

This issue was fixed in the openstack/nova xena-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/884571
Reason: stable/victoria branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/victoria if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder xena-eom

This issue was fixed in the openstack/cinder xena-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/glance_store xena-eom

This issue was fixed in the openstack/glance_store xena-eom release.
