Comment 32 for bug 1828617

Corey Bryant (corey.bryant) wrote :

Thanks for testing. That should rule out udev as the cause of the race.

A couple of observations from the log:

* There is a per-OSD loop that calls 'ceph-volume lvm trigger' up to 30 times until the OSD is activated (sketched after the log excerpt below), for example for osd.4:
[2019-05-31 01:27:29,235][ceph_volume.process][INFO ] Running command: ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,435][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,530][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:35,531][systemd][WARNING] failed activating OSD, retries left: 30
[2019-05-31 01:27:44,122][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:44,174][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:44,175][systemd][WARNING] failed activating OSD, retries left: 29
...
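
For context, this is roughly what that loop amounts to. A minimal sketch only, not ceph-volume's actual code; the command and retry count are taken from the log above, and the sleep interval between attempts is an assumption:

#!/usr/bin/env python3
"""Sketch of the activation retry loop visible in the log (illustrative only)."""
import subprocess
import time

def trigger_with_retries(osd_spec, retries=30, interval=5):
    """osd_spec is '<osd id>-<osd fsid>', e.g. '4-7478edfc-...'."""
    while True:
        # Ask ceph-volume to activate the OSD from its LVM metadata.
        result = subprocess.run(['ceph-volume', 'lvm', 'trigger', osd_spec],
                                capture_output=True, text=True)
        if result.returncode == 0:
            # Activation succeeded.
            return True
        print('command returned non-zero exit status: %d' % result.returncode)
        if retries == 0:
            return False
        print('failed activating OSD, retries left: %d' % retries)
        retries -= 1
        time.sleep(interval)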

I wonder if we could have similar 'ceph-volume lvm trigger' calls (or another call with the same goal) that wait for the WAL and DB devices of each OSD. Does that even make sense? We should be able to determine whether an OSD has a DB or WAL device from the LVM tags.
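
For example, something along these lines could read that association back from the tags that 'ceph-volume lvm' writes on each OSD's logical volume. This is a rough sketch only, assuming the usual ceph.osd_id/ceph.type/ceph.db_device/ceph.wal_device tags; the helper names here are made up for illustration:

#!/usr/bin/env python3
"""Sketch: find an OSD's DB/WAL devices from the LVM tags (illustrative only)."""
import subprocess

def lvm_tags_per_lv():
    """Return one dict of ceph.* tags per LV, parsed from 'lvs -o lv_tags'."""
    out = subprocess.run(
        ['lvs', '--noheadings', '--readonly', '-o', 'lv_tags'],
        capture_output=True, text=True, check=True).stdout
    lvs = []
    for line in out.splitlines():
        tags = {}
        for tag in line.strip().split(','):
            if '=' in tag:
                key, value = tag.split('=', 1)
                tags[key] = value
        if tags:
            lvs.append(tags)
    return lvs

def osd_external_devices(osd_id):
    """Return the db/wal devices recorded on the OSD's block LV, if any."""
    for tags in lvm_tags_per_lv():
        if tags.get('ceph.osd_id') == str(osd_id) and tags.get('ceph.type') == 'block':
            return {k: tags[k]
                    for k in ('ceph.db_device', 'ceph.wal_device') if k in tags}
    return {}

if __name__ == '__main__':
    print(osd_external_devices(4))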

* The first 3 OSDs to be activated are 18, 4, and 11, and they are the 3 that are missing block.db/block.wal symlinks. That's just more confirmation that this is a race:
[2019-05-31 01:28:03,370][systemd][INFO ] successfully trggered activation for: 18-eb5270dc-1110-420f-947e-aab7fae299c9
[2019-05-31 01:28:12,354][systemd][INFO ] successfully trggered activation for: 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:28:12,530][systemd][INFO ] successfully trggered activation for: 11-33de740d-bd8c-4b47-a601-3e6e634e489a