cannot deploy ESXi to non-sda drive

Bug #1925722 reported by Andrey Grebennikov
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
curtin    Fix Released   Undecided    Dan Bungert

Bug Description

MAAS 2.9

I created a custom ESXi image and uploaded it to MAAS (I believe the problem applies to any custom image).

The machine has two disks, and I want to use "sdb" as the datastore/root/everything.

Curtin doesn't respect "custom" config of the storage and follows "simple" flow and deploys the OS on "sda", more precisely to the first block device in the array (see last two lines of the log below)

(I'm emulating this in VMs, hence vda and vdb in the logs; however, the same behaviour happens on physical machines.)

In the UI, I set the boot drive to "sdb" and select "create custom layout" with VMFS6.

Once the deployment is started I check the get-curtin-config and it looks right.
However, the curtin log shows the following:

2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: get_blockdev_sector_size: info:
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: {
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "vdb": {
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "ALIGNMENT": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "DISC-ALN": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "DISC-GRAN": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "DISC-MAX": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "DISC-ZERO": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "FSTYPE": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "GROUP": "disk",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "KNAME": "vdb",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "LABEL": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "LOG-SEC": "512",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "MAJ:MIN": "252:16",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "MIN-IO": "512",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "MODE": "brw-rw----",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "MODEL": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "MOUNTPOINT": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "NAME": "vdb",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "OPT-IO": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "OWNER": "root",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "PHY-SEC": "512",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "RM": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "RO": "0",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "ROTA": "1",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "RQ-SIZE": "128",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "SIZE": "21474836480",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "STATE": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "TYPE": "disk",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "UUID": "",
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: "device_path": "/dev/vdb"
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: }
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: }
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: get_blockdev_sector_size: (log=512, phys=512)
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: Running command ['lsblk', '--noheadings', '--bytes', '--pairs', '--output=ALIGNMENT,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,FSTYPE,GROUP,KNAME,LABEL,LOG-SEC,MAJ:MIN,MIN-IO,MODE,MODEL,MOUNTPOINT,NAME,OPT-IO,OWNER,PHY-SEC,RM,RO,ROTA,RQ-SIZE,SIZE,STATE,TYPE,UUID'] with allowed return codes [0] (capture=True)
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: Checking if /dev/vdb is a swap device
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: Found swap magic: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: wiping superblock on /dev/vdb
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: wiping /dev/vdb attempt 1/4
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: wiping 1M on /dev/vdb at offsets [0, -1048576]
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: successfully wiped device /dev/vdb on attempt 1/4
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: Running command ['udevadm', 'info', '--query=property', '/dev/vdb'] with allowed return codes [0] (capture=True)
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: devname '/sys/class/block/vda' had holders: []
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: devname '/sys/class/block/vdb' had holders: []
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: finish: cmd-install/stage-partitioning/builtin/cmd-block-meta/clear-holders: SUCCESS: removing previous storage devices
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: blockmeta: detected dd-images, using mode=simple
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: 'custom' mode but multiple devices given. using first found
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: mode is 'custom'. multiple devices given. using '/dev/vda' (first available)
2021-04-23T03:50:51-05:00 kw2 cloud-init[1260]: installing in 'custom' mode to 'vda'

This comes from the curtin code here
https://github.com/canonical/curtin/blob/d49d35bc6643b063f085d870ea94a53677ae141c/curtin/commands/block_meta.py#L104

The code detects a dd image and calls the "meta_simple" method instead of "meta_custom", so the OS is always installed to the first device in the list.
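
As a rough illustration of that dispatch, the sketch below shows how the presence of a dd-type source can override the requested mode. It is not curtin's code: the helper name pick_block_meta_mode, its parameters, and the shape of the sources dict are assumptions made only for this example.

```python
def pick_block_meta_mode(sources, requested_mode, force_mode=False):
    """Return the mode a dd-aware dispatcher would end up running.

    Hypothetical helper: 'sources' is assumed to map a name to either a
    URI string or a dict with 'type' and 'uri' keys.
    """
    dd_images = [
        src for src in sources.values()
        if isinstance(src, dict)
        and str(src.get("type", "")).startswith("dd-")
    ]
    # A dd image is an opaque blob, so it is written out via the simple
    # path unless the caller explicitly forces the requested mode.
    if dd_images and not force_mode:
        return "simple"
    return requested_mode
```

With a dd-type source the function returns "simple" even when "custom" was requested, which matches the "detected dd-images, using mode=simple" line in the log.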

Tags: fr-1306 sts

description: updated
Dan Bungert (dbungert)
Changed in curtin:
status: New → In Progress
assignee: nobody → Dan Bungert (dbungert)
tags: added: fr-1306
Andrey Grebennikov (agrebennikov) wrote :

added messages file captured during the VM deployment.

Dan Bungert (dbungert) wrote :

Reformatted config file - easier to read than the log file version.

Dan Bungert (dbungert) wrote :

Here is the proposed change.

https://pastebin.ubuntu.com/p/cvNd4Kbcky/

Andrey Grebennikov (agrebennikov) wrote :

here is the physical server's full installation log for visibility
https://pastebin.ubuntu.com/p/Vktdf3NR2t/

Ryan Harper (raharper) wrote :

Why do the disks on physical or virtual not have a serial number?

Looking at the physical logs, they certainly do have serial numbers.

2021-04-24T01:12:48-05:00 dell2 cloud-init[1606]: get_path_to_storage_volume for volume sda
2021-04-24T01:12:48-05:00 dell2 cloud-init[1606]: Processing serial 6842b2b01b006d0024858f55183677de via udev to 6842b2b01b006d0024858f55183677de
2021-04-24T01:12:48-05:00 dell2 cloud-init[1606]: lookup_disks found: ['wwn-0x6842b2b01b006d0024858f55183677de', 'scsi-36842b2b01b006d0024858f55183677de', 'wwn-0x6842b2b01b006d0024858f55183677de-part1', 'scsi-36842b2b01b006d0024858f55183677de-part1']
2021-04-24T01:12:48-05:00 dell2 cloud-init[1606]: lookup_disks realpath(wwn-0x6842b2b01b006d0024858f55183677de)=/dev/sda

Is it not included in your custom storage config? No serial numbers in the config.

> Here is the proposed change.
> https://pastebin.ubuntu.com/p/cvNd4Kbcky/

The risk here, especially with physical systems, is that the path (/dev/sda) is not reliable. That is, from boot to boot, the path to a specific disk may differ (sda may point to a different physical device).

MAAS certainly knows (physical system) or can construct (virtual) disks with serial numbers to ensure consistent and reliable deployments.

I would suggest this is not a bug, but rather configuration error (missing serial number in disk config).
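
To make the serial-versus-path point concrete, here is a minimal sketch of resolving a disk by serial through the persistent /dev/disk/by-id symlinks, similar in spirit to the lookup_disks lines in the log above. The function name find_disk_by_serial and its by_id_dir parameter are hypothetical and are not curtin's API.

```python
import os


def find_disk_by_serial(serial, by_id_dir="/dev/disk/by-id"):
    """Resolve a whole-disk device path from its serial (hypothetical helper).

    The by-id symlink names embed the serial/WWN, so the result stays
    stable even if the kernel hands out sda/sdb in a different order.
    """
    matches = sorted(
        name for name in os.listdir(by_id_dir)
        if serial in name and "-part" not in name  # skip partition links
    )
    if not matches:
        raise ValueError("no disk with serial %r under %s" % (serial, by_id_dir))
    return os.path.realpath(os.path.join(by_id_dir, matches[0]))
```

A caller would pass the serial from the storage config and use the returned path as the install target, rather than trusting a name such as /dev/sda.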

Dan Bungert (dbungert) wrote :

I think I understand what's happening. I'll finish up a proposed fix this morning and test it and report back.

Dan Bungert (dbungert) wrote :

The virtual case listed in the original report doesn't have serials. I have been told by Lee Trager that we can assume we get the serial value in the virtual case. I can't ask MAAS to supply a value it doesn't have.

The physical case, in the newer log, does have the serial, and still has the problem. Diff pending that should address that item.

Dan Bungert (dbungert) wrote :

*that we CAN'T assume we get the serial value in the virtual case

Andrey Grebennikov (agrebennikov) wrote :

Ryan, probably a bit of misunderstanding.
The absence of serial only happens in case of the VM deployment.
With the physical server the serial is always in place.
I originally thought that I could reproduce the problem with the VM (still waiting for the requestor to share the full log), but later I found a way to deploy onto a physical server and I shared a log from it in my last comment.

Dan Bungert (dbungert) wrote :

Here is my updated and untested proposal. I still plan to test this this morning.
https://pastebin.ubuntu.com/p/DfQqmqDZQJ/

Ryan Harper (raharper) wrote :

> Curtin doesn't respect "custom" config of the storage and follows "simple" flow and deploys the OS on "sda", more precisely to the first block device in the array (see last two lines of the log below)

MAAS and curtin only support writing custom images as opaque blobs with DD. When the image is dd-* type, curtin deploys via meta_simple as it's going to just write out the image to the specified target device.

As with all curtin configurations you should specify the disk serial number of the target disk on which you expect to install your image.
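
One way to catch the missing-serial case before deploying is to scan the storage config for disk entries that carry no serial. This is only a sketch: disks_missing_serial is a hypothetical helper, and the config is assumed to be the list found under storage → config, shaped like the entries shown elsewhere in this report.

```python
def disks_missing_serial(storage_config):
    """Return the ids of 'disk' entries that carry no serial.

    Hypothetical pre-flight check; 'storage_config' is assumed to be a
    list of dicts with 'type', 'id' and optionally 'serial' keys.
    """
    return [
        item.get("id")
        for item in storage_config
        if item.get("type") == "disk" and not item.get("serial")
    ]
```

An empty result means every disk can be matched by serial; a non-empty result lists the disks that could only be matched by path.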

Ryan Harper (raharper) wrote :

> but later I found a way to deploy onto a physical server and I shared a
> log from it in my last comment.

Thanks; this is helpful.

'storage': {'config': [
    {'id': 'sda', 'model': 'PERC H700', 'name': 'sda',
      'serial': '6842b2b01b006d0024858f55183677de',
      'type': 'disk', 'wipe': 'superblock'},
    {'grub_device': True, 'id': 'sdb', 'model': 'MP0402H',
     'name': 'sdb', 'ptable': 'gpt', 'serial': '581270040222',
     'type': 'disk', 'wipe': 'superblock'},
...
]

Two disks are identified, and both are required to be wiped. Then the
dd-image will take the meta_simple path and sdb is selected as the
target (due to grub_device: True).

Then here's the bug:

'custom' mode but multiple devices given. using first found
2021-04-24T01:12:50-05:00 dell2 cloud-init[1606]: mode is 'custom'. multiple devices given. using '/dev/sda' (first available)

In the code, we examine args.devices, which was set to both disks
because both of them are included in the storage config for wiping.

The logic which looks through the list of devices when more than
one is supplied did not account for devpath being set.

The cleanest fix here is to only select from the devices list if
devpath did not get set.

diff --git a/curtin/commands/block_meta.py b/curtin/commands/block_meta.py
index cf6bc025..e2f37201 100644
--- a/curtin/commands/block_meta.py
+++ b/curtin/commands/block_meta.py
@@ -2007,17 +2007,20 @@ def meta_simple(args):
     elif len(devices) == 0 and devpath:
         devices = [devpath]

-    if len(devices) > 1:
-        if args.devices is not None:
-            LOG.warn("'%s' mode but multiple devices given. "
-                     "using first found", args.mode)
-        available = [f for f in devices
-                     if block.is_valid_device(f)]
-        target = sorted(available)[0]
-        LOG.warn("mode is '%s'. multiple devices given. using '%s' "
-                 "(first available)", args.mode, target)
+    if devpath is not None:
+        target = devpath
     else:
-        target = devices[0]
+        if len(devices) > 1:
+            if args.devices is not None:
+                LOG.warn("'%s' mode but multiple devices given. "
+                         "using first found", args.mode)
+            available = [f for f in devices
+                         if block.is_valid_device(f)]
+            target = sorted(available)[0]
+            LOG.warn("mode is '%s'. multiple devices given. using '%s' "
+                     "(first available)", args.mode, target)
+        else:
+            target = devices[0]
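
The selection order in that diff can be exercised in isolation with a small stand-alone sketch. The function choose_target and its is_valid_device parameter are hypothetical stand-ins (the latter for block.is_valid_device), so this shows the intended precedence only and is not the patched curtin function.

```python
def choose_target(devices, devpath=None, is_valid_device=lambda path: True):
    """Pick the install target, preferring an explicitly resolved devpath.

    Hypothetical stand-alone version of the selection logic: 'devices'
    is the list of candidate paths and 'devpath' is the path resolved
    from the storage config, if any.
    """
    if devpath is not None:
        # An explicit target always wins over the candidate list.
        return devpath
    if len(devices) > 1:
        # No explicit target: fall back to the first valid device by name.
        available = [path for path in devices if is_valid_device(path)]
        return sorted(available)[0]
    return devices[0]
```

With devices ['/dev/sda', '/dev/sdb'] and devpath '/dev/sdb' this returns '/dev/sdb'; without devpath it falls back to '/dev/sda', which is the behaviour reported in this bug.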

Ryan Harper (raharper) wrote :

> Here is my updated and untested proposal. I still plan to test this this morning.
> https://pastebin.ubuntu.com/p/DfQqmqDZQJ/
>
> * Some disks may not have serial indicated, and that's OK. Use the
> supplied path instead if present.

I would avoid relaxing this to prevent using the wrong disk. MAAS knows the serial
numbers of the disks.

> * grub_device means we mean to image THAT device, so we shouldn't keep
> the entire list of devices as plausible image targets.

grub_device can be set on more than one disk, in which case we still need to handle being given two valid targets and selecting one.

+    if i.get("grub_device"):
+        serial = i.get("serial")
+        if serial:
+            devices = [block.lookup_disk(serial)]
+            LOG.info("choosing device %s based on grub_device flag "
+                     "and serial", devices[0])
+            break

This is the general idea; as soon as we meet the selection criteria
we can exit the loop and we have THE specified disk. Note that
block.lookup_disk(serial) may not find the disk, in which case it throws
a ValueError; I suggest we continue searching if the lookup fails.
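
A minimal sketch of that loop, including the suggestion to keep searching when a lookup fails, might look like the following. The name pick_grub_device is hypothetical, and lookup_disk is passed in as a parameter standing in for block.lookup_disk so the example runs without curtin installed.

```python
def pick_grub_device(storage_config, lookup_disk):
    """Return the path of the first grub_device disk whose serial resolves.

    Hypothetical helper: 'lookup_disk' maps a serial to a device path and
    is assumed to raise ValueError when the disk cannot be found.
    """
    for item in storage_config:
        if item.get("type") != "disk" or not item.get("grub_device"):
            continue
        serial = item.get("serial")
        if not serial:
            continue
        try:
            return lookup_disk(serial)
        except ValueError:
            # This disk could not be found; keep searching the others.
            continue
    return None
```

Returning None leaves the caller free to fall back to its existing selection among the candidate devices.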

Michael Duarte (mikduart) wrote :

Above is the attached log from our use case that Dan requested

Dan Bungert (dbungert) wrote :

Michael, thanks for the log. I think it looks similar to the https://bugs.launchpad.net/curtin/+bug/1925722/comments/4 log so I think we're on the right track. This was helpful as it provides some level of confirmation that we're chasing the right issue.

Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit def7d5b1 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=def7d5b1

Changed in curtin:
status: In Progress → Fix Committed
Seyeong Kim (seyeongkim)
tags: added: sts
Dan Bungert (dbungert) wrote : Fixed in curtin version 21.3.

This bug is believed to be fixed in curtin in version 21.3. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Tom Kivlin (tomkivlin) wrote :

Thank you.

Do you know which version of MAAS includes that version of curtin, how to find out, or how to update curtin within a MAAS installation? I've done a quick web search but am struggling to find the answer.
