"An error occurred. Press enter to start a shell"

Bug #1946773 reported by dann frazier
This bug affects 3 people
Affects              Status   Importance  Assigned to  Milestone
subiquity            New      Undecided   Unassigned
cloud-init (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

20211011.1 arm64 ISO

I booted the ISO on an Ampere Altra-based Mt. Jade system using BMC emulated remote media. It failed with:

EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
[ 0.342647] cma: cma_alloc: reserved: alloc failed, req-size: 4096 pages, ret: -12
[ 0.721234] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.728821] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.736385] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.743941] cma: cma_alloc: reserved: alloc failed, req-size: 128 pages, ret: -12
[ 0.752328] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.759898] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.767449] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 0.775000] cma: cma_alloc: reserved: alloc failed, req-size: 128 pages, ret: -12
[ 0.783122] cma: cma_alloc: reserved: alloc failed, req-size: 256 pages, ret: -12
[ 1.383582] integrity: Couldn't parse db signatures: -74
[ 1.390192] integrity: Couldn't parse dbx signatures: -74
stdin: Invalid argument
passwd: password expiry information changed.
Using CD-ROM mount point /cdrom/
Identifying... [a7f48729e5851123bb63a950e3e865b7-2]
Scanning disc for index files...
Found 2 package indexes, 0 source indexes, 0 translation indexes and 1 signatures
Found label 'Ubuntu-Server 21.10 _Impish Indri_ - Release arm64 (20211011.1)'
This disc is called:
'Ubuntu-Server 21.10 _Impish Indri_ - Release arm64 (20211011.1)'
Copying package lists...gpgv: Signature made Mon Oct 11 18:26:50 2021 UTC
gpgv: using RSA key 843938DF228D22F7B3742BC0D94AA3F0EFE21092
gpgv: Good signature from "Ubuntu CD Image Automatic Signing Key (2012) <email address hidden>"
Reading Package Indexes... Done
Writing new source list
Source list entries for this disc are:
deb cdrom:[Ubuntu-Server 21.10 _Impish Indri_ - Release arm64 (20211011.1)]/ impish main restricted
Repeat this process for the rest of the CDs in your set.
[ 152.236747] systemd[1]: Failed unmounting /cdrom.
[FAILED] Failed unmounting /cdrom.

Ubuntu 21.10 ubuntu-server ttyAMA0

connecting...
waiting for cloud-init...
An error occurred. Press enter to start a shell

Tags: iso-testing
Revision history for this message
dann frazier (dannf) wrote :
description: updated
Revision history for this message
Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1946773

tags: added: iso-testing
Revision history for this message
Dan Bungert (dbungert) wrote :

@cloud-init - any thoughts on why `cloud-init status --wait` might exceed 10 minutes?

Revision history for this message
dann frazier (dannf) wrote :

I tried disabling the ISO verification check (as described in bug 1870337), but it didn't seem to help:

root@ubuntu-server:/var/log# cat /proc/cmdline
BOOT_IMAGE=/casper/vmlinuz quiet --- fsck.mode=skip

In fact, it seems like this method does not disable verification:

root@ubuntu-server:/var/log# tail syslog
Oct 12 16:15:34 ubuntu-server casper-md5check[3520]: Checking ./pool/main/o/openssh/openssh-sftp-server_8.4p1-6ubuntu2_arm64.deb..../pool/main/o/openssh/openssh-sftp-server_8.4p1-6ubuntu2_arm64.deb: OK
Oct 12 16:15:37 ubuntu-server casper-md5check[3520]: Checking ./pool/main/o/openvswitch/openvswitch-common_2.16.0-0ubuntu2_arm64.deb..../pool/main/o/openvswitch/openvswitch-common_2.16.0-0ubuntu2_arm64.deb: OK
Oct 12 16:15:41 ubuntu-server casper-md5check[3520]: Checking ./pool/main/o/openvswitch/openvswitch-switch_2.16.0-0ubuntu2_arm64.deb..../pool/main/o/openvswitch/openvswitch-switch_2.16.0-0ubuntu2_arm64.deb: OK
Oct 12 16:15:41 ubuntu-server casper-md5check[3520]: Checking ./pool/main/o/openvswitch/python3-openvswitch_2.16.0-0ubuntu2_all.deb..../pool/main/o/openvswitch/python3-openvswitch_2.16.0-0ubuntu2_all.deb: OK
Oct 12 16:15:41 ubuntu-server casper-md5check[3520]: Checking ./pool/main/o/os-prober/os-prober_1.79ubuntu1_arm64.deb..../pool/main/o/os-prober/os-prober_1.79ubuntu1_arm64.deb: OK
Oct 12 16:15:42 ubuntu-server casper-md5check[3520]: Checking ./pool/main/r/reiserfsprogs/reiserfsprogs_3.6.27-4build3_arm64.deb..../pool/main/r/reiserfsprogs/reiserfsprogs_3.6.27-4build3_arm64.deb: OK
Oct 12 16:15:42 ubuntu-server casper-md5check[3520]: Checking ./pool/main/w/wireless-regdb/wireless-regdb_2021.08.28-0ubuntu1_all.deb..../pool/main/w/wireless-regdb/wireless-regdb_2021.08.28-0ubuntu1_all.deb: OK
Oct 12 16:15:45 ubuntu-server casper-md5check[3520]: Checking ./pool/main/w/wpa/wpasupplicant_2.9.0-21build1_arm64.deb..../pool/main/w/wpa/wpasupplicant_2.9.0-21build1_arm64.deb: OK
Oct 12 16:15:45 ubuntu-server casper-md5check[3520]: Checking ./pool/main/x/xauth/xauth_1.1-1_arm64.deb..../pool/main/x/xauth/xauth_1.1-1_arm64.deb: OK
Oct 12 16:17:01 ubuntu-server CRON[4427]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
root@ubuntu-server:/var/log#

Revision history for this message
dann frazier (dannf) wrote :

Ah, apparently I need to put the fsck.mode param before the "---". That still feels like a difficult-to-discover workaround, though. In the past, I believe there were messages on the console that at least made it clear that a verification process was happening, with a message about using ^C to cancel it. An alternate boot option like "Install Ubuntu (skip ISO verification)" would be even nicer IMO.
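
The placement issue described above can be illustrated with a small sketch. It relies only on what this comment reports, namely that fsck.mode=skip had to appear before the "---" separator to take effect. The helper name add_before_separator is hypothetical and not part of any Ubuntu tool.

```python
def add_before_separator(cmdline: str, param: str, sep: str = "---") -> str:
    """Return cmdline with param placed just before the first `sep` token.

    Hypothetical helper for illustration only. If no separator is present,
    the parameter is simply appended.
    """
    tokens = cmdline.split()
    if param in tokens:
        # Drop the first existing copy so it can be re-placed.
        tokens.remove(param)
    index = tokens.index(sep) if sep in tokens else len(tokens)
    tokens.insert(index, param)
    return " ".join(tokens)


# The command line from the earlier comment, with the parameter after "---".
original = "BOOT_IMAGE=/casper/vmlinuz quiet --- fsck.mode=skip"
print(add_before_separator(original, "fsck.mode=skip"))
```

In practice the edit would be made by hand in the boot menu; the sketch only shows where the parameter ends up relative to the separator.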

Revision history for this message
James Falcon (falcojr) wrote :

Re cloud-init, in the log I see a 3 minute gap between when cloud-init's init stage ends and the modules stage starts, and then I see modules-final stage never starting. Something else in boot is preventing the modules-final stage from starting, so cloud-init will never finish.

Revision history for this message
dann frazier (dannf) wrote :

My theory is that it is the file checksumming, which is slow when using BMC-emulated media.

Revision history for this message
Dan Bungert (dbungert) wrote :

log analysis:

The last output from cloud-init was at 15:17:15
Subiquity started a 10 minute wait on cloud-init status at 15:19:31, finish at 15:29:30
casper-md5check was ongoing as of 15:31:42
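
To make the timeline above easier to read, here is a small sketch that computes the intervals between the three timestamps. The timestamps are taken from this comment; the variable and function names are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the log analysis above.
last_cloud_init_output = "15:17:15"
wait_started = "15:19:31"
wait_finished = "15:29:30"
md5check_still_running = "15:31:42"

# Length of subiquity's wait on cloud-init status.
print(seconds_between(wait_started, wait_finished))
# How long after the wait ended casper-md5check was still seen running.
print(seconds_between(wait_finished, md5check_still_running))
# Time since the last cloud-init output at that point.
print(seconds_between(last_cloud_init_output, md5check_still_running))
```

The point is that casper-md5check was still running more than two minutes after the 10 minute wait had already expired.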

Revision history for this message
Taihsiang Ho (tai271828) wrote :

"20211012 arm64 ISO" could reproduce the same issue with Mt. Jade (howzit)

var/log/installer/subiquity-server-debug.log shows subiquity timeout:

2021-10-13 12:47:18,086 ERROR subiquity.server.server:391 top level error
Traceback (most recent call last):
  File "/snap/subiquity/2824/lib/python3.8/site-packages/subiquity/server/server.py", line 579, in start
    await self.wait_for_cloudinit()
  File "/snap/subiquity/2824/lib/python3.8/site-packages/subiquity/server/server.py", line 487, in wait_for_cloudinit
    status_cp = await asyncio.wait_for(status_coro, 600)
  File "/snap/subiquity/2824/usr/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
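
The traceback shows asyncio.wait_for raising TimeoutError once its 600 second limit passes. A minimal sketch of that behaviour, and of one way a caller could handle it instead of letting it propagate, is below. This is not subiquity's code: slow_status and wait_for_status are hypothetical stand-ins, and the delays are shortened so the example runs quickly.

```python
import asyncio


async def slow_status(delay: float) -> str:
    # Hypothetical stand-in for a long-running status check.
    await asyncio.sleep(delay)
    return "done"


async def wait_for_status(delay: float, timeout: float):
    """Return the status, or None if it does not arrive within `timeout`."""
    try:
        return await asyncio.wait_for(slow_status(delay), timeout)
    except asyncio.TimeoutError:
        # Without this handler the exception reaches the caller,
        # as in the traceback above.
        return None


print(asyncio.run(wait_for_status(0.01, 1.0)))   # finishes in time
print(asyncio.run(wait_for_status(1.0, 0.01)))   # times out
```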

Revision history for this message
dann frazier (dannf) wrote :

I think BMC-emulated media is a strong enough use case for servers that we should do something to address it in a future release. While I don't think many server users would be using this method for mass installs, it seems like a reasonable way to "give Ubuntu a try" on a single piece of kit. It'd be a bummer if this was their experience. At minimum it'd be nice if the error message hinted that this might be the underlying problem, perhaps by spitting out a link to a "Known Issues" page that describes adding the fsck.mode=skip workaround.

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: New → Incomplete
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Does adding toram to the kernel command line help? It'll probably make boot very slow as it copies everything from the ISO to memory but things should go much faster once that's done (and hey, maybe sequential read performance isn't totally awful for the bmc-emulated media)

Revision history for this message
Taihsiang Ho (tai271828) wrote :

var log collected with impish rc image (211013) with mt. jade (howzit).

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

The crash on timeout is fixed in main fwiw and I'm just doing the necessary stuff to make sure it's fixed in jammy. Did anyone try adding toram to the kernel command line?

Revision history for this message
Alexandre Ghiti (alexghiti) wrote :

For the record, we ran into the same issue with the RISC-V live installer on the SiFive Unmatched board, whose logs are attached. In those logs you'll see a kernel bug that I'm currently investigating, which is likely caused by the md5check job.

Adding fsck.mode=skip to the command line fixed the issue for us.

Revision history for this message
dann frazier (dannf) wrote :

Just tested w/ a jammy image, and yeah - toram works, though it looks like it took about 50 minutes from boot to load the ISO over this interface.

Revision history for this message
Ling Zhou (zhouling12) wrote :
Revision history for this message
Ling Zhou (zhouling12) wrote :

I booted the jammy-live-server20220405-amd64 ISO on a Lenovo ThinkSystem SD630 V2; it failed as shown in the pic attached in #16. Results of adding boot options to the kernel command line:
1) toram: works, but it took about 50~60 minutes from boot to load the ISO over this interface;
2) fsck.mode=skip: could reach the installation interface, but halted at the following after some selections had been made during the installation:
subiquity/Early/apply_autoinstall_config
subiquity/Reporting/apply_autoinstall_config
subiquity/Error/apply_autoinstall_config
subiquity/Userdata/apply_autoinstall_config
subiquity/Package/apply_autoinstall_config
subiquity/Debconf/apply_autoinstall_config
subiquity/Kernel/apply_autoinstall_config
subiquity/Zdev/apply_autoinstall_config
subiquity/Late/apply_autoinstall_config
3) cloud-init=disabled:
3.1) could reach the installation interface, though with several "Failed" messages such as the following:
[Failed] Failed to start OpenBSD Secure Shell server.
See 'systemctl status ssh.service' for details.
3.2) the installation could succeed, though some overlapping messages appeared at the top of the screen
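
When comparing results like the three above, it helps to confirm where each option actually landed on the booted system's command line, since an earlier comment found that fsck.mode=skip had no effect when placed after the "---". The sketch below is a hypothetical helper (locate_options is not an existing tool) that reports each option as "before", "after", or "absent" relative to the separator. It makes no claim about whether toram or cloud-init=disabled are sensitive to placement.

```python
def locate_options(cmdline: str,
                   options=("toram", "fsck.mode=skip", "cloud-init=disabled"),
                   sep: str = "---") -> dict:
    """Report where each option sits relative to the `sep` token."""
    tokens = cmdline.split()
    split_at = tokens.index(sep) if sep in tokens else len(tokens)
    before, after = tokens[:split_at], tokens[split_at + 1:]
    result = {}
    for opt in options:
        if opt in before:
            result[opt] = "before"
        elif opt in after:
            result[opt] = "after"
        else:
            result[opt] = "absent"
    return result


# Command line quoted earlier in this bug (read from /proc/cmdline there).
print(locate_options("BOOT_IMAGE=/casper/vmlinuz quiet --- fsck.mode=skip"))
```

On a live system the input string would come from reading /proc/cmdline.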

Revision history for this message
Ling Zhou (zhouling12) wrote :

An addition to #17: the ISO was mounted through the XCC (BMC) remote mount (XCC (BMC) -> Remote Console -> Media -> Mount Local Media File).

Revision history for this message
dann frazier (dannf) wrote :

I wonder if there is some scope creep in this bug. The bug I filed has this symptom:

---------------------------------------------------------------------
Ubuntu 21.10 ubuntu-server ttyAMA0

connecting...
waiting for cloud-init...
An error occurred. Press enter to start a shell
---------------------------------------------------------------------

And that should be Fix Committed if I understand comment #13 correctly (possibly commit 05f3db94a?). Is this the symptom you are seeing, @zhouling12, meaning the fix may be incomplete, or are you seeing some other issue that is also correlated w/ slow/BMC-emulated media?

Revision history for this message
Jeff Lane  (bladernr) wrote :

Given it took so long, are you sure you're not hitting some timeout that's unmounting the ISO?

And can you put the ISO somewhere closer to the server? I've had success sharing it on an NFS share local to the server and mounting that via the XCC.

Revision history for this message
Ling Zhou (zhouling12) wrote :

to Comment 19: please see the pic just attached, the issue we encountered using jammy-live-server20220328-amd64 ISO has almost the same symptom:
---------------------------------------------------------------------
Ubuntu Jammy Jellyfish (development branch) ubuntu-server tty1

connecting...
waiting for cloud-init...
An error occurred. Press enter to start a shell
---------------------------------------------------------------------
When using the jammy-live-server20220405-amd64 ISO, the symptom changed somewhat (details in Comment 16), though it also failed.

Revision history for this message
Ling Zhou (zhouling12) wrote :

to Comment 20:
1) Yes, the ISO was definitely mounted; this issue is 100% reproducible.
2) NFS mount via the XCC is OK, as we tested.

Revision history for this message
Jeff Lane  (bladernr) wrote :

OK, so it works via remote media if you mount it locally using NFS.

And does the same hold true for the 20.04.4 LTS Server Live ISO, and the 21.10 Server Live ISO?

And can you install a desktop environment on a server next to the SR650 and on the same switch, rather than your laptop, and see if you can do this from one server to the next?

And is your laptop using wifi or ethernet? If ethernet, is it using a built in RJ45 jack, or a USB dongle? Is that USB dongle USB2 or USB3, and is it plugged into a USB2 or USB3 port?

I'm not seeing a problem here right now with the ISO or the installer, but it sounds much more like an environmental issue given that:

1: you can provide the ISO via NFS to the XCC and boot it
2: You cannot provide the ISO from your laptop to the XCC using locally mounted virtual media.

Point is, I don't think the issue you're seeing is actually related to this bug at this point.

Revision history for this message
dann frazier (dannf) wrote :

> Point is, I don't think the issue you're seeing is actually related to this bug at this point.

The error appears to be the same in the screenshot @zhouling12 provided - so it seems like we are still seeing this issue w/ high-latency media. Or do you have reason to believe the cause is different than what I reported here?

Having lower latency media does seem like a valid workaround but, IMO, high-latency media is a significant enough use case that we should try and support it. I don't know that we can expect users to all have systems "next door" to their install targets to use for media hosting.

Revision history for this message
Jeff Lane  (bladernr) wrote :

One more comment, I cannot reproduce this. To test, I installed Ubuntu Desktop 20.04 LTS onto a SR645 in a rack in my DC. I then accessed that desktop, opened Firefox, and accessed the XCC on a SR670 V2. I mounted the ISO image using local media, and was able to boot the SR670 V2 using an ISO mounted locally via the XCC, and was able to successfully install 22.04 LTS.

So the problem is in your environment somewhere.

Revision history for this message
Jeff Lane  (bladernr) wrote :

>Having lower latency media does seem like a valid workaround but, IMO, high-latency media is a significant enough use case that we should try and support it. I don't know that we can expect users to all have systems "next door" to their install targets to use for media hosting.

Fair enough. All I know is that I was unable to reproduce this even when mounting the ISO locally and going through three or four layers of web browser - BMC console - web browser - BMC console with desktop running on a server in the lab talking to another server in the lab.

Latency is a problem, for sure. I tried doing this the other day by mounting that ISO on my local desktop and booting the machine from few hundred miles away on a 1.5Mb/s uplink and it was... not pleasant. I suspect Zhou Ling is doing this from a windows laptop over wifi through multiple layers of access point/router/switch (though the wifi is a suspicion, and I wonder if this goes away if Zhou Ling plugs that laptop into the same network the server BMC is on using a GigE dongle and some cat-5).

Anyway, not my call to fix or not fix, it seemed to me that these were similar but not the same, but I am happy to be wrong on that.

Given that, I'm now curious about how this was set up on the Ampere server too, and if anyone tried what I did by putting Desktop on a local machine and mounting the ISO from there, rather than from a remote location outside the DC.

I suspect you were doing this from home as well (am I right?) and even with your uplink which has to be better than mine, the latency is still too great.

Revision history for this message
dann frazier (dannf) wrote : Re: [Bug 1946773] Re: "An error occurred. Press enter to start a shell"

On Wed, Apr 27, 2022 at 2:25 PM Jeff Lane <email address hidden> wrote:
> Given that, I'm now curious about how this was set up on the Ampere
> server too, and if anyone tried what I did by putting Desktop on a local
> machine and mounting the ISO from there, rather than from a remote
> location outside the DC.
>
> I suspect you were doing this from home as well (am I right?) and even
> with your uplink which has to be better than mine, the latency is still
> too great.

Yeah, I'm pretty sure I was using a browser on my laptop to feed media
to a server in the data center over a VPN. When I'm testing I
typically prioritize ease of setup over deployment time, because I can
just let the install go in the background and do other things while it
runs.

Revision history for this message
Ling Zhou (zhouling12) wrote :

> OK, so it works via remote media if you mount it locally using NFS.

> And can you install a desktop environment on a server next to the SR650 and on the same switch, rather than your laptop, and see if you can do this from one server to the next?
We tested not only with NFS but also with a direct connection to the laptop, which was also OK.

> And is your laptop using wifi or ethernet?
wifi. Ethernet is seldom used on notebooks, as far as I know.

> I'm not seeing a problem here right now with the ISO or the installer
Pls see the pic attached in Comment 16,
"
cloud failed to complete after 10 minutes of waiting. This
suggests a bug, which we would appreciate help understanding. If
you could file a bug at
https://bugs.launchpad.net/subiquity/+filebug and attach the
contents of /var/log, it would be most appreciated.
"
Now we have done so and reported it as a subiquity bug. Do you think the issue we reported is not the same one? Or should we file a new one?

> but it sounds much more like an environmental issue given that:
>1: you can provide the ISO via NFS to the XCC and boot it
>2: You cannot provide the ISO from your laptop to the XCC using locally mounted virtual media.
I think the ISO can be provided both via NFS and locally mounted, based on:
1) when the issue happened, we double-checked that the ISO was still mounted locally and had never gone away
2) when using the locally mounted ISO, "toram" works

Revision history for this message
Ling Zhou (zhouling12) wrote :

> Latency is a problem, for sure. I tried doing this the other day by mounting that ISO on my local desktop and booting
> the machine from few hundred miles away on a 1.5Mb/s uplink and it was... not pleasant. I suspect Zhou Ling is doing
> this from a windows laptop over wifi through multiple layers of access point/router/switch

Yes, I used my notebook running Windows 10 over a wifi network to do this. About my network environment: the tested server is in our lab, on the same floor as me, at a direct distance of just 10~20 meters. Although wifi was used, it was not over a VPN; the traffic may pass through an access point/router/switch, so I think the latency is probably lower than in practical use.

Revision history for this message
Ling Zhou (zhouling12) wrote :

to Jeff:
After receiving your email (2022/04/28), we used the ISO you provided
(https://releases.ubuntu.com/22.04/ubuntu-22.04-live-server-amd64.iso)
for the following tests.
Both tests used a notebook running Windows, connected over wifi, to access the XCC of a Lenovo ThinkSystem server.
To keep the results objective, I asked dedicated testers to do the testing:

1) Test by a tester from my team:
network environment:
the notebook and the server (SR650V2) were in the same lab, very close together
   also got the cloud-init failure report (see the attached pic test1-close.png), but after selecting the "Close" option the installation could continue and complete; after the installation completed, the OS booted normally
2) Test by a tester from PA, who is also the reporter of this bug in our bugzilla:
network environment:
the notebook and the server (SD630V2) were on the same floor, perhaps tens of meters apart; not over a VPN, but possibly through an access point/router/switch.
   got the cloud-init failure report, as in the attached pic test2-1-close.png ->
   selected "Close" to continue ->
   got a no-driver-found report, as in the pic test2-2-no driver.png ->
   got an install failure (test2-3-install fail.png); at this point we double-checked that the ISO was definitely mounted, see test2-4-ISO.png.

From test case 2), I assume that if one of our customers uses a browser on a notebook, over wifi and a VPN, to install Ubuntu 22.04 on a server in the DC via locally mounted media, the install will also fail. I think customers like dann frazier do exist. If such a customer asks us to resolve this, what shall we say? Put your notebook closer to the server, or do not use wifi? BMC products are advertised as letting administrators have a coffee outside and still do anything they want to the servers in the DC, and current notebooks use wifi as standard. I agree with Dann that "high-latency media is a significant enough use case that we should try and support it." Since our XCC can mount an ISO locally and install it to the target server, could you please help us resolve this issue? Thank you very much!

Revision history for this message
Brett Holman (holmanb) wrote :

I agree with falcojr and dbungert's analysis (#6 and #8, respectively) on cloud-init's role in this issue; this looks like a symptom that gets reported via cloud-init not an issue caused by it.

Changed in cloud-init (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Jeff Lane  (bladernr) wrote :

so is there any follow up here? Lenovo is also asking about this and still having issues
