module ssh-authkey-fingerprints fails Input/output error: /dev/console

Bug #1473527 reported by Scott Moser
46
This bug affects 8 people
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

Occasionally in vivid and wily I see:
  util.py[WARNING]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed

looking at the log shows:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 684, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 198, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 105, in handle
    key_entries, hash_meth)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py", line 91, in _pprint_key_entries
    stderr=False, console=True)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 434, in multi_log
    with open(conpath, 'w') as wfh:
  OSError: [Errno 5] Input/output error: '/dev/console'

the command line on this system looks like:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.19.0-22-generic root=UUID=f2b9580d-c4c4-4ba0-8c76-42c99059932f ro --- console=ttyS1,115200

The console is attached to an ipmi serial console.

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: cloud-init 0.7.7~bzr1091-0ubuntu3
ProcVersionSignature: User Name 3.19.0-22.22-generic 3.19.8-ckt1
Uname: Linux 3.19.0-22-generic x86_64
ApportVersion: 2.17.2-0ubuntu1.1
Architecture: amd64
Date: Fri Jul 10 18:12:56 2015
PackageArchitecture: all
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

My current guess on this is that the issue stems from
 /lib/systemd/system/cloud-final.service
having:
 StandardOutput=journal+console

and then also trying to directly write to /dev/console via 'multi_log'.

Revision history for this message
Scott Moser (smoser) wrote :

i attempted to just have multiple programs open /dev/console (or /dev/ttyS1) and catch a failure to write in parallel, but couldnt seem to catch it. heres the attempt.

Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote :
Jon Grimm (jgrimm)
Changed in cloud-init (Ubuntu):
assignee: nobody → Wesley Wiedenmeier (wesley-wiedenmeier)
Revision history for this message
Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

I spent a while trying to reproduce this well:

  - I tried many vm configurations running 'serial-shell-looper', both manually
    and started by cloud-init but it didn't break
  - I wrote a similar script using util.meta_log to see if the difference in
    implementation between python open() and shell piping would make a
    difference. I wasn't able to find anything useful though.
    http://paste.ubuntu.com/15843875/
  - I was able to reproduce about 9 times out of 10 using XenialTestBasic with
    no modifications. After removing most of the functionality of the test
    other than basic booting (no curtin cmd, no curtin archive, no extra
    disks), it still failed just as reliably. There were still occasional cases
    where there was no failure though
  - Since these failures have been occurring much more recently, I reverted the
    net.ifnames=0 removal and ran vmtests several times, and did not see any
    failures. I enabled and disabled this parameter many times to make sure,
    but it appears that this issue appears almost always with ifnames enabled
    and never with them disabled, suggesting that somehow that I haven't
    figured out yet naming of network devices is shifting timing enough that it
    can toggle this error on and off
  - I was able to reproduce it just as often using a modified version of the
    cloud-init vmtests, using both a cloud-init deb built from the current
    revision of cloud-init and a deb built from cloud-init at revision 1188,
    before the new networking code was merged in. In both versions, this error
    almost always occurred when running with ifnames enabled and never occurred
    when running with ifnames disabled.

I'm not really sure how to reproduce this error on a small scale yet. I am
going to try to figure out what could be running concurrently with
cc_ssh_authkey_fingerprints and see if I can figure anything else out from
there. I haven't yet tried disabling StandardOutput=journal+console in
cloud-final.service, but I will give that a try as well, although it is already
present in wily and wily does not seem to have this issue

The only idea I have so far for underlying cause is flow control on
/dev/console. Since serial console is being forwarded to a file over ipmi by
qemu and is write only it may be possible that somehow something expects to
read from there (maybe agetty?) and flow control is causing a block. I'm not
sure if that makes sense though. The main thing that suggests that is a series
of bugs in several different mailing lists about syslog-ng writing directly to
/dev/console causing hangs in some situations such as when traffic from
/dev/console is being forwarded to a device that temporarily goes offline,
causing write to block.
http://comments.gmane.org/gmane.comp.syslog-ng/10561

Revision history for this message
Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

I don't think that this is related to the cloud-final.service file including StandardOutput=journal+console, the issue occurs in the same situations as before with standard output being redirected only to journal, or to console or to neither

Revision history for this message
Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

Something interesting I noticed while looking into this is that when xenial images are booted without "net.ifnames=0" the /var/lib/cloud/instance/ directory is not populated. Even if no datasource is provided for the image it should still create a blank cloud-config.txt and the directory structure in /var/lib/cloud/ should be set up, so I think that there is a bug somewhere else in cloud-init causing this. I think there is a good chance that this is what's triggering the issue with cc_ssh_authkey_fingerprints we're seeing here, but I am not sure how yet. I want to try to figure out why this is happening though, because in some conditions, like in curtin vmtests XenialTestNetwork everything is working correctly.

The issue with /var/lib/cloud/instance/ seems to occur both with current cloud-init and with cloud-init at revision 1188, so it is not linked to the new networking code.

Revision history for this message
Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

I modified the script that I wrote to try to reproduce this error and was able to consistantly see IO errors when the script was added to the cloud-config-archive being passed to an image. Instead of writing only to either /dev/console or the real serial device which was /dev/ttyS0 here, each thread writes to both and calls .flush() on one after the other. Without writing to both /dev/console and /dev/ttyS0 I have not seen any IOErrors, so I believe that this may have something to do with the mechanism for /dev/console being forwarded into /dev/ttyS0.

I have not yet been able to cause an IO error using this script on its own, only when it is being run by cloud-init. I think this might be because of the cloud-final.service file specifying output from cloud-init to be written to console as well, causing another simultaneous write. I will work on trying to modify this script so that it can trigger the error when being run on its own as well.

I think that in the original error occuring during curtin vmtests some other process may have been writing to /dev/ttyS0 as cc_ssh_authkeys was writing to /dev/console via util.multi_log, causing this error to occur

With this script in cloud-init userdata I have seen at least one or two OSErrors in serial output each time the system has been tested.

New script: http://paste.ubuntu.com/15906498/
Example serial.log output from launch framework similar to curtin vmtests: http://paste.ubuntu.com/15906571/
There is a well formatted stacktrace on line 6730 and several others higher up that are difficult to read

Revision history for this message
Paride Legovini (paride) wrote :

This is now consistently happening with the Groovy daily images on i3.metal EC2 instances. Logs attached.

Revision history for this message
Paride Legovini (paride) wrote :

On Focal:

root@ip-172-31-24-163:~# ls -l /dev/console
crw------- 1 root root 5, 1 Sep 21 16:07 /dev/console
root@ip-172-31-24-163:~# echo x > /dev/console
root@ip-172-31-24-163:~#

On Groovy:

root@ip-172-31-20-184:~# ls -l /dev/console
crw--w---- 1 root tty 5, 1 Sep 21 16:03 /dev/console
root@ip-172-31-20-184:~# echo x > /dev/console
bash: echo: write error: Input/output error

Revision history for this message
Paride Legovini (paride) wrote :

Focal:

root@ip-172-31-24-163:~# dmesg | grep console
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1024-aws root=PARTUUID=a2f52878-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
[ 2.184073] ACPI: SPCR: console: uart,mmio,0xc5a00000,115200
[ 2.189603] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-1024-aws root=PARTUUID=a2f52878-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
[ 3.954119] printk: console [tty1] enabled
[ 4.022533] printk: console [ttyS0] enabled
[ 4.968930] printk: console [ttyS0] disabled
[ 5.910255] printk: console [ttyS0] enabled

Groovy:

root@ip-172-31-20-184:~# dmesg | grep console
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-1004-aws root=PARTUUID=654e5f57-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
[ 1.440395] ACPI: SPCR: console: uart,mmio,0xc5a00000,115200
[ 1.445875] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-1004-aws root=PARTUUID=654e5f57-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 panic=-1
[ 3.207680] printk: console [tty1] enabled
[ 3.277829] printk: console [ttyS0] enabled
[ 3.561696] fbcon: Taking over console

Note that Groovy has:

[ 3.561696] fbcon: Taking over console

Maybe in newer kernels /dev/console is taken over by a broken framebuffer console?

Revision history for this message
Paride Legovini (paride) wrote :

This is a linux-aws kernel issue, filed:

https://bugs.launchpad.net/cloud-images/+bug/1896604

We may want to discuss if cloud-init should handle errors in writing to /dev/console more gracefully.

Changed in cloud-init (Ubuntu):
assignee: Wesley Wiedenmeier (wesley-wiedenmeier) → nobody
status: Confirmed → New
importance: Medium → Undecided
Revision history for this message
Paride Legovini (paride) wrote :

I reset the bug status as after >4 years it may be time to retriage it.

Revision history for this message
Paride Legovini (paride) wrote :

The problem I hit (LP: #1896604) is not a cloud-init bug.

It seems that nobody stumbled on the original issue in the last ~4 years. Lacking activity and a reproducer I'm marking this as Incomplete, let's see if anybody chimes it claiming it still affects them.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Revision history for this message
John Chittum (jchittum) wrote :

I can reproduce locally doing the following (fails on focal and groovy). I did this using the cloud-images.ubuntu.com VMDK, Virtualbox, and a simple cloud-init with ssh key and username/pw w/ ssh pw auth on. I got the following error:

```
2020-10-28 13:32:33,680 - util.py[WARNING]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
2020-10-28 13:32:33,680 - util.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 402, in multi_log
    wfh.flush()
OSError: [Errno 5] Input/output error
```

TESTING:

Requirements:

Virtualbox
cloud-localds command (to make your local ISO of cloud-init data)

1. download the latest VMDK (groovy): http://cloud-images.ubuntu.com/groovy/current/
2. make a simple cloud-init w/ ssh pub key and pw login (to ensure you can get in)

#cloud-config
chpasswd:
list: |
  ubuntu:ubuntu
expire: False
ssh_pwauth: True
ssh_authorized_keys: <YOUR_PUB_KEY>

3. create the iso using cloud-localds

` cloud-localds my-seed.iso my-cloud-config.yaml `

4. Open Virtualbox
5. Create New
6. Under Hard Disk, choose "Use Existing" and choose the VMDK and create
7. select image on Left, right click and choose Settings, or hit ctrl+s
8. go to Storage
9. Click on the +CD(Add Optical Drive) beside Controller: IDE
10. select your cloud-init iso (my-seed.iso)
11. click on drive, and set Attributes: Optical Drive to IDE Primary Master and hit OK
12. Boot machine

Vbox will show you the console as it goes along. once booted

1. login with provided pw
2. `cloud-init status` will show failed
3. check cloud-init log

Revision history for this message
John Chittum (jchittum) wrote :

This can also be configured to pass by having virtualbox provide a serial console connect:

1. select image on Left, right click and choose Settings, or hit ctrl+s
2. Choose Serial Ports
3. Enable Port 1 as follows:
  a. Enable
  b. COM1
  c. IRQ: 4
  d. I/O Port: 0x3F8
  e. Port Mode: Raw File
  f. Path/Address: <INSERT PATH AND FILE NAME>

After connecting this, I observed that cloud-init finished all parts successfully.

Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks John for the reproducer steps and also for the options for workaround this issue. It does seem like a valid bug that cloud-init could address.

Changed in cloud-init (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Paride Legovini (paride) wrote :

Thanks John for the reproducer.

(Per standup discussion) what we can do is probably handle the case where /dev/console exists but can't be written to more gracefully. I think this is the only place where /dev/console is written to:

https://github.com/canonical/cloud-init/blob/master/cloudinit/util.py#L399

Revision history for this message
Scott Moser (smoser) wrote :

There are a slew of bugs and apparently random fallout of /dev/console not working.
In my opinion, its really *not* helpful to just ignore that 'write' to stdout fails. Ignoring errors is never really a solution. In this case, cloud-init may have specifically opened /dev/console. But in other cases, it just writes to its stdout or stderr. It does not know that that output is destined for /dev/console or a file, and should not just ignore the errors.

The real fix for this is to fix the kernel or the OS in some way so that writes to /dev/console always succeed.

Revision history for this message
Dan Watkins (oddbloke) wrote :

The code causing the failure (which Paride linked to) does specifically know that the output is destined for /dev/console:

    if console:
        conpath = "/dev/console"
        if os.path.exists(conpath):
            with open(conpath, 'w') as wfh:
                wfh.write(text)
                wfh.flush()

I agree that we should not generically ignore issues with writing, but this would be a targetted fix for specifically this case.

The general issue here, AIUI, is that the kernel command line and the cloud have to be in agreement about how /dev/console is configured: if the kernel command line specifies a console then the kernel will configure one, even if there is no corresponding console provided by the cloud. On first boot, users have no way of aligning the kernel's default configured console with the console that the cloud provides (or, rather, the lack thereof), so cannot do anything to avoid this traceback.

(Cloud-specific Ubuntu images generally have this configured correctly, but if you're bringing a generic cloud image to $platform, then there's no guarantee that they will be aligned.)

Revision history for this message
Ingo Bauersachs (ibauersachs) wrote :

This also affects .ova cloud-image deployments on ESXi/vSphere, not surprising since it's similar to what John describes in comment 15. I think this is caused by the kernel parameter console=ttyS0 and the underlying VM having no serial port. This should not fail cloud-init and a similar bug was fixed in #1573095.
Commit b794d426b9ab43ea9d6371477466070d86e10668 introduces a stdout fallback when /dev/console does not exist. How about doing that fallback also when writing fails?

$ dmesg | grep console
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-72-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
[ 0.047833] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-72-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
[ 0.106603] printk: console [tty1] enabled
[ 0.119391] printk: console [ttyS0] enabled
[ 3.883073] systemd[1]: Starting Set the console keyboard layout...
[ 4.031904] systemd[1]: Finished Set the console keyboard layout.

Revision history for this message
Onni Rautanen (oizone) wrote :

Ran into this problem with VMware / OVA setup, took a really long time to figure out what the problem was. But now after realizing that the problem is the non-functional /dev/console, isn't thix problem clearly in the image creation?
Since the OVA doesn't have a serial port configured into the VM, the kernel console parameter shouldn't have ttyS0 in it. Right?
So fix is just to remove it?

Revision history for this message
Andrew Lee (andrewlee-microsoft) wrote :

I've also hit this (on VMware), and while I suspect that may be due to some particular funnies of the OS I am on, I agree a defensive fix in cloud-init to not raise an Exception if you can't write to /dev/console and fall back to writing to stdout would be welcomed.

Any objections if I raise a PR to do that?

Revision history for this message
James Falcon (falcojr) wrote :

Andrew, yes, I think that's a reasonable solution.

Revision history for this message
Tony Casanova (tony-casanova-nc-33-2017) wrote :

I just hit this today (Nutanix AHV) cloud-init also.

From "/var/log/cloud-init.log".

2022-03-16 16:25:18,764 - util.py[WARNING]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
2022-03-16 16:25:18,764 - util.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 371, in multi_log
    wfh.flush()
OSError: [Errno 5] Input/output error

# cat /lib/systemd/system/cloud-final.service | grep StandardOutput
StandardOutput=journal+console

# echo > /dev/console
bash: echo: write error: Input/output error

During handling of the above exception, another exception occurred:

OSError: [Errno 5] Input/output error

[root@ubuntu-template log]# dmesg | grep console
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-358.el8.x86_64 root=UUID=395b9844-e404-4857-afbb-c6edccaf72f3 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-358.el8.x86_64 root=UUID=395b9844-e404-4857-afbb-c6edccaf72f3 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
[ 0.001000] printk: console [ttyS0] enabled
[ 4.191049] bochs-drm 0000:00:02.0: vgaarb: deactivate vga console

Revision history for this message
Andrew Lee (andrewlee-microsoft) wrote :

I've fixed this in https://github.com/canonical/cloud-init/pull/1340/files (sadly blocked on getting the CLA signed)

I'm interested that you've hit this too @tony-casanova-nc-33-2017.

FWIW, I'm using:
- cloud-init 22.1
- I'm on a Microsoft OS called 'Mariner'
- I'm on VMware vSphere V7 -> I've only ever seen this bug on VMware.

Are you using similar versions?

I'm fairly sure there's a corresponding VMware bug here (causing the kernel to incorrectly set up the /dev/ttyS0 serial device).

James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Brett Holman (holmanb) wrote :

Subject: Fixed in cloud-init version 22.2.
Comment: This bug is believed to be fixed in cloud-init in version 22.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.