xen_netfront devices unresponsive after hibernation/resume

Bug #1864041 reported by Francis Ginther
This bug affects 1 person
Affects                      Status         Importance   Assigned to   Milestone
ec2-hibinit-agent (Ubuntu)   Fix Released   Undecided    Unassigned
  Xenial                     Invalid        Undecided    Unassigned
  Bionic                     Fix Released   Undecided    Unassigned
  Eoan                       Fix Released   Undecided    Unassigned
  Focal                      Fix Released   Undecided    Unassigned
linux-aws (Ubuntu)           New            Undecided    Unassigned
  Xenial                     New            Undecided    Unassigned
  Bionic                     New            Undecided    Unassigned
  Eoan                       Won't Fix      Undecided    Unassigned
  Focal                      New            Undecided    Unassigned

Bug Description

[Impact]

The xen_netfront device is sometimes unresponsive after a hibernate and resume event. This is limited to the c4, c5, m4, m5, r4, and r5 instance families, all of which are Xen-based and support hibernation.

When the issue occurs, the instance is inaccessible without a full restart. Debugging by running a process which outputs regularly to the serial console shows that the instance is still running.

[Test Case]

1) Launch a c4, c5, m4, m5, r4, or r5 instance type with a 5.0 or 5.3 kernel and with on-demand hibernation support enabled.
2) Start a long-running process which generates messages to the serial console.
3) Begin observing these messages on the console (using the AWS UI or CLI to grab a screenshot).
4) Suspend and resume the instance, continuing to refresh the console screenshot.
5) The screenshot should continue to show updates even if SSH access is no longer working.
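
Step 2 of the test case can be covered by a small heartbeat loop. The following is only a sketch: the function name heartbeat and its three arguments (target file, message count, interval) are hypothetical and do not come from the original test setup.

```shell
#!/bin/sh
# Hypothetical helper: append a numbered heartbeat line to a console device.
# Arguments (all illustrative, not from the bug report):
#   $1  file or device to write to (default: /dev/console)
#   $2  number of messages to emit; 0 means run until interrupted
#   $3  seconds to wait between messages (default: 5)
heartbeat() {
    console="${1:-/dev/console}"
    count="${2:-0}"
    interval="${3:-5}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        # A counter plus a UTC timestamp makes it obvious whether the
        # console output is still advancing after resume.
        echo "heartbeat $i $(date -u +%H:%M:%S)" >> "$console"
        sleep "$interval"
    done
}
```

On the instance this could be started as root with `heartbeat /dev/console 0 5 &`. From a workstation, `aws ec2 get-console-screenshot --instance-id <id>` retrieves the screenshot, and `aws ec2 stop-instances --instance-ids <id> --hibernate` followed by `aws ec2 start-instances --instance-ids <id>` performs the hibernate and resume cycle; `<id>` is a placeholder for the instance ID.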

[Regression Potential]

The workaround in ec2-hibinit-agent reloads the xen_netfront kernel module before restarting systemd-networkd. If the kernel module has been removed (for example when hitting LP: #1615381), the module reload fails and the instance cannot restore network connections. This is expected to be a very rare situation, and the module reload is the best workaround the Kernel Team found to mitigate the original issue.

The workaround also adds a 2-second delay before reloading the module to let things settle a bit after resuming. The 2 seconds is very short compared to the overall time needed to resume an instance.
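
The sequence described in this section (a short delay, a module reload, then a systemd-networkd restart) can be sketched as a shell function. This is an illustration only, not the actual debian/hibinit-resume script. RUN and SETTLE_SECONDS are hypothetical variables introduced here so the sequence can be printed instead of executed, and stopping at the first failed step is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch of the resume-time workaround; NOT the real debian/hibinit-resume.
# RUN is a hypothetical dry-run hook: with RUN=echo the commands are only
# printed. SETTLE_SECONDS is the delay before the reload (2 in the fix).
RUN="${RUN:-}"
SETTLE_SECONDS="${SETTLE_SECONDS:-2}"

reload_netfront() {
    # Let the system settle after resume before touching the driver.
    $RUN sleep "$SETTLE_SECONDS" || return 1
    # Unload and reload the driver. If the module cannot be found
    # (the regression case described above), this fails and we stop here.
    $RUN modprobe -r xen_netfront || return 1
    $RUN modprobe xen_netfront || return 1
    # Restart networking so it picks up the freshly probed device.
    $RUN systemctl restart systemd-networkd
}
```

Running `RUN=echo; reload_netfront` prints the four commands in order, which is a quick way to review the sequence without affecting the network on a live instance.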

[Original Bug Text]

The xen_netfront device is sometimes unresponsive after a hibernate and resume event. This is limited to the c4, c5, m4, m5, r4, and r5 instance families, all of which are Xen-based and support hibernation.

When the issue occurs, the instance is inaccessible without a full restart. Debugging by running a process which outputs regularly to the serial console shows that the instance is still running.

A workaround is to build the xen_netfront module separately and reload the module and restart networking during the resume handler. For example:

modprobe -r xen_netfront
modprobe xen_netfront
systemctl restart systemd-networkd

With this workaround in place, the unresponsive issue is no longer observed.
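
The modprobe commands only work when xen_netfront is a loadable module rather than compiled into the kernel. A minimal sketch of a check follows, assuming the kernel config is installed as /boot/config-<release> (the usual Ubuntu location) and that the driver is controlled by the CONFIG_XEN_NETDEV_FRONTEND option. The function name netfront_build_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report how xen_netfront was built for a kernel.
# $1 is the path to a kernel config file; by default the config of the
# running kernel in /boot is used (an assumption about the image layout).
netfront_build_mode() {
    config="${1:-/boot/config-$(uname -r)}"
    case "$(grep '^CONFIG_XEN_NETDEV_FRONTEND=' "$config" 2>/dev/null)" in
        *=m) echo module ;;   # loadable: modprobe -r / modprobe can work
        *=y) echo builtin ;;  # compiled in: the driver cannot be unloaded
        *)   echo absent ;;   # option not set, or config file unreadable
    esac
}
```

If this prints `builtin`, the reload step has nothing to unload, which matches the need to build the module separately before applying the workaround.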

To reproduce this problem:

1) Launch a c4, c5, m4, m5, r4, or r5 instance type with a 5.0 or 5.3 kernel and with on-demand hibernation support enabled.
2) Start a long-running process which generates messages to the serial console.
3) Begin observing these messages on the console (using the AWS UI or CLI to grab a screenshot).
4) Suspend and resume the instance, continuing to refresh the console screenshot.
5) The screenshot should continue to show updates even if SSH access is no longer working.
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-0edf3b95e26a682df
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-west-2a
Ec2InstanceType: m4.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
Package: linux-aws 4.15.0.1058.59
PackageArchitecture: amd64
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 5.0.0-1025.28-aws 5.0.21
Tags: bionic ec2-images
Uname: Linux 5.0.0-1025-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy lxd netdev plugdev sudo video
_MarkForUpload: True

description: updated
Francis Ginther (fginther) wrote : Dependencies.txt

apport information

tags: added: apport-collected bionic ec2-images
description: updated
Francis Ginther (fginther) wrote : ProcCpuinfoMinimal.txt

apport information

Balint Reczey (rbalint) wrote :

I'm ready to put the workaround into ec2-hibinit-agent, but it would be much better if the kernel could be fixed.

Changed in ec2-hibinit-agent (Ubuntu Focal):
status: New → In Progress
Balint Reczey (rbalint) wrote :

In the proposed workaround there is a 15s sleep before the module reload. Is it needed, and if so, is this the lowest safe value?

Francis Ginther (fginther) wrote :

@rbalint,

I'm going to revisit the need for this 15s timeout through additional testing. This was observed to avoid crashes in our user space memory consumption app, but the overcommit change should alleviate the need for this.

Francis Ginther (fginther) wrote :

@rbalint,

Testing has indicated that a sleep of at least 1 second is needed; otherwise I see failures to restart the network on some instance types.

Balint Reczey (rbalint) wrote :

@fginther thanks, I'll keep a 2-second sleep to stay on the safe side.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ec2-hibinit-agent - 1.0.0-0ubuntu8

---------------
ec2-hibinit-agent (1.0.0-0ubuntu8) focal; urgency=medium

  * debian/hibinit-resume: Add extra steps around swapoff to avoid OOM errors.
    Also work around xen-netfront not resuming properly.
    Thanks to Francis Ginther for the initial patch (LP: #1863242, #1864041)

 -- Balint Reczey <email address hidden> Thu, 12 Mar 2020 14:05:06 +0100

Changed in ec2-hibinit-agent (Ubuntu Focal):
status: In Progress → Fix Released
Balint Reczey (rbalint)
description: updated
Balint Reczey (rbalint)
description: updated
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Francis, or anyone else affected,

Accepted ec2-hibinit-agent into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ec2-hibinit-agent/1.0.0-0ubuntu7.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ec2-hibinit-agent (Ubuntu Eoan):
status: New → Fix Committed
tags: added: verification-needed verification-needed-eoan
Changed in ec2-hibinit-agent (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed-bionic
Łukasz Zemczak (sil2100) wrote :

Hello Francis, or anyone else affected,

Accepted ec2-hibinit-agent into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ec2-hibinit-agent/1.0.0-0ubuntu4~18.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Francis Ginther (fginther) wrote :

I've completed bionic testing with 500+ runs and no issues. Setting "verification-done-bionic".

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ec2-hibinit-agent - 1.0.0-0ubuntu4~18.04.4

---------------
ec2-hibinit-agent (1.0.0-0ubuntu4~18.04.4) bionic; urgency=medium

  * debian/hibinit-resume: Add extra steps around swapoff to avoid OOM errors.
    Also work around xen-netfront not resuming properly.
    Thanks to Francis Ginther for the initial patch (LP: #1863242, #1864041)

 -- Balint Reczey <email address hidden> Mon, 23 Mar 2020 13:03:38 +0100

Changed in ec2-hibinit-agent (Ubuntu Bionic):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for ec2-hibinit-agent has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: added: id-5e459f823f8a2435d44842eb
Francis Ginther (fginther) wrote :

I've done additional testing on eoan with this ec2-hibinit-agent and seen improved test results. Setting to `verification-done-eoan`.

tags: added: verification-done-eoan
removed: verification-needed-eoan
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ec2-hibinit-agent - 1.0.0-0ubuntu7.1

---------------
ec2-hibinit-agent (1.0.0-0ubuntu7.1) eoan; urgency=medium

  * debian/hibinit-resume: Add extra steps around swapoff to avoid OOM errors.
    Also work around xen-netfront not resuming properly.
    Thanks to Francis Ginther for the initial patch (LP: #1863242, #1864041)

 -- Balint Reczey <email address hidden> Mon, 23 Mar 2020 13:03:38 +0100

Changed in ec2-hibinit-agent (Ubuntu Eoan):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release.

Changed in linux-aws (Ubuntu Eoan):
status: New → Won't Fix
Balint Reczey (rbalint) wrote :

@fginther Xenial does not need the backport, right?

Changed in ec2-hibinit-agent (Ubuntu Xenial):
status: New → Incomplete
Francis Ginther (fginther) wrote :

@rbalint, we have not seen the same issue with the 4.15 linux-aws-hwe kernels used in xenial. At this time, the backport to xenial is not necessary.

Balint Reczey (rbalint)
Changed in ec2-hibinit-agent (Ubuntu Xenial):
status: Incomplete → Invalid