CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline after exercising frequency governors

Bug #926136 reported by Brendan Donegan
This bug affects 2 people
Affects          Status        Importance  Assigned to      Milestone
Checkbox         Fix Released  High        Brendan Donegan
linux (Ubuntu)   Invalid       Medium      Unassigned
Precise          Invalid       Medium      Unassigned

Bug Description

Running Precise Alpha2 on a Dell PowerEdge M610 I am finding that:

echo 0 > /sys/devices/system/cpu/cpu1/online

fails because the content of the 'online' file is already 0. This means 1 of the system's 16 cores is offline by default. I have not seen this in Oneiric so I assume it's not intended.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 926136

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Attached is the cpuinfo file from the system. While getting this I made the discovery that the /sys/devices/system/cpu/online file on this system contains:

0,2-15

This would explain why one CPU is offline.
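
The 'online' file uses the kernel's CPU list format (comma-separated numbers and ranges). A minimal sketch of expanding such a list so that a gap is easy to spot; expand_cpulist is a hypothetical helper name and is not part of checkbox:

```shell
#!/bin/bash
# expand_cpulist LIST
# Hypothetical helper: print one CPU number per line for a CPU list
# such as "0,2-15" (the format seen in /sys/devices/system/cpu/online).
expand_cpulist() {
    local list=$1 part
    local IFS=','
    for part in $list; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # a range like 2-15
            *)   echo "$part" ;;                     # a single CPU
        esac
    done
}
```

Called as expand_cpulist "$(cat /sys/devices/system/cpu/online)" on the affected machine, this would list 0 and then 2 through 15, which makes the missing CPU 1 obvious.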

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I can confirm that with an Oneiric install all 16 cores are online and the contents of /sys/devices/system/cpu/online is:

0-15

Jeff Lane  (bladernr)
summary: - CPU1 on Dell PowerEdge M610 is offline by default
+ CPU1 on Dell PowerEdge M610 and R715 is offline by default
Revision history for this message
Jeff Lane  (bladernr) wrote : Re: CPU1 on Dell PowerEdge M610 and R715 is offline by default

Just confirming Brendan's observations. This behaviour has been seen with Precise Alpha 2 on both the M610 and R715 systems. I have attached an apport report dump from one of the affected systems, the R715.

Unfortunately, Brendan tried apport-collect on the M610 and I tried on the R715 and apport-collect just hangs and doesn't seem to do anything. So I'm attaching the output of the following:

apport-cli --save 926136.report linux

Sorry for having to do it this way, but it's the only way I could see to get you the logs necessary.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.3 kernel[1]. This test will tell us if the bug is already fixed upstream.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc2-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key precise regression-release
Revision history for this message
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.2.0-13.22)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

 Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-13.22
tags: added: blocks-hwcert
Revision history for this message
Jeff Lane  (bladernr) wrote : Re: CPU1 on Dell PowerEdge M610 and R715 is offline by default

Brad:

ubuntu@ubuntu:~$ uname -a
Linux ubuntu 3.2.0-12-generic #21-Ubuntu SMP Tue Jan 31 18:48:57 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

there is no new kernel listed when I try to do an upgrade

The following NEW packages will be installed:
  libllvm3.0 libxcb-glx0
The following packages have been kept back:
  checkbox-cli
The following packages will be upgraded:
  accountsservice checkbox checkbox-certification checkbox-certification-server
  command-not-found debianutils ghostscript ghostscript-cups icedtea-6-jre-cacao
  icedtea-6-jre-jamvm initscripts iproute language-selector-common libaccountsservice0
  libgl1-mesa-dri libgl1-mesa-glx libglapi-mesa libgs9 libgs9-common makedev mountall
  openjdk-6-jre-headless resolvconf sysv-rc sysvinit-utils update-manager-core
26 upgraded, 2 newly installed, 0 to remove and 1 not upgraded.
Need to get 47.6 MB/48.5 MB of archives.
After this operation, 25.4 MB of additional disk space will be used.

How do I get access to the new dev kernel?

Revision history for this message
Jeff Lane  (bladernr) wrote :

Joseph:

ubuntu@ubuntu:/sys/devices/system/cpu$ uname -a; cat online
Linux ubuntu 3.3.0-030300rc2-generic #201201311735 SMP Tue Jan 31 22:36:50 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
0-15

I am running the current mainline kernel and all CPUs are online. So this, so far, has been seen in our 3.2.0-12.21 kernel (I don't know for sure how to get -22 installed; is there a separate dev PPA for that?)

Changed in linux (Ubuntu):
status: Incomplete → Opinion
status: Opinion → Confirmed
Revision history for this message
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.2.0-14.23)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

 Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-14.23
Jeff Lane  (bladernr)
tags: added: bot-stop-nagging
Revision history for this message
Dave Gilbert (ubuntu-treblig) wrote : Re: CPU1 on Dell PowerEdge M610 and R715 is offline by default

From the log messages:

[ 1537.714420] ADDRCONF(NETDEV_UP): eth3: link is not ready
[ 1905.620215] Broke affinity for irq 119
[ 1905.621488] CPU 1 is now offline
[ 1909.120015] mpt2sas 0000:05:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
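
A quick way to pull such events out of a saved log, assuming the message wording shown above ("CPU N is now offline"). This is only a sketch, and offlined_cpus is a hypothetical helper name:

```shell
#!/bin/bash
# offlined_cpus [FILE...]
# Print the number of every CPU that the kernel log reports as taken
# offline. Reads standard input when no file is given.
offlined_cpus() {
    sed -n 's/.*CPU \([0-9][0-9]*\) is now offline.*/\1/p' "$@"
}
```

For example, dmesg | offlined_cpus prints one CPU number per matching log line.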

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

It looks to me like the two events at 1905 are associated. I'm not sure what the relevance of IRQ 119 is.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Another test run done today seems to indicate that the CPU is not offline initially. Perhaps one of our tests is making it go offline (this wouldn't be intentional and might still be a bug)?

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

We also have an IBM X3500 M3 exhibiting the same. Common thread is still the 16 cores. CPU model doesn't seem to be the factor.

summary: - CPU1 on Dell PowerEdge M610 and R715 is offline by default
+ CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline
+ 'randomly'
Revision history for this message
Brendan Donegan (brendan-donegan) wrote : Re: CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline 'randomly'

I changed the title to further reflect the circumstances this happens in. I can't think of a better word than 'randomly' right now as the CPU is not being specifically requested to go offline and we haven't determined what the cause is yet.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Note that this means that the problem may not actually be gone in the upstream kernel and may be present in Oneiric. We haven't specifically tested these releases in the same way yet.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

This seems to be caused by running our cpu_frequency_governors test. This script exercises the frequency governors supported by different CPUs. I'm going to install Oneiric now and see if the same thing happens.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Confirming again that this is *not* the case with Oneiric, even if the frequency_governors test is run.

summary: - CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline
- 'randomly'
+ CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline after
+ excercising frequency governors
summary: CPU1 on Dell PowerEdge M610, R715 and IBM X3500 M3 goes offline after
- excercising frequency governors
+ exercising frequency governors
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Also not the case in the mainline kernel it seems.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

So, if I reboot the system then all the CPUs are online again. Furthermore if I rerun cpu_frequency_governors then CPU1 has not gone offline (all of this with the 3.2.0-12.21 kernel) after it finishes. This makes it difficult to determine if the problem is really gone in the mainline kernel. I would potentially need an ISO spun with the mainline kernel in it.

Ara Pulido (ara)
tags: removed: bot-stop-nagging
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Brendan,

So just to confirm, this issue does not happen in Oneiric. Were you able to test with the latest 3.3 kernel[0], or are you unable to run your cpu_frequency_governors test against the mainline kernel?

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc2-precise/

Changed in linux (Ubuntu Precise):
status: Incomplete → Confirmed
Revision history for this message
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.2.0-15.24)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

 Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-15.24
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Hi Joseph,

The problem I have is that if I reboot the system then the problem goes away, and it doesn't even recur when I run the script which triggers it. If I could get an ISO that uses the mainline kernel, or whichever kernel you want me to test with, then I could be sure whether or not it fixes the issue.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Brendan,

There is no iso for the mainline kernel that I know of. Only the .deb files, which are available at:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc2-precise/

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I can't reproduce this problem except on first boot, so I'm unable to install new kernels and test them. It's still present in Precise Beta 1 though.

Changed in linux (Ubuntu Precise):
status: Incomplete → Triaged
tags: added: kernel-key
Revision history for this message
Andy Whitcroft (apw) wrote :

Could we confirm whether this is still present with Beta-2 images? If so, could you get dmesg and cat /sys/devices/system/cpu/online for the failed boot? The dmesg should contain information on whether we ever brought the CPU up.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

You must mean Beta-1. Yeah, it still happens. Dmesg attached and the output of the cat command is:

ubuntu@ubuntu:~$ cat /sys/devices/system/cpu/online
0,2-15

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Please remember that as mentioned in comment #17 this happens after the cpu_frequency_governors test is run. This plays around with the governors used by the CPUs so this might not be getting handled properly.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I can get a copy of dmesg after frequency governors is run as well, but it should be similar to the one attached to the report.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Could you also point us to a copy of the cpu_frequency_governors test? Thanks.

Revision history for this message
Daniel Manrique (roadmr) wrote :

@Leann:

The test in question is checkbox's cpu_scaling_test. The job definition shows it runs as:

sudo nice -n -20 cpu_scaling_test

This would be located in /usr/share/checkbox/scripts on an installed system, or you can look at the latest version from trunk here:

http://bazaar.launchpad.net/~checkbox-dev/checkbox/trunk/view/head:/scripts/cpu_scaling_test

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Is any more information needed here?

Revision history for this message
Colin Ian King (colin-king) wrote :

@Brendan, can I get access to this machine so I can debug it further? Also, we haven't got anything like powernap running during these tests, have we?

Changed in linux (Ubuntu Precise):
assignee: nobody → Colin King (colin-king)
Revision history for this message
Jeff Lane  (bladernr) wrote :

Colin: Replied directly via email with connection info.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

@Colin, I will get this setup. Ping me in the morning to make sure I've done it.

Revision history for this message
Colin Ian King (colin-king) wrote :

So this bug isn't a kernel issue at all. The culprit is the following script:

http://bazaar.launchpad.net/~checkbox-dev/checkbox/trunk/view/head:/scripts/cpu_offlining

This offlines CPU1, and then tries to grep for cpu1 in /proc/interrupts for some reason that completely escapes me at the moment, and this grep fails and the test exits.

I've manually worked through all the CPUs and I can offline them and on-line them with no problem at all.

I hacked out the dodgy grep and I can run the script and I've observed the kernel messages stating each CPU is offline'd and on-line'd w/o any problems:

#!/bin/bash

echo "Beginning CPU Offlining Test"

# Turn CPU cores off (interrupt-table check commented out for this run)
for cpu_num in $(ls /sys/devices/system/cpu | grep -o 'cpu[0-9]*'); do
    if [ -f "/sys/devices/system/cpu/$cpu_num/online" ]; then
        echo "Offlining $cpu_num"
        echo 0 > "/sys/devices/system/cpu/$cpu_num/online"
        #grep -i -q $cpu_num /proc/interrupts

        #if [ $? == 0 ]; then
            #exit 1
        #fi
    fi
done

# Back on again
for cpu_num in $(ls /sys/devices/system/cpu | grep -o 'cpu[0-9]*'); do
    if [ -f "/sys/devices/system/cpu/$cpu_num/online" ]; then
        echo "Onlining $cpu_num"
        echo 1 > "/sys/devices/system/cpu/$cpu_num/online"
        #grep -i -q $cpu_num /proc/interrupts

        #if [ $? == 1 ]; then
            #exit 1
        #fi
    fi
done

/proc/interrupts displays the CPU number in upper case, not lower case. Just checked and it seems to have been doing this since Lucid or even way before that, so goodness knows how this script is meant to work correctly.

Revision history for this message
Colin Ian King (colin-king) wrote :

Just to say, this is one of those issues where bailing out of a situation early without restoring the state gets us in one big hole. The script offlines a CPU, then bails out. When detecting any error condition, code should try to restore back to the proper pre-test state where possible, e.g. on-lining the CPU. So, this needs fixing IMHO.
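
One way to get that behaviour is to record the state up front and restore it from an exit trap. The following is a minimal sketch, not the fix that actually landed in checkbox; save_cpu_state, restore_cpu_state and the CPU_DIR variable are hypothetical names:

```shell
#!/bin/bash
# Sketch only: save_cpu_state and restore_cpu_state are hypothetical
# helpers. CPU_DIR defaults to the real sysfs path but can be pointed at
# a scratch directory for a dry run.
CPU_DIR=${CPU_DIR:-/sys/devices/system/cpu}

# save_cpu_state STATE_FILE: record "<value> <path>" for each online file.
save_cpu_state() {
    local state_file=$1 f
    : > "$state_file"
    for f in "$CPU_DIR"/cpu[0-9]*/online; do
        [ -f "$f" ] || continue
        echo "$(cat "$f") $f" >> "$state_file"
    done
}

# restore_cpu_state STATE_FILE: write back only the values that changed.
restore_cpu_state() {
    local state_file=$1 want f
    while read -r want f; do
        [ "$(cat "$f")" = "$want" ] || echo "$want" > "$f"
    done < "$state_file"
}

# Typical use at the top of a test script:
#   state=$(mktemp)
#   save_cpu_state "$state"
#   trap 'restore_cpu_state "$state"' EXIT
```

Because the trap fires on any exit path, an early "exit 1" in the middle of the test no longer leaves a CPU offline.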

Revision history for this message
Colin Ian King (colin-king) wrote :

Back to comment #1: offlining an already offlined CPU gives you an invalid argument error. Likewise, onlining an already onlined CPU will do the same.

The semantics of the interface may have changed between releases, but I don't think we should class this as a bug since these interfaces do change over time. I will check to see if it has changed and the reasoning behind it.

Revision history for this message
Colin Ian King (colin-king) wrote :

My intuition is telling me that once we have the script fixed it won't be offlining CPU1 and bailing out, and hence we won't see this problem. Let's get the script sorted out and then re-test.

Incidentally, the semantics of /sys/devices/system/cpu/cpu*/online are such that offlining an already offlined CPU or onlining an already onlined CPU will result in an EINVAL, which makes sense.
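
Given those semantics, a script can avoid the error by reading the file before writing to it. A minimal sketch, assuming the behaviour described in this thread for the kernel under test (other kernel versions may differ); set_online_if_needed is a hypothetical helper name:

```shell
#!/bin/bash
# set_online_if_needed FILE VALUE
# Hypothetical helper: write VALUE (0 or 1) to a CPU's online file only
# when the file does not already contain it, so an already-offline CPU
# is never offlined again (and vice versa).
set_online_if_needed() {
    local file=$1 want=$2 current
    current=$(cat "$file") || return 1
    [ "$current" = "$want" ] && return 0   # nothing to do
    echo "$want" > "$file"
}
```

For example, set_online_if_needed /sys/devices/system/cpu/cpu1/online 0 is a no-op when CPU1 is already offline, instead of failing as the echo in the bug description did.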

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I'll fix up the script so that it:

a.) Avoids offlining CPUs which already are (and vice versa)
b.) Doesn't bail out and leave things in an indeterminate state

I *still* don't understand why this wouldn't impact every system, though; this bug is only being seen on some.

Changed in linux (Ubuntu Precise):
assignee: Colin King (colin-king) → Brendan Donegan (brendan-donegan)
assignee: Brendan Donegan (brendan-donegan) → nobody
Changed in checkbox:
assignee: nobody → Brendan Donegan (brendan-donegan)
importance: Undecided → High
Revision history for this message
Colin Ian King (colin-king) wrote :

@Brendan, ping me when you have some results, I will dig into this deeper if we trip the problem.

Revision history for this message
Colin Ian King (colin-king) wrote :

Just for the record, I think the test script needs some more bullet-proofing.

1. The echos to /sys/devices/system/cpu/$cpu_num/online should be followed by checks afterwards to make sure EINVAL errors aren't returned by the kernel. Just a simple bit of sanity checking can't harm.

2. It is worth reading out the state from /sys/devices/system/cpu/$cpu_num/online to ensure it contains what was written for a sanity check.
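
The two checks above could be combined along these lines. This is a sketch under the assumption that a failed write makes the echo return a non-zero status; write_and_verify is a hypothetical helper name:

```shell
#!/bin/bash
# write_and_verify FILE VALUE
# Hypothetical helper: write VALUE to FILE, fail if the write itself is
# rejected (e.g. EINVAL from the kernel), then read the file back and
# fail if it does not contain what was written.
write_and_verify() {
    local file=$1 want=$2 got
    if ! echo "$want" > "$file"; then
        echo "write of $want to $file failed" >&2
        return 1
    fi
    got=$(cat "$file")
    if [ "$got" != "$want" ]; then
        echo "$file contains '$got', expected '$want'" >&2
        return 1
    fi
}
```

A test script would then call, for example, write_and_verify "/sys/devices/system/cpu/$cpu_num/online" 0 and treat a non-zero return as a test failure after restoring state.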

Revision history for this message
Colin Ian King (colin-king) wrote :

@Brendan the reason why it fails on machines with 10 or more CPUs is because one is grep'ing for cpu1 which matches on CPU10, CPU11, CPU12, CPU13, CPU14, CPU15 on the 16 CPU machine. So the script is broken.

It looks like /proc/interrupts CPU* headed columns end with trailing spaces, so we could bodge around this using:

grep -i -q "$cpu_num " /proc/interrupts

note the trailing space after $cpu_num.

But the /proc/interrupts format may change and remove those trailing spaces sometime in the future. So perhaps the line should be something like:

cat /proc/interrupts | grep CPU | tr '\n' ' ' | grep -i -q "$cpu_num "

..just in case one day they remove the trailing spaces after the last CPU number.
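
To make the difference concrete, here is a small sketch of that idea wrapped in a function that can also be pointed at a saved copy of the file. header_lists_cpu is a hypothetical helper name, and the optional FILE argument is an addition for dry runs:

```shell
#!/bin/bash
# header_lists_cpu CPU_NAME [FILE]
# Hypothetical helper: succeed only if CPU_NAME (e.g. cpu1) appears as a
# complete column name in the CPU header of FILE (default
# /proc/interrupts). Newlines become spaces first, so the last column is
# always followed by a space even if the kernel stops padding it.
header_lists_cpu() {
    local cpu=$1 file=${2:-/proc/interrupts}
    grep CPU "$file" | tr '\n' ' ' | grep -i -q "$cpu "
}
```

With a header that lists CPU0, CPU2, CPU10 and CPU11 but not CPU1, a bare grep -i -q cpu1 still succeeds because of CPU10 and CPU11, whereas header_lists_cpu cpu1 fails as intended.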

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

@Colin,

Yes, I had this same thought yesterday. To make the grep do what it's supposed to do I can just use the '-w' option, meaning it will only match on word boundaries, so CPU1 won't match CPU11. Along with the robustness enhancements suggested, I'll be making these changes and resubmitting.

The whole 'working on Oneiric but not Precise' thing really threw me (the grep should have been failing in both), but it seems like fixing the script is going to get rid of the problem.
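
A minimal sketch of the '-w' approach, for comparison with the original check. cpu_in_interrupts is a hypothetical helper name and the optional FILE argument is an addition so the check can be tried against a saved copy; the script as actually fixed in checkbox may look different:

```shell
#!/bin/bash
# cpu_in_interrupts CPU_NAME [FILE]
# Hypothetical helper: case-insensitive, whole-word match of CPU_NAME
# (e.g. cpu1) in FILE (default /proc/interrupts). With -w, "cpu1" no
# longer matches inside "CPU10" or "CPU11".
cpu_in_interrupts() {
    local cpu=$1 file=${2:-/proc/interrupts}
    grep -i -q -w "$cpu" "$file"
}
```

After offlining cpu1, cpu_in_interrupts cpu1 should then fail on a 16-CPU machine, because the remaining CPU10 to CPU15 columns are no longer counted as matches.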

tags: removed: kernel-key
tags: removed: blocks-hwcert
Revision history for this message
Colin Ian King (colin-king) wrote :

@Brendan, do the fixes to the script resolve this problem?

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

@Colin,

Yes, this is fixed now.

Changed in checkbox:
status: New → Fix Released
Changed in linux (Ubuntu):
status: Triaged → Invalid
Changed in linux (Ubuntu Precise):
status: Triaged → Invalid