System shuts down due to CPU temp exceeding critical threshold (100C)

Bug #953205 reported by Tony Espy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Colin Ian King
Milestone: (none listed)

Bug Description

My 12.04 system has shut down automatically twice over the last week. All of a sudden I'm logged out and presented with the shutdown splash screen, and then it powers off.

My system is a Lenovo ThinkPad T410s with Intel graphics and an SSD as its main drive. It has the following CPU:

Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz

I poked into my kern.log and found a bunch of temperature-related messages logged over an hour or so; the last two are here:

Mar 7 17:59:23 alien kernel: [124688.525648] intel ips 0000:00:1f.6: MCP limit exceeded: Avg temp 9370, limit 9000
Mar 7 17:59:28 alien kernel: [124693.516730] intel ips 0000:00:1f.6: MCP limit exceeded: Avg temp 9165, limit 9000

I then see a jump in the log time-stamp of ~15min.

With the second occurrence in the log, I actually see a shutdown message logged:

Mar 9 14:02:02 alien kernel: [24690.341160] Critical temperature reached (100 C), shutting down.
Mar 9 14:02:02 alien kernel: [24690.347677] Critical temperature reached (100 C), shutting down.
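
For anyone triaging a similar log, here is a minimal sketch that pulls both kinds of message out of a kernel log. The function name `thermal_events` and the output format are made up for illustration. It also assumes the "Avg temp" and "limit" figures are hundredths of a degree Celsius; that is a guess, though it would be consistent with the "MCP temp limit 90" line from the intel ips driver quoted in a later comment.

```shell
# Hypothetical helper: summarise thermal events from a kernel log.
# ASSUMPTION: "Avg temp 9370, limit 9000" means 93.70 C against a 90.00 C limit.
thermal_events() {
    log="${1:-/var/log/kern.log}"
    awk '
        /MCP limit exceeded/ {
            # The average is the field after "temp"; the limit is the last field.
            for (i = 1; i < NF; i++) if ($i == "temp") avg = $(i + 1)
            sub(/,$/, "", avg)
            printf "%s %s %s over-limit avg=%.2fC limit=%.2fC\n", $1, $2, $3, avg / 100, $NF / 100
        }
        /Critical temperature reached/ {
            printf "%s %s %s critical-shutdown\n", $1, $2, $3
        }
    ' "$log"
}
```

Calling `thermal_events /var/log/kern.1.log` would list the events from a rotated log, one line per event with the syslog timestamp in front.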

I'm running 12.04, which I upgraded to after Beta1 was released. The system is as up-to-date as possible.

Kernel:

Linux version 3.2.0-18-generic-pae (buildd@rothera) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu2) ) #28-Ubuntu SMP Fri Mar 2 22:11:12 UTC 2012 (Ubuntu 3.2.0-18.28-generic-pae 3.2.9)
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 1.94.1-0ubuntu2
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: CONEXANT Analog [CONEXANT Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: espy 2014 F.... pulseaudio
 /dev/snd/pcmC0D0p: espy 2014 F...m pulseaudio
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf2520000 irq 45'
   Mixer name : 'Intel IbexPeak HDMI'
   Components : 'HDA:14f15069,17aa21a4,00100302 HDA:80862804,17aa21b5,00100000'
   Controls : 26
   Simple ctrls : 8
Card29.Amixer.info:
 Card hw:29 'ThinkPadEC'/'ThinkPad Console Audio Control at EC reg 0x30, fw 6UHT29WW-1.10'
   Mixer name : 'ThinkPad EC 6UHT29WW-1.10'
   Components : ''
   Controls : 1
   Simple ctrls : 1
Card29.Amixer.values:
 Simple mixer control 'Console',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
DistroRelease: Ubuntu 12.04
EcryptfsInUse: Yes
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
MachineType: LENOVO 2901CTO
Package: linux (not installed)
ProcEnviron:
 TERM=xterm
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-18-generic-pae root=UUID=b154a7ff-6a2a-4dae-804d-0e17319a3cec ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 3.2.0-18.29-generic-pae 3.2.9
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-18-generic-pae N/A
 linux-backports-modules-3.2.0-18-generic-pae N/A
 linux-firmware 1.71
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
StagingDrivers: mei
Tags: precise staging
Uname: Linux 3.2.0-18-generic-pae i686
UpgradeStatus: Upgraded to precise on 2012-03-03 (8 days ago)
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare
dmi.bios.date: 06/07/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6UET38WW (1.16 )
dmi.board.name: 2901CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6UET38WW(1.16):bd06/07/2010:svnLENOVO:pn2901CTO:pvrThinkPadT410s:rvnLENOVO:rn2901CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 2901CTO
dmi.product.version: ThinkPad T410s
dmi.sys.vendor: LENOVO

Revision history for this message
Tony Espy (awe) attached the following apport information files (comments #1 to #20):

AcpiTables.txt, AlsaDevices.txt, AplayDevices.txt, BootDmesg.txt, CRDA.txt, Card0.Amixer.values.txt, Card0.Codecs.codec.0.txt, Card0.Codecs.codec.3.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lsusb.txt, PciMultimedia.txt, ProcCpuinfo.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt

tags: added: apport-collected precise staging
description: updated

Revision history for this message
Tony Espy (awe) wrote :

Added kern.1.log, as it's the file with the relevant log messages.

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Tony Espy (awe) wrote :

There's a good chance this is a duplicate of bug #751689.

description: updated
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Tony,

This could in fact be a duplicate of bug 751689. Do you know if this just started happening to your system after upgrading to Precise? Did you have over-heating issues in prior releases?

Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Colin Ian King (colin-king) wrote :

Tony, so a couple of things:

Can you boot with an Oneiric kernel, see whether these issues recur, and let me know? I first want to separate the kernel from the other changes in the upgrade.

Can you also run:

sudo add-apt-repository ppa:colin-king/powermanagement
sudo apt-get update && sudo apt-get install tp-thermstat

and run it in the background so we get a snapshot of what's going on before the machine hangs. Run it as follows:

sudo th-thermstat 1 > thermstat.log

It will produce a lot of data over a day, so beware. Once you hit an overheating situation, reboot, then compress thermstat.log and attach it to this bug so I can see what's going on with the CPU, fans and thermal settings.

Thanks.

Revision history for this message
Colin Ian King (colin-king) wrote :

Oops, typo, should be:

sudo tp-thermstat 1 > thermstat.log &
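
Where tp-thermstat isn't installed, a rough snapshot of the same kind of data can be read from sysfs. This is only a sketch and not a replacement for tp-thermstat, whose output format isn't shown in this bug. The function name `read_thermal_zones` is hypothetical, and it assumes the kernel exposes thermal zones under /sys/class/thermal with temperatures in millidegrees Celsius.

```shell
# Hypothetical helper: print "<zone> <type> <temperature in C>" per thermal zone.
# ASSUMPTION: zones live under /sys/class/thermal and report millidegrees C.
read_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip the unexpanded glob (no zones) and unreadable entries.
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        millic=$(cat "$zone/temp")
        awk -v z="${zone##*/}" -v t="$ztype" -v m="$millic" \
            'BEGIN { printf "%s %s %.1f\n", z, t, m / 1000 }'
    done
}
```

Running it in a loop, for example `while sleep 1; do date; read_thermal_zones; done > zones.log &`, gives a once-per-second trace that can be inspected after a shutdown.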

Changed in linux (Ubuntu):
assignee: nobody → Colin King (colin-king)
status: Confirmed → Incomplete
Revision history for this message
Colin Ian King (colin-king) wrote :

@Tony, since the tp-thermstat gathers data on running processes, you may want to email the data to me rather than put it in a public bug.

Revision history for this message
Tony Espy (awe) wrote :

@Joe, guess we think alike ( see comment #22 ), and to answer your question, yes it started happening after I upgraded a few days after Beta1.

@Colin, re: comment #24:

 * the problem has only happened twice, and I cannot reproduce on demand, so not sure how I'll be able to do the oneiric kernel experiment easily

 * also to be clear, in both cases the system cleanly shut down ( including the Splash screen ), it didn't hang.

 * re: the other instructions -> ACK

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Tony,

I've applied a patch that may help make passive cooling (CPU frequency scaling) work better. It seems that CPU frequencies may not all be scaled down together which could account for some overheating.
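
To check whether the CPUs are scaling together, a quick snapshot of the per-CPU frequencies can help. The sketch below is not part of the patch; the function names `cpu_freqs` and `cpus_in_step` are made up, and it assumes the cpufreq sysfs files (`scaling_cur_freq`, in kHz) are present on the machine.

```shell
# Hypothetical helper: print "<cpuN> <current frequency in kHz>" for each CPU.
# ASSUMPTION: cpufreq is active and exposes scaling_cur_freq under sysfs.
cpu_freqs() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}      # strip the base directory
        cpu=${cpu%%/*}         # keep only the "cpuN" component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

# Succeeds (exit status 0) when every CPU reports the same frequency.
cpus_in_step() {
    cpu_freqs "$1" | awk '{ seen[$2] = 1 } END { for (k in seen) n++; exit (n > 1) }'
}
```

This is an instantaneous reading, so differing frequencies are only interesting while the machine is hot and passive cooling should be holding every CPU down.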

Can you download and install the kernel .debs found in http://zinc.canonical.com/~cking/hot-tp/

Would you mind trying these out, seeing whether you can still overheat your machine with this possible fix, and reporting back? Thanks!

Revision history for this message
GreyGeek (greygeek77) wrote :

This problem locked up the screen and keyboard on my Acer 7739-6830 while I was running Minecraft 1.2.4 this morning. Different apps were running when the lockup occurred on other occasions: Firefox 11.0, Chromium-Browser, and a couple of others, so the problem seems to be app-independent.

There is a similar bug report here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/636045
which is related to this problem and has been around since August of 2010.

dmesg | grep 'intel ips'
[ 22.342927] intel ips 0000:00:1f.6: CPU TDP doesn't match expected value (found 25, expected 29)
[ 22.343191] intel ips 0000:00:1f.6: PCI INT A -> GSI 21 (level, low) -> IRQ 21
[ 22.347192] intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 90

Revision history for this message
Tony Espy (awe) wrote :

@Colin

I'm running 32-bit with the PAE kernel, so I can't install the amd64 debs. If you could spin me a PAE kernel, I can give it a try tomorrow.

Revision history for this message
Colin Ian King (colin-king) wrote : Re: [Bug 953205] Re: System shuts down due to CPU temp exceeding critical thresh-hold (100C)

On 26/03/12 21:30, Tony Espy wrote:
> @Colin
>
> I'm running 32-bit with the PAE kernel, so I can't install the amd64
> debs. If you could spin me a PAE kernel, I can give it a try tomorrow.
>
@Tony, a 32-bit PAE kernel is now ready for your testing.

Revision history for this message
Tony Espy (awe) wrote :

@Colin

OK, everything installed, and tp-thermstat running...

I had a bit of a scare, as unity failed to start on my first reboot after installing your kernel, but a subsequent reboot worked OK.

I'll try and run some heavy load tests later today and let you know what happens.

Revision history for this message
Tony Espy (awe) wrote :

@Colin

I was able to trigger the shutdown again using your kernel. I had two monitors running, with a bazillion chromium tabs, mumble, ... plus a kernel build ( using debuild ).

What's curious is that the last entry in the thermstat.log was only 92.0C.

The mod time of the log was 14:02. The kernel shutdown messages are below:

Mar 28 14:02:35 alien kernel: [24578.255932] Critical temperature reached (100 C), shutting down.
Mar 28 14:02:36 alien kernel: [24578.265330] Critical temperature reached (100 C), shutting down.

If you want, I can attach the whole kernel log, but I figure the thermstat.log and the messages are enough for now.

Revision history for this message
Tony Espy (awe) wrote :

The current theory according to Colin is that the CPU scaling is working properly, and that the fundamental problem lies with the Embedded Controller not being able to respond quickly enough with fan adjustments when the CPU temperature spikes.

So, per Colin's instructions, I manually set the fan ( see [1] for instructions ) to "disengaged" mode, which runs the fan faster than the automatic "maximum" setting ( level == 7 ). The maximum auto setting corresponds to ~4.5k rpm, whereas "disengaged" allows the fan to spin at ~6.5k rpm.
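
The instructions linked as [1] aren't reproduced in this bug, so the following is only a sketch of one way the fan level can be set on a ThinkPad, and not necessarily the method used here. It assumes the thinkpad_acpi module is loaded with its `fan_control=1` option so that /proc/acpi/ibm/fan accepts writes; the function name `set_fan_level` is hypothetical.

```shell
# Hypothetical helper: write a fan level to the thinkpad_acpi procfs file.
# ASSUMPTION: thinkpad_acpi was loaded with fan_control=1; must run as root.
set_fan_level() {
    level="$1"
    fan_file="${2:-/proc/acpi/ibm/fan}"
    case "$level" in
        [0-7]|auto|disengaged|full-speed) ;;
        *) echo "unsupported fan level: $level" >&2; return 1 ;;
    esac
    echo "level $level" > "$fan_file"
}
```

From a root shell, `set_fan_level disengaged` would request the mode described above, and `set_fan_level auto` hands control back to the Embedded Controller afterwards.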

That said, with my fan set to "disengaged" I tried running a full kernel build ( 12.04 kernel, pulled using apt-get source, and built using vanilla 'debuild' ), and after an hour or so my machine again hit the critical temperature and shut down. Note that I was purposely trying to max out the machine, so I had my usual set of programs running ( 15-20 chromium-browser tabs, thunderbird, eclipse, and xchat-gnome ).

In short, this test has shown that the machine can't cope with an extreme load even when the fan is cranked up all the way.

I may try contacting Lenovo support next...

Also per agreement with Colin, I'm changing the Status of this bug to "Confirmed".

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Tony Espy (awe) wrote :

Sigh, my extended warranty ran out on August 8. ;(-

That said, there are *many* reports of heat-related shutdowns for the 410s running Windows as well. Sigh... shouldn't have listened to Jerone when I asked for a machine recommendation.

Revision history for this message
Colin Ian King (colin-king) wrote :

On 23/08/12 19:32, Tony Espy wrote:
> Sigh, my extended warranty ran out on August 8. ;(-
>
> That said, there are *many* reports of heat-related shutdowns for the
> 410s running Windows as well. Sigh... shouldn't have listened to
> Jerone when I asked for a machine recommendation.
>

Sorry to hear that :-(

I wonder if there are any BIOS fixes to address this..

Colin

Revision history for this message
Tony Espy (awe) wrote :

Turns out I was able to get an exception and a tech is on the way to my house within the hour.

Revision history for this message
Tony Espy (awe) wrote :

A new motherboard complete with new fan seems to have fixed the problem as I'm now able to run a full kernel build without my machine shutting down.

I'm changing the status to Invalid.

Changed in linux (Ubuntu):
status: Confirmed → Invalid