Ubuntu
linux package

bogus utime and stime in /proc/<PID/stat

Bug #1023214 reported by alecm3 on 2012-07-11

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Expired	High	Unassigned

Bug Description

I administer a large number of servers, and I have this problem only with Ubuntu 10.04 LTS: I run a server under normal load (say load average 3.0 on an 8-core server). The "top" command shows processes taking certain % of CPU that cause this load average: say

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11008 mysql 20 0 25.9g 22g 5496 S 67 76.0 643539:38 mysqld

ps -o pcpu,pid -p11008
%CPU PID
53.1 11008

everything is consistent.

The all of the sudden, the process causing the load average disappears from "top", but the process continues to run normally (albeit with a slight performance decrease), and the system load average becomes somewhat higher. The output of ps -o pcpu becomes bogus:

# ps -o pcpu,pid -p11008
%CPU PID
317910278 1587

This happened to at least 5 different severs (different brand new IBM System X hardware), each running different software: one httpd 2.2, one mysqld 5.1, and one Twisted Python TCP servers. Each time the kernel was between 2.6.32-32-server and 2.6.32-40-server. I updated some machines to 2.6.32-41-server, and it has not happened on those yet, but the bug is rare (once every 60 days or so).

This is from an affected machine:

Code:
top - 10:39:06 up 73 days, 17:57, 3 users, load average: 6.62, 5.60, 5.34
Tasks: 207 total, 2 running, 205 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.4%us, 18.0%sy, 0.0%ni, 66.3%id, 4.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 74341464k total, 71985004k used, 2356460k free, 236456k buffers
Swap: 3906552k total, 328k used, 3906224k free, 24838212k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
805 root 20 0 0 0 0 S 3 0.0 1493:09 fct0-worker
982 root 20 0 0 0 0 S 1 0.0 111:35.05 fioa-data-groom
914 root 20 0 0 0 0 S 0 0.0 884:42.71 fct1-worker
1068 root 20 0 19364 1496 1060 R 0 0.0 0:00.02 top

Nothing causing high load is showing on top, but I have two highly loaded mysqld instances on it, that suddenly show crazy %CPU:

#ps -o pcpu,pid,cmd -p1587
%CPU PID CMD
317713124 1587 /nail/encap/mysql-5.1.60/libexec/mysqld
and

#ps -o pcpu,pid,cmd -p1624
%CPU PID CMD
2802 1624 /nail/encap/mysql-5.1.60/libexec/mysqld

Here are the numbers from

# cat /proc/1587/stat
1587 (mysqld) S 1212 1088 1088 0 -1 4202752 14307313 0 162 0 85773299069 4611685932654088833 0 0 20 0 52 0 3549 27255418880 5483524 18446744073709551615 4194304 11111617 140733749236976 140733749235984 8858659 0 552967 4102 26345 18446744073709551615 0 0 17 5 0 0 0 0 0

The 14th and 15th numbers according to

man proc and fs/proc/array.c are supposed to be:

utime %lu Amount of time that this process has been scheduled in user mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK). This includes guest time, guest_time (time spent running a virtual CPU, see
below), so that applications that are not aware of the guest time field do not lose that time from
their calculations.

stime %lu Amount of time that this process has been scheduled in kernel mode, measured in clock ticks (divide by
sysconf(_SC_CLK_TCK).

On a normal server, these numbers 13 orders of magnitude smaller and they are advancing by about 10 per sec. On a buggy server, these numbers are stuck at a ridiculously high value like 4611685932654088833, and they are not changing.

I noticed that this bug does not happen on 2.6.24-26-server , and there has been many changes in fs/proc/array.c between 2.6.24-26-server and 2.6.32-32-server when this bug starting to occur.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-server 2.6.32.40.47
Regression: Yes
Reproducible: No
ProcVersionSignature: Ubuntu 2.6.32-40.87-server 2.6.32.57+drm33.23
Uname: Linux 2.6.32-40-server x86_64
NonfreeKernelModules: iomemory_vsl
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
Date: Tue Jul 10 17:33:10 2012
Frequency: Once a month.
InstallationMedia: Ubuntu-Server 10.04.2 LTS "Lucid Lynx" - Release amd64 (20110211.1)
MachineType: IBM System x3550 M3 -[7944AC1]-
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-40-server root=UUID=ef643bc9-46ac-474d-bb00-4e0bfee77808 ro quiet
ProcEnviron:
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: linux
dmi.bios.date: 02/08/2011
dmi.bios.vendor: IBM Corp.
dmi.bios.version: -[D6E150CUS-1.11]-
dmi.board.asset.tag: (none)
dmi.board.name: 69Y5698
dmi.board.vendor: IBM
dmi.board.version: (none)
dmi.chassis.asset.tag: none
dmi.chassis.type: 23
dmi.chassis.vendor: IBM
dmi.chassis.version: none
dmi.modalias: dmi:bvnIBMCorp.:bvr-[D6E150CUS-1.11]-:bd02/08/2011:svnIBM:pnSystemx3550M3-[7944AC1]-:pvr00:rvnIBM:rn69Y5698:rvr(none):cvnIBM:ct23:cvrnone:
dmi.product.name: System x3550 M3 -[7944AC1]-
dmi.product.version: 00
dmi.sys.vendor: IBM

See original description

Tags:

Revision history for this message

alecm3 (alecm-chatango) wrote on 2012-07-11:

BootDmesg.txt Edit (79.3 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt Edit (1.4 KiB, text/plain; charset="utf-8")
Dependencies.txt Edit (1.4 KiB, text/plain; charset="utf-8")
Lspci.txt Edit (62.1 KiB, text/plain; charset="utf-8")
Lsusb.txt Edit (484 bytes, text/plain; charset="utf-8")
ProcCpuinfo.txt Edit (6.6 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt Edit (6.0 KiB, text/plain; charset="utf-8")
ProcModules.txt Edit (2.3 KiB, text/plain; charset="utf-8")
UdevDb.txt Edit (94.3 KiB, text/plain; charset="utf-8")
UdevLog.txt Edit (219.8 KiB, text/plain; charset="utf-8")

alecm3 (alecm-chatango) on 2012-07-11

description:

updated

Brad Figg (brad-figg) on 2012-07-11

Changed in linux (Ubuntu):
status:	New → Confirmed

Joseph Salisbury (jsalisbury) on 2012-07-11

Changed in linux (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

alecm3 (alecm-chatango) wrote on 2012-07-11:

I am not sure why it's marked "medium" importance: it keeps happening on all our production servers. Unfortunately, this bug will cause us to switch distributions.

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2012-07-12:

@alecm3, do you have the option to test some kernels on a machine that exhibits this bug?

Changed in linux (Ubuntu):
importance:	Medium → High
tags:	added: kernel-da-key

Revision history for this message

Brad Figg (brad-figg) wrote on 2012-07-12:

@alecm3,

If you can:
1. The kernel you are running is not the most recent in -updates. Please update one of your systems and see if the problem is still happening.
2. If the problem still exists, please try one of our LTS backport packages. I'd suggest trying the Oneiric one and see if that makes a difference.

Revision history for this message

alecm3 (alecm-chatango) wrote on 2012-07-12:

I upgraded several machines to 2.6.32-41-server about a week ago, which seems to be the latest 10.04 LTS kernel. If the bug occurs on 2.6.32-41-server, I will add information immediately. I am not very optimistic about this upgrade, since the diff of fs/proc/array.c between 2.6.32-41-server and 2.6.32-40-server shows no changes, but there are substential diffs between 2.6.24-26-server and 2.6.32-32-server in that file, and that's the version when I first encountered the bug.

Revision history for this message

alecm3 (alecm-chatango) wrote on 2012-07-12:

@brad-figg
re: option 2
I could potentially try a backported kernel, do I need to use Oneiric or I can go with Pangolin, because Pangolin is LTS?
This upgrade would require some preparation and scheduling, I cannot do it immediately.

Revision history for this message

alecm3 (alecm-chatango) wrote on 2012-08-03:

I have a 2.6.32-40-server production machine that just entered this state after a month of normal operations. If anyone is interested, I can collect more data from it. So far,

# ps -o pid,pcpu,time -p3061
PID %CPU TIME
3061 14157 1184014533-20:50:00

# cat /proc/3061/stat
3061 (mysqld) S 2688 2682 1982 0 -1 4202752 6720120 0 9 0 72106386682 4611685975708793406 0 0 20 0 52 0 637627271 27166662656 5482170 18446744073709551615 4194304 11111617 140736788381792 140736788380800 8858659 0 552967 4102 26345 18446744073709551615 0 0 17 1 0 0 0 0 0

Revision history for this message

Stephen Croll (stephen-d-croll) wrote on 2013-02-07:

Download full text (5.3 KiB)

FYI: There are a pair of kernel commits that *might* solve this
issue, at least for newer kernels. The first can be found in
Linus's tree (as well as linux-next) and the second is currently
only in linux-next:

[scroll@scroll-kernel-vm linux-next]$ git remote -v
origin http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (fetch)
origin http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (push)

[scroll@scroll-kernel-vm linux-next]$ git log -n1 bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
commit bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
Author: Stanislaw Gruszka <email address hidden>
Date: Wed Aug 8 11:27:15 2012 +0200

sched: fix divide by zero at {thread_group,task}_times

    On architectures where cputime_t is 64 bit type, is possible to trigger
    divide by zero on do_div(temp, (__force u32) total) line, if total is a
    non zero number but has lower 32 bit's zeroed. Removing casting is not
    a good solution since some do_div() implementations do cast to u32
    internally.

This problem can be triggered in practice on very long lived processes:

      PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
       #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
       #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
       #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
       #3 [ffff880472a51cd0] die at ffffffff8100f26b
       #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
       #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
       #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
          [exception RIP: thread_group_times+0x56]
          RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
          RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
          RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
          RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
          R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
          R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
       #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
       #8 [ffff880472a51f40] sys_times at ffffffff81088524
       #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
          RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
          RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
          RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
          RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
          R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
          R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
          ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: <email address hidden>
    Signed-off-by: Stanislaw Gruszka <email address hidden>
    Signed-off-by: Peter Zijlstra <email address hidden>
    Link: http://<email address hidden>
    Signed-off-by: Thomas Gleixner <email address hidden>

[scroll@scroll-kernel-v...

FYI: There are a pair of kernel commits that *might* solve this
issue, at least for newer kernels.  The first can be found in
Linus's tree (as well as linux-next) and the second is currently
only in linux-next:

[scroll@scroll-kernel-vm linux-next]$ git remote -v
origin  http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (fetch)
origin  http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (push)

[scroll@scroll-kernel-vm linux-next]$ git log -n1 bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
commit bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Wed Aug 8 11:27:15 2012 +0200

sched: fix divide by zero at {thread_group,task}_times
    
    On architectures where cputime_t is 64 bit type, is possible to trigger
    divide by zero on do_div(temp, (__force u32) total) line, if total is a
    non zero number but has lower 32 bit's zeroed. Removing casting is not
    a good solution since some do_div() implementations do cast to u32
    internally.
    
    This problem can be triggered in practice on very long lived processes:
    
      PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
       #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
       #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
       #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
       #3 [ffff880472a51cd0] die at ffffffff8100f26b
       #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
       #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
       #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
          [exception RIP: thread_group_times+0x56]
          RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
          RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
          RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
          RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
          R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
          R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
       #8 [ffff880472a51f40] sys_times at ffffffff81088524
       #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
          RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
          RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
          RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
          RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
          R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
          R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
          ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[scroll@scroll-kernel-vm linux-next]$ git log -n1 62188451f0d63add7ad0cd2a1ae269d600c1663d 
commit 62188451f0d63add7ad0cd2a1ae269d600c1663d
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date:   Sat Jan 26 17:19:42 2013 +0100

cputime: Avoid multiplication overflow on utime scaling
    
    We scale stime, utime values based on rtime (sum_exec_runtime
    converted to jiffies). During scaling we multiple rtime * utime,
    which seems to be fine, since both values are converted to u64,
    but it's not.
    
    Let assume HZ is 1000 - 1ms tick. Process consist of 64 threads,
    run for 1 day, threads utilize 100% cpu on user space. Machine
    has 64 cpus.
    
    Process rtime = utime will be 64 * 24 * 60 * 60 * 1000 jiffies,
    which is 0x149970000. Multiplication rtime * utime result is
    0x1a855771100000000, which can not be covered in 64 bits.
    
    Result of overflow is stall of utime values visible in user
    space (prev_utime in kernel), even if application still consume
    lot of CPU time.
    
    A solution to solve this is to perform the multiplication on
    stime instead of utime. It's easy to grow the utime value fast
    with a CPU bound thread in userspace for example. Now we assume
    that doing so with stime is much harder. In most cases a task
    shouldn't ever spend much time in kernel space as it tends to
    sleep waiting for jobs completion when they take long to
    achieve. IO is the typical example of that.
    
    Hence scaling the cputime by performing the multiplication on
    stime instead of utime should considerably reduce the chances of
    an overflow on most workloads.
    
    This is largely inspired by a patch from Stanislaw Gruszka:
    http://lkml.kernel.org/r/20130107113144.GA7544@redhat.com
    
    Inspired-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Acked-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Link: http://lkml.kernel.org/r/1359217182-25184-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

Revision history for this message

alecm3 (alecm-chatango) wrote on 2013-08-18:

@Stephen Croll :Do you know if this patch has been incorporated into 12.04LTS? What about 10.04LTS?

Revision history for this message

penalvch (penalvch) wrote on 2013-12-30:

#10

alecm3, the patch mentioned by Stephen Croll is available in Precise+ as per http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-precise.git;a=commit;h=a041ea7c6c7b5ab0b4cfc2d0d50fbab9a9f9602c . Would it be possible to test this out in Precise server via http://releases.ubuntu.com/precise/ and advise to the results?

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-03-01:

#11

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status:	Incomplete → Expired

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntulinux package

bogus utime and stime in /proc/<PID/stat

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package