High load average, disk read, no apparent reason - 2.6.28-11

Bug #367377 reported by Bobbi Manners
This bug affects 5 people
Affects: xserver-xorg-video-intel (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-generic

I am not at all sure this is a kernel bug, but I don't know what else to file it against. Perhaps someone who understands the VM subsystem could look at my logs and provide some indication of what may be the culprit here.

Symptoms:

At apparently random times when I am working, the system will become extremely slow. This frequently occurs on starting a new application, which may be Firefox, Open Office or Okular for example. The load average spikes to a large value (greater than 10) and the hard drive light stays lit, showing continuous disk I/O. The user interface becomes unusable and the mouse pointer updates very slowly (several seconds of lag). More often than not the system becomes so unresponsive that I cannot change to a text VT and kill X, or even use ctrl-alt-backspace. I have to resort to Magic-SysRq-RSEIUB and reboot.

This is happening several times a day on a system with 2GB of RAM under a light workload of KDE4.2, Thunderbird, Firefox and sometimes OpenOffice or Netbeans. At the time of onset the memory usage as shown by 'htop' is around 400MB used of 2GB physical (this excludes 'cached'). It is impossible to predict just what action may start a frenzy of disk activity and a loss of responsiveness. Sometimes even opening a page in an existing Firefox instance is sufficient to effectively DOS the system. **IMPACT IS SEVERE**

I am currently running up-to-date Jaunty with kernel 2.6.28-11, but I have had similar behaviour with Intrepid in the past. I believe that upgrading this machine from 1GB to 2GB has made matters worse. I have no swap device configured, but I have tried configuring swap in the past using a swapfile, and this has not prevented the problem from occurring.

My confusion is about what could be causing the disk I/O that is bogging down the system and driving the load average up as processes wait on I/O. It cannot be swap in the sense of using a swap partition or file, since I have none configured. I don't have any reason to believe it is application I/O either - the problem occurs on starting many different apps, and appears to be triggered by requesting more memory. I can only think it is paging executables / shared libs in and out, trying to make room. I can't understand why this is happening with so much physical memory free.

Logs:

This problem has proven difficult to diagnose because usually when it occurs I lose control of the machine and existing programs like top and so on stop updating. I grabbed a little script from another thread on Launchpad which runs 'top' and 'vmstat' and dumps the output to a file. Today I had this script running while I had a 'high load average' event. In this case, I did not lose control of the machine as sometimes happens, but the load average spiked to 6 or so for no apparent reason.

Running KDE 4.2, with Thunderbird open and maybe two Dolphin windows. Started OpenOffice Word Processor - it took maybe 5 mins to start it and shut it down. Meanwhile load average is 6 or so, and mouse pointer unresponsive, disk is thrashing. Once OO had shut down, the thrashing stopped. I was able to repeat this behaviour by starting OO Calc, which also took minutes to start up and shut down. During this time, I was logging 'top' and 'vmstat' output, which I will attach to this bug. The script logs 'vmstat 1' and 'top -b -d 1' to 'vmstat.log' and 'top.log' respectively.
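
The logging script itself is not reproduced in this report. A minimal sketch of the same idea, assuming Python 3 is available and that 'vmstat' and 'top' are on the PATH, might look like the following. The function names `start_loggers` and `stop_loggers` are hypothetical; only the two commands and the log file names come from the description above.

```python
import subprocess
from pathlib import Path

# Commands and log file names as described in the report.
DEFAULT_COMMANDS = {
    "vmstat.log": ["vmstat", "1"],
    "top.log": ["top", "-b", "-d", "1"],
}


def start_loggers(log_dir, commands=None):
    """Start each command with its output appended to a file in log_dir.

    Returns a list of (process, file) pairs so the caller can stop them later.
    """
    if commands is None:
        commands = DEFAULT_COMMANDS
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    running = []
    for log_name, argv in commands.items():
        log_file = open(log_dir / log_name, "ab")
        # The child writes straight to the log file; stderr goes there too.
        proc = subprocess.Popen(argv, stdout=log_file, stderr=subprocess.STDOUT)
        running.append((proc, log_file))
    return running


def stop_loggers(running):
    """Terminate the child processes and close their log files."""
    for proc, log_file in running:
        proc.terminate()
        proc.wait()
        log_file.close()
```

Calling `start_loggers("/tmp/loadlogs")` before reproducing the problem and `stop_loggers(...)` afterwards would leave both logs in that (hypothetical) directory.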

I am not sure what other evidence I can capture. I noticed that 'iotop' showed OO was reading from disk at around 5MB/s when the disk was thrashing. What it could be reading @ 5MB/s for several minutes I cannot imagine, unless pages are being thrashed in and out of memory for some reason.

Can someone understand my log files and point to the offending process? If you can suggest other information to capture, please let me know. I am an experienced Linux user and I am happy to spend some time on this as it is rendering my main development machine unusable!

I don't think this problem is tied to a memory leak in any particular user-space end-user application (although it could be a leak of some sort in the X-Server or KDE4 components).

Hardware:
HP Pavilion dv1680ea
Core Duo 1.87GHz
2GB RAM
Intel integrated graphics

Software:
Kubuntu 09.04 2.6.28-11 kernel
KDE 4.2
Firefox 3.0.9, OOo 3.0 and other apps
NO SWAP CONFIGURED

[lspci]
00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub (rev 03)
     Subsystem: Hewlett-Packard Company Device 30a0
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
     Subsystem: Hewlett-Packard Company Device 30a0

Tags: high-cpu
Revision history for this message
Bobbi Manners (bobbi.manners) wrote :
Revision history for this message
Andy Whitcroft (apw) wrote :

This sounds like a kernel issue and therefore should be against linux rather than -meta. Moving to the appropriate package.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

Further Followup - Behaviour with Swap Enabled
-------------------------------------------------------------

After posting this bug, I created a swap partition on a spare ~1GB disk partition. (Who needs HP "Quickplay" anyway?)

The results were interesting ...

I noticed over the course of 48 hours or so of uptime, with regular use of the KDE desktop, that the swap was slowly consumed. In general, under typical load 'htop' would show 500MB used, 1500MB cache (of 2GB physical), and 'x' amount of swap used, where 'x' slowly increased over the time the system was active.

This suggests something is leaking memory, or leaking something. But the weird thing is that it is not showing up as memory allocated to a process in 'htop' or 'top'.

While the system was up the swap was slowly but steadily consumed. When I hit 100% swap usage, a normal system activity (starting Oracle net-listener) caused the classic disk thrashing symptoms described above, and the load average spiked. The system once again became unresponsive and had to be rebooted via the magic-sysrq RSEIUB routine.

What I do not understand is that immediately prior to this incident, while swap was close to 100% full, 1500MB of memory was still shown as cached. Surely, if memory were short, cached blocks would be evicted?

Can someone explain this behaviour to me? I do not understand the VM subsystem well enough to make sense of this situation, where most of memory is used for cache, but something is leaking in some way and filling swap.

Mysteries include:
- What is the disk activity when the incident occurs? Clearly this disk I/O starves other processes and causes the high load average. I have no idea what is causing it. Paging of some sort? Even with no swap enabled?
- How can swap slowly be consumed without the memory usage being accounted to some user-space process? I believe if I restart my X session, the accumulated leak is freed, but I am not 100% sure of this.
- How can we have 75% of physical memory used for cache, yet experience what appears to be an out of memory event of some sort? Why is the cache not evicted?
- Why is swap even used when there is cache in memory? Shouldn't the VM subsystem evict the cached blocks before hitting swap?

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

One more datapoint - system has been up 1 day almost exactly.

Sitting idle in KDE session with just Firefox running:

htop gives 621MB used of 2004MB physical
Swap sits at 540 / 1027MB used

top gives me:

top - 19:38:12 up 1 day, 34 min, 1 user, load average: 0.90, 0.50, 0.47
Tasks: 173 total, 2 running, 170 sleeping, 0 stopped, 1 zombie
Cpu(s): 5.2%us, 3.1%sy, 0.0%ni, 88.8%id, 0.0%wa, 1.5%hi, 1.5%si, 0.0%st
Mem: 2052240k total, 1916520k used, 135720k free, 14600k buffers
Swap: 1052216k total, 553548k used, 498668k free, 1261628k cached

So I have 1261628k cached, but the swap is still slowly being eaten up. When swap hits 1GB, I'll have another episode of disk thrashing and have to reboot.
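
One way to track this slow swap growth over time is to sample /proc/meminfo periodically and record the difference between SwapTotal and SwapFree. A rough sketch, assuming the usual `Name:   value kB` layout of that file. The helper names are hypothetical, and the sample text is built from the 'top' figures above rather than copied from a real /proc/meminfo.

```python
def parse_meminfo(text):
    """Return a dict of field name -> value in kB from /proc/meminfo text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields):
    """Swap in use, in kB, from parsed /proc/meminfo fields."""
    return fields["SwapTotal"] - fields["SwapFree"]


# Sample assembled from the 'top' output in this comment (an assumption,
# not a capture of the real file).
SAMPLE = """\
MemTotal:        2052240 kB
MemFree:          135720 kB
Buffers:           14600 kB
Cached:          1261628 kB
SwapTotal:       1052216 kB
SwapFree:         498668 kB
"""

print(swap_used_kb(parse_meminfo(SAMPLE)))  # 553548
```

On a live system the same function could be fed `open("/proc/meminfo").read()` once a minute, with the result appended to a log.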

Revision history for this message
VladIonescu (vladimir-ionescu) wrote :

I can confirm this behaviour on both my laptop and my work PC.
My PC has 2GB RAM and I have configured a 4GB swap partition. With KDE 4.2, compositing turned off and only one Firefox tab open, htop says 2.7GB of swap is used. That cannot be normal. My RAM usage is only 450MB (excluding cached memory).

This is the output of free:
             total       used       free     shared    buffers     cached
Mem:       2051528    1763172     288356          0      12576    1290156
-/+ buffers/cache:     460440    1591088
Swap:      4257184    2774168    1483016

Swap usage increases only during normal use (I left the computer on over the weekend and it stayed constant), but once it reaches a value it never decreases. Once the swap partition is full I experience extremely high hard disk activity, causing the system to become unresponsive.
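
For readers comparing these figures, the "-/+ buffers/cache" line of 'free' is derived from the "Mem:" line by moving buffers and cache from "used" to "free". A small sketch of that arithmetic, using the numbers from the output above (the function names are hypothetical):

```python
def used_excluding_cache_kb(used, buffers, cached):
    """Memory held by processes and the kernel, not counting buffers/cache."""
    return used - buffers - cached


def free_including_cache_kb(free, buffers, cached):
    """Memory that is free or could be reclaimed from buffers/cache."""
    return free + buffers + cached


# Values in kB taken from the "Mem:" line of the 'free' output above.
print(used_excluding_cache_kb(1763172, 12576, 1290156))  # 460440
print(free_including_cache_kb(288356, 12576, 1290156))   # 1591088
```

The first result is the roughly 450MB quoted above, which is why the 2.7GB of swap in use looks so out of proportion.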

Revision history for this message
VladIonescu (vladimir-ionescu) wrote :

In my case restarting the X.org server clears the swap.

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

Having done some more investigation I am quite confident that this is a problem with the Intel X-Server:
xserver-xorg-video-intel 2:2.6.3-0ubuntu9

When UXA is selected in the xorg.conf as follows:

Section "Device"
        Identifier "Configured Video Device"
        Driver "intel"
        Option "FramebufferCompression" "on"
        Option "AccelMethod" "UXA"
        Option "Tiling" "No"
EndSection

Then the X server slowly leaks memory until all swap is consumed, at which point the system becomes unavailable.

If I restart the X server, all the leaked memory is recovered and swap is empty once again.

Note that UXA is not the default AccelMethod - EXA is. I created an xorg.conf myself to enable UXA in order to get acceptable video performance. EXA performance is just awful in this X server for some reason (can't even enable Compiz or run Google Earth with EXA).

This problem looks similar to bugs #369759 and #360319. Note that the latter bug report claims that this happens with EXA rendering also.

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

Changed package to xserver-xorg-video-intel.

affects: linux (Ubuntu) → xserver-xorg-video-intel (Ubuntu)
Bryce Harrington (bryce)
description: updated
Bryce Harrington (bryce)
tags: added: high-cpu
Revision history for this message
Nizamov Shawkat (nizamov-shawkat) wrote :

I also experience the same bug on a notebook with the i945 chipset running Ubuntu Jaunty.
Symptoms are the same - at some point the system becomes unresponsive, and disk activity is so high that the only option is to power off the notebook.

As I was experiencing it rather frequently, I kept consoles open with dstat, htop and atop running. dstat shows that the activity is entirely "read", not "write". htop shows a load average near 25/18/8 with lots of processes in the "D" state (waiting for I/O, AFAIK). Memory consumption is about 500 MB (+1 GB cache) of 1.5 GB total. atop showed me that there were *several* programs fighting to read the disk - the most prominent was okular (in this case the bug was triggered by quickly paging through a 400-page, 2 MB PDF), but firefox, thunderbird and X also had high read activity!
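
On Linux the load average counts tasks in uninterruptible sleep (state "D") as well as runnable ones, which is how the load can climb while the CPU is mostly idle. A small sketch for listing such processes by reading /proc/<pid>/stat directly, for cases where htop itself stops updating. The function names are hypothetical, and the sketch assumes the standard stat layout in which the state letter follows the parenthesised command name.

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, IndexError):
            continue  # process exited, or the line was not as expected
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked
```

Running `uninterruptible_processes()` in a loop during an episode would show which programs are stuck waiting on the disk at each moment.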

After killing okular (it took a minute to get from htop back to bash) normal desktop responsiveness returned. But the bug is *not* in okular - at different times, different programs trigger this behaviour.

My system is Jaunty, i386 arch. The behaviour is the same with kde4.2.2 and kde4.2.3 (installed from the kubuntu-experimental ppa). I upgraded xserver-xorg-video-intel from the xorg-edgers ppa - still the same. And I did not touch the xorg configuration - right now it is UXA, see the attached xorg.log.
I don't know the default for pure Jaunty.

PS: now I know for sure - it is not Beagle that is to blame, as I expected

Revision history for this message
Chow Loong Jin (hyperair) wrote :

I'm confirming this. My restart-X cycle comes around approximately once every 5 or 6 hours of usage. The symptoms are exactly as described, with one addition: /proc/dri/0/gem_objects shows a large number of GEM objects allocated. The number of GEM objects does not really matter, though. What matters is their memory consumption (the figure on the second line). When this hits approximately 2G, my system slows almost to a halt. Starting or closing any application becomes a pain, and almost any action can cause intense swapping (I'm assuming that the heavy disk I/O is swapping, because iotop does not show anything).

The kernel I am running is 2.6.30 rc5, from http://kernel.ubuntu.com/~kernel-ppa/mainline/. That said, I don't believe this to be a kernel bug. If it were, I don't think restarting Xorg would free up all the cached memory (as shown by free) as well as the swap.

I've also found that the following patch to libdrm slows the memory leak considerably, but does not stop it:
Index: drm-snapshot-2.4.11+git20090519.f355ad89/libdrm/intel/intel_bufmgr_gem.c
===================================================================
--- drm-snapshot-2.4.11+git20090519.f355ad89.orig/libdrm/intel/intel_bufmgr_gem.c 2009-05-20 14:56:16.000000000 +0800
+++ drm-snapshot-2.4.11+git20090519.f355ad89/libdrm/intel/intel_bufmgr_gem.c 2009-05-20 14:57:20.000000000 +0800
@@ -83,7 +83,7 @@
 /* Only cache objects up to 64MB. Bigger than that, and the rounding of the
  * size makes many operations fail that wouldn't otherwise.
  */
-#define DRM_INTEL_GEM_BO_BUCKETS 14
+#define DRM_INTEL_GEM_BO_BUCKETS 0
 typedef struct _drm_intel_bufmgr_gem {
     drm_intel_bufmgr bufmgr;
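
To watch the GEM memory figure mentioned above, the second line of /proc/dri/0/gem_objects can be read periodically. A minimal sketch, assuming only what the comment states: that the second line begins with the byte count. The rest of the file's layout is not shown in this report, so the sample text and its numbers are made up for illustration, and `gem_object_bytes` is a hypothetical name.

```python
def gem_object_bytes(text):
    """Return the integer at the start of the second line of gem_objects text."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least two lines")
    return int(lines[1].split()[0])


# Hypothetical sample; the layout beyond "number first" is an assumption.
SAMPLE = "5120 objects\n2147483648 object bytes\n"

print(gem_object_bytes(SAMPLE) / 2**30)  # 2.0
```

On an affected machine the input would be `open("/proc/dri/0/gem_objects").read()`, and a value approaching 2 GiB would correspond to the point where the slowdown described above sets in.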

Changed in xserver-xorg-video-intel (Ubuntu):
status: New → Confirmed
Revision history for this message
Zack Evans (zevans23) wrote :

There were lots of changes to this area of DRM in 2.6.30rc7 and rc8 - could be worth trying this kernel and seeing if the problem goes away? If it does I don't think it will be too hard to identify the exact changes which fixed it...

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

I can confirm that this still occurs with 2.6.20rc7:

bob@gecko2:~$ uname -a
Linux gecko2 2.6.30-020630rc7-generic #020630rc7 SMP Sun May 24 01:38:23 UTC 2009 i686 GNU/Linux

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

Sorry - I meant 2.6.30-rc7 of course!

Revision history for this message
Mike.lifeguard (mikelifeguard) wrote :

2.6.30 was released - give it a try?

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

I am just downloading 2.6.30. I will test and post my results in due course.

Revision history for this message
Rocko (rockorequin) wrote :

I don't know if this is specifically an intel bug. I sometimes see the same thing occur and I'm using nvidia.

It happened again just now: the PC was doing nothing all night, but when I returned to it and pressed a key to stop the screensaver, X had become unresponsive due to disk thrashing. I could actually see X trying to redraw the screen background pixel by pixel (it finally gave up). The unresponsiveness showed no sign of abating after 20 minutes, but I eventually managed to ssh in. The culprit must have been firefox (1.2GB usage according to atop) because killing it fixed the problem. Oddly around 900 MB of swap was being used even though the top two memory processes were FF and VirtualBox, using only around 2.5 GB out of my 4 GB RAM. The 900 MB swap could have been left over from a swap load test I did 36 hours earlier - but I would have expected the swap to be freed once no longer required sometime in those 36 hours.

For reference, I found that the 2.6.28 kernel would always 'lock up' as soon as swap was hit, e.g. if I tried to run two VMs using 1.5 GB each. So I have been using the 2.6.30 kernel with the patch in http://bugzilla.kernel.org/show_bug.cgi?id=12309#c366. My initial testing showed that the 2.6.30 kernel generally works much better when swap is hit (e.g. by running two or more VMs), and the patch improves things further.

So there seem to be several problems: (1) memory leaks in some programs, (2) poor management of swap, (3) poor management of scheduling under heavy disk I/O.

Revision history for this message
Bobbi Manners (bobbi.manners) wrote :

Hi there - just a quick update.

I have been running with 2.6.30 for a couple of days now, with a relatively light workload under KDE4. My overall impression is that this problem has either been fixed or that the leak which is probably the root cause of this is at least much slower now!

After nearly two days:
  07:31:35 up 1 day, 18:36, 1 user, load average: 0.35, 0.62, 0.61

Swap usage is just 74MB:

bob@gecko2:~$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  76344  90648  26880 1053644    0    0     4     4   31  126  5  2 92  0
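
For anyone post-processing a 'vmstat' log, the column header and a sample line can be paired up to pull out the swap figure. A short sketch using the values shown above (`parse_vmstat` is a hypothetical helper, and it assumes vmstat's default units of KiB):

```python
def parse_vmstat(header_line, value_line):
    """Pair vmstat column names with the integer values on one sample line."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    if len(names) != len(values):
        raise ValueError("column count mismatch")
    return dict(zip(names, values))


header = " r b swpd free buff cache si so bi bo in cs us sy id wa"
sample = " 1 0 76344 90648 26880 1053644 0 0 4 4 31 126 5 2 92 0"

row = parse_vmstat(header, sample)
print(row["swpd"] / 1024)  # about 74.6, i.e. the ~74MB of swap quoted above
```

The `si` and `so` columns (swap-in and swap-out) are both 0 in this sample, so nothing was being swapped at the moment it was taken.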

Here is the running kernel version:

bob@gecko2:~$ uname -a
Linux gecko2 2.6.30-020630-generic #020630 SMP Wed Jun 10 09:45:40 UTC 2009 i686 GNU/Linux

Testing continues ...

Revision history for this message
Nizamov Shawkat (nizamov-shawkat) wrote : Re: [Bug 367377] Re: High load average, disk read, no apparent reason - 2.6.28-11

I am also confident that the bug has gone after upgrading to kernel 2.6.30 and using the intel video drivers (from the x-swat ppa) that default to the older acceleration architecture (EXA).

Revision history for this message
Zack Evans (zevans23) wrote :

Been looking through changelogs for the version Bob Manners and I have been going through... I can see there are several memory leaks that might have been the culprit that have been fixed recently, or that had been fixed in drivers way in advance of what I was running.

Bob: your bug, but suggest we close this one and if we find more leaks, open another one?

Nizamov: UXA works reasonably well for me now in Jaunty as long as I use xorg-edgers stuff - any particular reason you went back to EXA - did you have other problems?

Revision history for this message
Zack Evans (zevans23) wrote :

Sorry to double-post but just realised I lied. :-)

Since the bug was filed against 2.6.28-11 and we've gone around it by going to .30... what we gonna do about a fix for official Jaunty?

Revision history for this message
Bryce Harrington (bryce) wrote :

This sounds a lot like bug #376092, which I'm about to upload the patch for, along with the patch for bug #360319 (another memory leak). I'm going to dupe this to #376092. (Both of these are actually bugs in the intel portion of mesa, rather than directly in the -intel DDX driver.)

Regarding a fix for Jaunty, we shipped EXA as the default there, so that reduces the priority on backports of UXA fixes. I'm also a bit leery of backporting mesa fixes to Jaunty because we've found the code to be pretty brittle; several times we backported changes which seemed safe at first but later were suspected to cause freezes, low perf, etc. on different hardware, so I'm concerned about the risk of introducing a regression inadvertently.

Revision history for this message
Zack Evans (zevans23) wrote :

Bryce: Yep, I think this is now effectively a dupe, and I'd forgotten this only affected EXA. Thanks, another one off the list!

Revision history for this message
Zack Evans (zevans23) wrote :

Bryce: Yep, I think this is now effectively a dupe, and I'd forgotten this only affected UXA. Thanks, another one off the list!
