Prevent extended periods of thrashing

Bug #27441 reported by Ole Laursen
This bug affects 6 people
Affects                       Status     Importance  Assigned to  Milestone
linux (Ubuntu)                Won't Fix  Undecided   Unassigned
linux-source-2.6.15 (Ubuntu)  Invalid    Medium      Unassigned
linux-source-2.6.22 (Ubuntu)  Won't Fix  Undecided   Unassigned

Bug Description

Two days ago I clicked a torrent link in Epiphany, which for some reason made one of the desktop applications eat a lot of memory. The machine suddenly froze and started thrashing. I waited about 10 minutes, then gave up, pressed the power button and left. When I got back the next morning, the machine had left swap hell, noticed the power button press and turned itself off.

This made me think that it would be a good idea to do something to prevent the kernel from spending more than 10 minutes with the desktop in a totally frozen state. IMHO freezing everything for more than a few seconds does not make any sense on a desktop machine. It might be enough to set the ulimit settings to something sensible, e.g. so that no application can eat more than 90% of the RAM, or so that at least, say, 100 MB always remains available for the desktop on the machine.
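As a rough illustration of the ulimit idea above, the following bash sketch computes a per-process cap from the two rules mentioned (at most 90% of RAM, and at least 100 MB left over) and applies it to a single command. It assumes a Linux system with /proc/meminfo and a shell whose `ulimit -v` takes KiB, as bash does; the helper names `limit_kb` and `run_capped` are made up for this example. Note that `ulimit -v` caps virtual address space rather than resident memory, so it only approximates "RAM eaten".

```shell
# Hypothetical helper: given total RAM in KiB, print a per-process cap in KiB.
# The cap is the smaller of 90% of RAM and "RAM minus a 100 MB reserve".
limit_kb() {
    local total_kb=$1
    local reserve_kb=$((100 * 1024))
    local by_percent=$((total_kb * 90 / 100))
    local by_reserve=$((total_kb - reserve_kb))
    if [ "$by_reserve" -lt "$by_percent" ]; then
        echo "$by_reserve"
    else
        echo "$by_percent"
    fi
}

# Hypothetical wrapper: run a command with that cap applied.
# The subshell keeps the limit from affecting the calling shell.
run_capped() {
    local total_kb
    total_kb=$(awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo)
    (
        ulimit -v "$(limit_kb "$total_kb")"
        "$@"
    )
}
```

For example, `run_capped ./some-program` would start the program with the cap in place, so an allocation beyond the cap fails inside that program instead of pushing the whole machine into swap. A system-wide default would have to be configured elsewhere; this only shows the arithmetic and the mechanism.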

I'm experiencing this on a fully upgraded Breezy Badger; the kernel seems to be 2.6.12-10-686.

I previously reported this as bug #27392 for the kernel, but the maintainer rejected the notion that the kernel could do anything about it and suggested I file a new bug. And yes, I realise that in some cases a default process limit will be wrong. I'm not arguing that everyone should have these settings forced down their throats, I'm arguing that the defaults are wrong. It is a lot easier to remove a protection if you need to than it is to protect the system yourself, and most people probably don't run real memory hogs like scientific simulations.

Revision history for this message
Matthew Lange (matthewlange) wrote : Re: Thrashing hell
Revision history for this message
Martin Bergner (martin-bergner) wrote :

Hi, is this still a problem on Edgy or Dapper?

Revision history for this message
Ole Laursen (olau) wrote :

I believe so, but unfortunately I don't have access to an Ubuntu installation with swap for the time being. But I've made a little test program that allocates memory to reproduce the problem.

Compile it with (sole requirement is a C compiler):

  gcc -Wall memhog.c -o memhog

Then test with "./memhog mbs-to-allocate", e.g. "./memhog 400" to eat up 400 MB of memory. The program sits tight on the memory for 60 seconds, then quits.

Ole Laursen (olau)
description: updated
Changed in linux-source-2.6.12:
assignee: martin-bergner → nobody
Revision history for this message
Ole Laursen (olau) wrote :

The thrashing hell has happened to me a couple of times during the last few months. At one point I simply left the machine (it had been thrashing for half an hour) and came back the next day. Another time it stopped after 5-10 minutes. This is on Ubuntu 6.10.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux-source-2.6.15 (Ubuntu) because there has been no activity for 60 days.]

Revision history for this message
Anders Rune Jensen (anders-gnulinux) wrote :

This is still a problem in Ubuntu 7.10. I really think this is a major defect in Ubuntu. Back in the old days there was a nice oom-killer that simply killed the process that led to this situation. It's incredibly frustrating to have the machine go into swap-hell when the only thing you can do is wait ~30 min or restart the computer. The problem should be reproducible on all machines, and there's a test case, so why is there no interest in fixing this?

I have 2 GB of memory, so it's not like I have too little. That makes it a problem for more or less everyone who has a process that leaks over time (firefox and monodevelop come to mind here).

Revision history for this message
Caroline Ford (secretlondon) wrote :

The Hardy Heron Alpha series is currently under development and contains an updated version of the kernel. You can download and try the new Hardy Heron Alpha release from http://cdimage.ubuntu.com/releases/hardy/ . You should be able to test the new kernel using the LiveCD. If you can, please verify if this bug still exists or not and report back your results. General information regarding the release can also be found here: http://www.ubuntu.com/testing/ . Thanks.

Please answer these questions:

* Is this reproducible?
* If so, what specific steps should we take to recreate this bug?

This will help us to find and resolve the problem.

Changed in linux-source-2.6.22:
status: New → Incomplete
Revision history for this message
Ole Laursen (olau) wrote :

What makes you think the problem would be gone?

Does the LiveCD contain a C compiler? Otherwise, if you have access to the new kernel, you could try my instructions above.

Let me spell it out. Install GCC with synaptic (or "apt-get install gcc"). Fetch the test program I attached above. Then run the following command from a shell:

  gcc -Wall memhog.c -o memhog

Make sure swap is activated. Then run "./memhog megabytes-to-allocate", e.g. "./memhog 400" to eat up 400 MB of memory. The program sits tight on the memory for 60 seconds, then quits. You'll need to adjust the amount according to how much memory you have on your system. You can take a look with

  free -m

This will take you maybe 10 minutes.
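To help pick the number of megabytes, the same figures that "free -m" shows can be read from /proc/meminfo. The sketch below is only an illustration: the helper names `meminfo_mb` and `memory_summary` are made up here, and which value actually triggers thrashing on a given machine still has to be found by trial.

```shell
# Hypothetical helper: print a /proc/meminfo field (reported in kB) in MB.
# Usage: meminfo_mb FIELD [FILE], e.g. meminfo_mb MemTotal
meminfo_mb() {
    awk -v key="$1:" '$1 == key { print int($2 / 1024) }' "${2:-/proc/meminfo}"
}

# Print RAM, swap and their sum, as a starting point for choosing
# the memhog argument. An optional file argument replaces /proc/meminfo.
memory_summary() {
    local ram swap
    ram=$(meminfo_mb MemTotal "$1")
    swap=$(meminfo_mb SwapTotal "$1")
    echo "RAM: ${ram} MB, swap: ${swap} MB, total: $((ram + swap)) MB"
}
```

Running `memory_summary` with no argument reads the live system; a swap figure of 0 would mean swap is not activated.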

Also, why did this bug report end up on the kernel? Ben Collins on the previous bug clearly stated that he didn't think it was a kernel problem. No matter what, it's probably first and foremost a question of providing sensible configuration defaults.

Revision history for this message
Safar (faisal-itcompletes) wrote :

This just happened to me on Gutsy with kernel 2.6.22-14-generic.

I have 2 GiB of RAM and 1.5 GiB of swap space.

I was watching the system monitor at the time it happened. I wanted to start a process and to monitor its memory usage pattern.

Before the process started, user memory usage was at about 20% and swap usage at about 5%. Once the process got going, user memory usage surprisingly dropped to 10% and swap usage jumped to 10%. Then it froze and the system went into thrashing hell. I never saw user memory usage increase.

After that I could not even log in on a terminal. I waited for about 15 minutes and then powered off the system. (It has probably affected the hard drive badly, since heavy use was underway, but I had to use the desktop; it was the start of my work day.)

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi,

Here is a simple shell script that will log your system's activity into two log files, vmstat.log and top.log. By default it runs for 600 seconds, but you can specify the number of seconds to run for if required. This can possibly help you track down the rogue memory hogger.

vmstat.log will show you the general overview of system activity.
top.log will show per-process activity

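#!/bin/sh
# Log system-wide activity (vmstat) and per-process activity (top)
# once per second for a fixed duration, then stop both loggers.

# Run time in seconds; defaults to 600 when no argument is given.
secs=${1:-600}

vmstat 1 > vmstat.log &
vmstatpid=$!
top -b -d 1 > top.log &
toppid=$!

sleep "$secs"

# Report the logger PIDs, then terminate them.
echo "$vmstatpid $toppid"
kill "$vmstatpid" "$toppid"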

Revision history for this message
Ole Laursen (olau) wrote :

Colin, note that this bug is not about tracking down faulty applications. It's about adding a default protection to insulate the system from the faulty applications.

Revision history for this message
Sergio Zanchetta (primes2h) wrote :

The 18 month support period for Gutsy Gibbon 7.10 has reached its end of life -
http://www.ubuntu.com/news/ubuntu-7.10-eol . As a result, we are closing the
linux-source-2.6.22 kernel task. It would be helpful if you could test the
new Jaunty Jackalope 9.04 release and confirm if this issue remains -
http://www.ubuntu.com/getubuntu/releasenotes/904overview. If the issue still exists with the Jaunty
release, please update this report by changing the Status of the "linux (Ubuntu)"
task from "Incomplete" to "New". Also please be sure to run the command below
which will automatically gather and attach updated debug information to this
report. Thanks in advance.

apport-collect -p linux-image-2.6.28-11-generic 27441

Changed in linux-source-2.6.22 (Ubuntu):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Ole Laursen (olau) wrote :

Problem remains.

It's a general Linux problem. Here's another annoyed guy who dubs it "the Achilles heel" of Linux, comparing it to (of all things) Windows 98:

http://www.linuxquestions.org/questions/linux-kernel-70/swap-thrashing-can-nothing-be-done-612945/

If you can fix this, I think it's worthy of a Slashdot story. The OOM killer did prompt a lot of news stories back in the day. It's not easy to fix, but I think with complete control over the OS, as is the case with Ubuntu, it's possible.

Also I guess it's a security problem on servers. You don't need root access to a machine, you just need shell access and a memory eating process, and you're ready to take down the machine. :)

So just to reiterate: a faulty process can allocate enough memory to push Linux into thrashing, i.e. constant paging to and from the disk. This behaviour can continue for more than 30 minutes without any progress. Meanwhile the machine is unusable and doesn't respond to input of any kind. So the objective is to put in default safety guards to prevent this from ever happening, by denying the faulty process more memory or terminating it, e.g. after having detected that the past, say, x seconds were spent thrashing.
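The "past x seconds were spent thrashing" check could be approximated from user space along the following lines. This is a sketch under stated assumptions, not a proposed fix: it treats a sustained high rate of the pswpin and pswpout counters in /proc/vmstat as a stand-in for thrashing, which is only a heuristic, and the function names, the 1000 pages/s threshold and the 10 second limit are all invented for illustration.

```shell
# Sum the pswpin and pswpout counters (pages swapped in/out since boot).
# Reads /proc/vmstat unless another file is given.
swap_pages() {
    local name value sum=0
    while read -r name value; do
        case $name in
            pswpin|pswpout) sum=$((sum + value)) ;;
        esac
    done < "${1:-/proc/vmstat}"
    echo "$sum"
}

# Given the current streak, the pages swapped during the last second and
# a threshold, print the new count of consecutive "busy" seconds.
next_streak() {
    if [ "$2" -ge "$3" ]; then
        echo $(($1 + 1))
    else
        echo 0
    fi
}

# Hypothetical watchdog: return once swapping has stayed at or above
# THRESHOLD pages/s for LIMIT seconds in a row. Both defaults are guesses.
watch_thrashing() {
    local threshold=${1:-1000} limit=${2:-10}
    local streak=0 prev now
    prev=$(swap_pages)
    while sleep 1; do
        now=$(swap_pages)
        streak=$(next_streak "$streak" $((now - prev)) "$threshold")
        prev=$now
        if [ "$streak" -ge "$limit" ]; then
            echo "swapping heavily for ${limit}s; a real guard would act here"
            return 0
        fi
    done
}
```

What the guard should do at that point (refuse further allocations, or pick a process to terminate) is the hard part and is deliberately left out. A user-space watchdog like this may itself be starved once the machine is thrashing, which is one reason such a guard is difficult to get right.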

Ole Laursen (olau)
summary: - Thrashing hell
+ Prevent extended periods of thrashing
Changed in linux (Ubuntu):
status: Incomplete → New
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Brandon Thomson (gravix) wrote :

When this bug hits, I feel like I am using Windows 3.1 again. The whole point of switching from cooperative to pre-emptive multitasking was to keep poorly-written apps from bringing down the whole system.

Revision history for this message
jk (mail-j-k) wrote :

This happens to me quite frequently. When it does, it starts in the morning at about 7 am, and I have no idea at all what's causing it. I don't have any cronjobs scheduled for that time. It started years ago, some time after I switched to using hard disk encryption. I've never seen the problem before that.

Load goes up to 4 or 5, but when checking with "top", there is not one single process in the list which uses more than a few percent of CPU time. Most of the time "top" itself is the biggest CPU user. There has never been anything remotely interesting in the log files.

The effects are like those of a fork bomb: in the beginning I can still issue some commands, but pretty soon the machine becomes completely unresponsive. It keeps answering pings and all the routing seems to work, but I can't even get to the SSH login any more.

Sometimes the machine recovers after 30-60 minutes. When I need to use it, I have to power it off with the power switch. Either way, afterwards it works normally again, with no trace left of what had just happened.

This phenomenon has persisted through several reinstalls of the OS, so I'm pretty sure it's not due to some rootkit infection. It also doesn't seem directly related to any HDD, since this has happened with various system drives I've had over the years.

Revision history for this message
John Kennedy (legendre17) wrote :

It's indeed very easy to get thrashing: in my case, with 2 GB of RAM, running a couple of memory-intensive programs like Mathematica and Gimp Resynthesizer at the same time makes the system almost completely unresponsive. I'm talking about Lucid Lynx with all updates applied. I think for a desktop system, responsiveness is probably the most important thing, so users should at least have an option to alleviate this problem, even at the risk of losing some efficiency. If a program misbehaves and thrashing starts, I'd like to be able to kill that program without rebooting and without waiting for hours for the problem to resolve itself.

Could a developer please tell us more about the technicalities involved?

For instance, what exactly happens during the thrashing? I understand the hard drive is running like crazy to move pages in and out of swap space, but surely preemptive multitasking should mean that the access to swap space could be paused while other processes get their share of CPU time, right? Why, then, can the mouse hang completely for minutes? Doesn't this mean that the process controlling the display of the mouse doesn't get to run for minutes at a time? Why is that allowed at all? If the reason is that the process (or memory associated with it) has been paged out, couldn't there be a list of paging priorities that essentially prohibits the OS from paging out essential UI elements?

tags: added: kernel-fs kernel-needs-review
Revision history for this message
Andy Whitcroft (apw) wrote :

I suspect I have seen similar behaviour on Lucid, and I think update-locatedb is the trigger in my case.

tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
Andy Whitcroft (apw)
tags: removed: kernel-candidate
Revision history for this message
John Kennedy (legendre17) wrote :

Just wondering about the status of this: is anyone looking at the problem? I'd be curious for a comment on how hard this is to fix, and if there are any plans for fixing it soon. Thanks!

Revision history for this message
David Mitchell (a-launchpad-admin-forestit-co-uk) wrote :

I have landed here because I have a 2GB (MAX possible) system which has gone into swap thrash mode. I know what has done it, as I have launched about 15 python scripts. However, it would seem that one cure would be to slow down the multitasking: at present it seems that a process is scheduled, starts swapping pages in and is then out of its timeslot before it has loaded all the pages. If there was an option to slow down the task switching so tasks had a chance of actually getting all the pages swapped in and doing some processing before they are pre-empted, then this might help! This could be adaptive depending on the amount of wait time on the CPU. At present my system is reporting

Cpu(s): 2.8%us, 2.8%sy, 0.0%ni, 34.3%id, 59.3%wa, 0.0%hi, 0.8%si, 0.0%st
Mem: 1013336k total, 996952k used, 16384k free, 2840k buffers
Swap: 4883448k total, 1810104k used, 3073344k free, 46136k cached

OK yes it only has 1GB at present but even if I put 2GB in it would still be oversubscribed...

(10.04 new install)
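The wait figure that such an adaptive scheme would depend on (59.3%wa in the top output above) can be derived from two samples of the "cpu" line in /proc/stat. The sketch below only shows that measurement, not any scheduler change; the function names are made up, and it assumes a kernel whose "cpu" line has at least eight counters (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Print "total iowait" for one "cpu ..." line from /proc/stat.
# Only the first eight counters are summed, since guest time is
# already included in user time.
cpu_totals() {
    set -- $1          # split the line into words
    shift              # drop the leading "cpu" label
    echo "$(($1 + $2 + $3 + $4 + $5 + $6 + $7 + $8)) $5"
}

# Print the iowait share, in whole percent, between two samples.
iowait_pct() {
    local t1 w1 t2 w2
    set -- $(cpu_totals "$1") $(cpu_totals "$2")
    t1=$1 w1=$2 t2=$3 w2=$4
    # No elapsed ticks between the samples: report 0 instead of dividing by 0.
    [ "$t2" -gt "$t1" ] || { echo 0; return; }
    echo $(( (w2 - w1) * 100 / (t2 - t1) ))
}
```

A possible way to use it: capture `grep '^cpu ' /proc/stat` into a variable, sleep a few seconds, capture it again, and pass both strings to `iowait_pct`. A persistently high value alongside heavy swap use would match the situation described here.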

Revision history for this message
Tim Gardner (timg-tpi) wrote :

This is starting to be one of those dog pile bugs. I want each of you guys to start your own bug report using 'ubuntu-bug linux'. While the symptoms may appear similar, the root causes could be quite different. Not to mention the fact that this bug was reported against a 5-year-old kernel.

Changed in linux (Ubuntu):
importance: Medium → Undecided
status: Triaged → Won't Fix
tags: removed: kernel-fs kernel-reviewed
Revision history for this message
Ole Laursen (olau) wrote :

Hi! I'm the original reporter. This problem still occurs with new kernels, otherwise I would have closed the bug. Tim Gardner, would you please explain whether your last remark applies to me too?

I understand that there's no real momentum behind getting this fixed, it's probably not simple either, but I believe this bug still serves a useful purpose as a landing page for people who encounter the problem.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

Ole - please start a new bug (using 'ubuntu-bug linux'). There are any number of reasons why your disk can start thrashing. _Every_ kernel has been different in that regard since about 2.6.27, with all of the changes in the CPU and I/O schedulers.

Revision history for this message
John Kennedy (legendre17) wrote :

Tim, I don't think the question is WHY the thrashing starts. I understand that trying to use more than the physical RAM installed on your system will generate paging.

The problem as I see it is that Linux seems to become unusable once thrashing starts. I will open a new bug about this, but I think it won't be very different from Ole's description: the point is that whenever a program allocates too much memory, any interaction with the system becomes impossible (mouse won't even move, terminal won't open, and I'm pretty sure, though I'll have to try this, that even remote connections might become problematic or impossible) and a reboot is necessary (unless the program decides to release or stop using the memory it allocated).

Revision history for this message
John Kennedy (legendre17) wrote :

Sorry for the delay... I opened a new bug report on this, bug #620074.
