boinc stops on error after a few days (md5_file: Too many open files) in stderrdae.txt

Bug #968021 reported by LeForgeron
This bug affects 1 person
Affects         Status        Importance  Assigned to    Milestone
boinc (Ubuntu)  Fix Released  Medium      Unassigned
Precise         Fix Released  Medium      Daniel Hahler

Bug Description

== SRU Justification ==

Impact: when opened (Oneiric), the bug was/is as described: a long-running boinc system eventually fails to compute any more work units, because every computed work unit leaks one file descriptor in the main boinc daemon (regardless of the kind of project subscribed). The faster the work units are processed, and the lower the system's file descriptor limit, the sooner the bug strikes (usually within one week of uninterrupted uptime, but it might take months on slower systems).
This bug affects all users from Oneiric (boinc 6.12.33+dfsg-1.1ubuntu0.1) up to, but not including, 7.0.27 (that one is fine).
(Well, 7.0.23, 7.0.24 & 7.0.25 have another issue: computation errors, but no more leak; 7.0.26 was not tested.)
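
To put rough numbers on the impact: with one descriptor leaked per completed work unit, the time until the daemon hits its limit is roughly the remaining headroom divided by the completion rate. A minimal sketch follows; the function name and the sample figures (a limit of 1024, 22 descriptors in normal use, 150 work units per day) are illustrative assumptions, not measurements from this report.

```shell
# Estimate the number of whole days until a process runs out of file
# descriptors, assuming exactly one descriptor leaks per completed work unit.
# Arguments (all hypothetical sample inputs):
#   $1 = descriptor limit, $2 = descriptors in normal use,
#   $3 = work units completed per day
estimate_days_to_exhaustion() (
    limit=$1
    in_use=$2
    units_per_day=$3
    if [ "$units_per_day" -le 0 ]; then
        echo "units_per_day must be positive" >&2
        exit 1
    fi
    # Integer division: remaining headroom / leak rate per day.
    echo $(( (limit - in_use) / units_per_day ))
)

estimate_days_to_exhaustion 1024 22 150   # prints 6
```

With these sample figures the estimate is 6 days, in line with the "within one week" observation; a slower host completing 10 units per day would last about 100 days.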

Test case: easy, but very long: run boinc for at least one complete work unit (depending on the project, a unit can take from 5 minutes to many hours), then run "lsof" on the boinc daemon and check the end of the listing. After more units have been processed, the list reported by "lsof" should not be longer than before. Computation of each unit must succeed.
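
The test case could be scripted along these lines. This is only a sketch: it assumes a Linux host (it reads /proc/<pid>/fd, which needs root for another user's process), and the function names and the default tolerance of 5 are made up for illustration.

```shell
# Count the open file descriptors of process $1 via /proc (Linux only).
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Succeed if the count grew by no more than a tolerance between two samples.
#   $1 = count before, $2 = count after, $3 = allowed growth (default 5,
#   an arbitrary allowance for short-lived descriptors such as uploads)
fd_stable() (
    before=$1
    after=$2
    tolerance=${3:-5}
    [ $(( after - before )) -le "$tolerance" ]
)

# Example use (commented out; needs a running boinc daemon):
#   pid=$(pidof boinc)
#   before=$(fd_count "$pid")
#   ... wait until several work units have completed ...
#   after=$(fd_count "$pid")
#   fd_stable "$before" "$after" && echo "no leak seen" || echo "possible leak"
```

Comparing two samples taken several work units apart avoids having to read the whole "lsof" listing by eye; the tolerance is there because the count can rise briefly while results are sent back.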

Regression Potential: I do not know what the change is, so I cannot discuss its impact. But boinc must be able to run unattended for months without such a problem, and without a reboot, especially on an LTS.

== Original Description ==

There seems to be a file descriptor leak in the boinc process (client side).

After a few days of loading the system without trouble, it suddenly stops working.

Relaunching it usually works (but having to actively manage a system running boinc is not a decent solution).

The following command gives a clue:

$ sudo lsof -p `pidof boinc`

The number of open file descriptors keeps increasing as boinc tasks are completed (more visible when the projects have tasks that run fast on the hardware, such as sudoku or milkyway/nvidia).

Many entries look like this:

boinc 15348 boinc 623r DIR 8,1 4096 29492116 /var/lib/boinc-client/slots/12
boinc 15348 boinc 624r DIR 8,1 4096 29492173 /var/lib/boinc-client/slots/13
boinc 15348 boinc 625r DIR 8,1 4096 29492116 /var/lib/boinc-client/slots/12
boinc 15348 boinc 626r DIR 8,1 4096 29492084 /var/lib/boinc-client/slots/8
boinc 15348 boinc 627r DIR 8,1 4096 29492085 /var/lib/boinc-client/slots/9
boinc 15348 boinc 628r DIR 8,1 4096 29492116 /var/lib/boinc-client/slots/12
boinc 15348 boinc 629r DIR 8,1 4096 29492173 /var/lib/boinc-client/slots/13
boinc 15348 boinc 630r DIR 8,1 4096 29492116 /var/lib/boinc-client/slots/12
boinc 15348 boinc 632r DIR 8,1 4096 29492018 /var/lib/boinc-client/slots/2
boinc 15348 boinc 633r DIR 8,1 4096 29492040 /var/lib/boinc-client/slots/4
boinc 15348 boinc 634r DIR 8,1 4096 29492018 /var/lib/boinc-client/slots/2
boinc 15348 boinc 635r DIR 8,1 4096 29492062 /var/lib/boinc-client/slots/6
boinc 15348 boinc 636r DIR 8,1 4096 29492116 /var/lib/boinc-client/slots/12
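
A listing like this can be summarised by tallying how many descriptors point at each slot directory. The helper below is a sketch, not part of boinc or lsof: the function name is made up, and it assumes the default lsof column layout shown above (TYPE in the 5th field, path in the last field).

```shell
# Read `lsof` output on stdin and print, for every slot directory that is held
# open as a DIR entry, how many descriptors point at it (highest count first).
count_slot_dirs() {
    awk '$5 == "DIR" && $NF ~ /\/slots\/[0-9]+$/ { n[$NF]++ }
         END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2
}

# Example use (commented out; needs a running boinc daemon):
#   sudo lsof -p "$(pidof boinc)" | count_slot_dirs
```

Fed the 13 lines quoted above, it reports 5 entries for slots/12 and 2 each for slots/13 and slots/2, which makes the repeated opens of the same directory easy to spot.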

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: boinc 6.12.33+dfsg-1.1ubuntu0.1
ProcVersionSignature: Ubuntu 3.0.0-17.30-generic 3.0.22
Uname: Linux 3.0.0-17-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 1.23-0ubuntu4
Architecture: amd64
Date: Thu Mar 29 08:59:09 2012
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111012)
PackageArchitecture: all
SourcePackage: boinc
UpgradeStatus: No upgrade log present (probably fresh install)

Daniel Hahler (blueyed) wrote :

Thank you for this report.

Can you try this with version 7.0.15 of BOINC, which is available in Ubuntu Precise (development branch) or via the pkg-boinc PPA at https://launchpad.net/~pkg-boinc/+archive/testing , please?

I cannot confirm this on a box where the BOINC process has been running since March 15th: there are only MEM regions matching "slot" in the lsof list.

Maybe this is caused by a specific project in your list? I am neither running sudoku nor milkyway on my host(s).

Changed in boinc (Ubuntu):
status: New → Incomplete
importance: Undecided → Medium
Gianfranco Costamagna (costamagnagianfranco) wrote :

Daniel, I'm going to package 7.0.23 from the Debian git; maybe that is better for testing... :)
I'm talking with Steffen right now about some patches...

LeForgeron (jgrimbert) wrote :

I tested the PPA version: far better. The number of open files seems to stay at 22 or less, rather than climbing into the hundreds as tasks are performed.

I doubt the original issue is caused by a project, as projects seem to have their own processes, each new task being performed in a new process. Only the central managing process (boinc) seems to be long-lived. (But I do not know the internals of boinc, so I might be wrong.)

Gianfranco Costamagna (costamagnagianfranco) wrote :

https://code.launchpad.net/~costamagnagianfranco/+archive/boinc
I'm building 7.0.23 in my personal archive, and it will be available in a few hours.

Please try this newer version, built from the latest upstream 7.0.23 plus a few patches from Steffen (and a review from me, to fix a build error).

Steffen Möller (moeller-debian) wrote : Re: [Bug 968021] Re: boinc stops on error after a few days (md5_file: Too many open files) in stderrdae.txt

On 03/29/2012 11:25 PM, LocutusOfBorg wrote:
> https://code.launchpad.net/~costamagnagianfranco/+archive/boinc
> I'm building 7.0.23 in my personal archive, and it will be available in a few hours.
>
> Please try this newer version, from latest upstream 7.0.23 and few
> patches from Steffen (and a review from me, to fix a build error)

I have now updated that patch with the omitted str_replace inclusion.

Concerning that stop: I have had bad experiences on servers with
a working $DISPLAY in the ssh session, e.g. because of the -X argument to ssh.
The BOINC client then happily uses the X screen saver library to find
reasons not to compute. Could you verify that this is not the cause of your
observation?

Steffen
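
One way to check this on a Linux host is to look for DISPLAY in the environment the daemon was started with. The helper below is a sketch with a made-up name; it only shows the environment at process start, so a positive result is a hint rather than proof that the client is talking to an X server.

```shell
# Read a NUL-separated environment block (the format of /proc/<pid>/environ)
# on stdin and succeed if variable $1 is set in it.
env_has_var() {
    tr '\0' '\n' | grep -q "^$1="
}

# Example use (commented out; Linux only, needs a running boinc daemon):
#   sudo cat "/proc/$(pidof boinc)/environ" | env_has_var DISPLAY \
#       && echo "daemon was started with DISPLAY set"
```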

LeForgeron (jgrimbert) wrote :

I'm not using an ssh session, so I cannot verify that part.

If a comparison helps: the Lucid version does NOT show the same behaviour as Oneiric (stopping after a while with md5 complaining about too many open files).

Best regards (another test, with 7.0.23, will start soon)

LeForgeron (jgrimbert) wrote :

Well, 7.0.23 seems to have an issue running sudoku: the tasks quickly ended in error. Other projects might be affected too, but their tasks take far longer (sudoku takes about 10 to 20 minutes per task, other projects need 5 to 7 hours).

Reverting to 7.0.15, sudoku seems fine.
(Both were tried with freshly downloaded tasks: 7.0.23 downloaded tasks for itself, and so did 7.0.15, so it is not a compatibility issue in the storage caused by mixing them.)

I'm sticking to 7.0.15 so far (at least for the weekend)

LeForgeron (jgrimbert) wrote :

It seems that 7.0.15 does not like being run without an X session (?!?)

LeForgeron (jgrimbert) wrote :

And here is the end of stdoutdae.txt (same issue: X session closed around 17:00... no work done)

Gianfranco Costamagna (costamagnagianfranco) wrote :

I think building boinc without X is possible, but for the moment boinc is shipped with X dependencies, in order to fix the "idle movement" bug. (The alternative would be for Ubuntu/Debian to ship another boinc package for PCs without graphics.)

LeForgeron (jgrimbert) wrote :

Local conclusion/workaround: I reverted to the Oneiric version and added a crontab entry for root to restart boinc-client every day.

0 16 * * * /usr/bin/sudo -s /etc/init.d/boinc-client restart

Best regards.
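
A variation on this workaround would restart the client only when it is actually close to its descriptor limit, rather than every day regardless. The sketch below assumes a Linux host; the function names and the 80 percent threshold are arbitrary choices for illustration, and it does not handle a limit reported as "unlimited".

```shell
# Read the text of /proc/<pid>/limits on stdin and print the soft limit for
# open files (the first number on the "Max open files" row).
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }'
}

# Succeed if $1 (descriptors in use) exceeds $3 percent of $2 (the limit).
# The default of 80 percent is an arbitrary choice for this sketch.
should_restart() (
    in_use=$1
    limit=$2
    percent=${3:-80}
    [ $(( in_use * 100 )) -gt $(( limit * percent )) ]
)

# Example use in a wrapper script run from root's crontab (commented out):
#   pid=$(pidof boinc) || exit 0
#   limit=$(soft_nofile_limit < "/proc/$pid/limits")
#   in_use=$(ls "/proc/$pid/fd" | wc -l)
#   should_restart "$in_use" "$limit" && /etc/init.d/boinc-client restart
```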

Gianfranco Costamagna (costamagnagianfranco) wrote :

Could you please try 7.0.24 from my repo?
https://code.launchpad.net/~costamagnagianfranco/+archive/boinc

thanks

LeForgeron (jgrimbert) wrote :

Well, Dear Locutus... I did.

It wasted at least 12 work units of Sudoku, all ending in error, even after restarting the project.
Milkyway on CUDA seems happy.
ABC got wasted too, also about 12 work units.
I did not dare to check other projects; the damage is already enough.

I won't retry that.

Gianfranco Costamagna (costamagnagianfranco) wrote :

OK, I exchanged some mails with the upstream developers; this bug will be fixed in a future release (I hope to have a working patch in a few days; it will not be in .25, since that has already been released).

I'll let you know

Changed in boinc (Ubuntu):
status: Incomplete → Confirmed
Daniel Hahler (blueyed) wrote :

@LocutusOfBorg: Can you please elaborate on what the error is? Is there an upstream ticket you could link to?

Changed in boinc (Ubuntu):
status: Confirmed → Triaged
Gianfranco Costamagna (costamagnagianfranco) wrote :

Sorry, I sent a mail directly to Rom and CCed the boinc_alpha mailing list, but the archive is only available to list members.

Rom told me "Okay, the file descriptor leak should be fixed in the 7.0.x builds. I’ll check if the bug fix needs to be back-ported to the 6.12 branch.".

I have now packaged 7.0.25 in my PPA; it would be nice to see whether the problem has been fixed.

https://code.launchpad.net/~costamagnagianfranco/+archive/boinc

(But I am not optimistic, since I don't remember seeing this fix in the changelog; I'll look and let you know when I see the fix.)

@LeForgeron I automatically package the boinc version taken from Debian, so you can set up my PPA and tell us when you no longer have this problem! That would be so nice.

LeForgeron (jgrimbert) wrote :

Same issue with 7.0.25 as with 7.0.23 & 24: computation error for all units that need the CPU (only 18 sudoku WUs wasted this time).

Maybe the issue is fixed, but if the computation fails, there is no point in upgrading.

LeForgeron (jgrimbert) wrote :

It now seems to work with 7.0.27 from the PPA (but the menu has the same issue as in the official release).
At least sudoku does not exit immediately; other work units are in progress.

Gianfranco Costamagna (costamagnagianfranco) wrote :

I'm glad to see it's working now.

Let me know if you encounter further troubles with this bug.

7.0.27 is going to be backported into precise soon.

Gianfranco Costamagna (costamagnagianfranco) wrote :

(and oneiric as well)

LeForgeron (jgrimbert) wrote :

Others are working fine too (ABC & Collatz). No regression on Einstein (GPU only), milkyway(GPU only) nor sudoku(CPU).

Fine for me.

Steffen Möller (moeller-debian) wrote :

On 05/22/2012 03:05 PM, LeForgeron wrote:
> Others are working fine too (ABC & Collatz). No regression on Einstein
> (GPU only), milkyway(GPU only) nor sudoku(CPU).
>
> Fine for me.

I experienced that once, too. It was with WorldCommunityGrid, if I am not mistaken. And it was a lllooong time ago.
Let us collect a bit how many of us are experiencing this.

Steffen

Daniel Hahler (blueyed)
Changed in boinc (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium
Chris Halse Rogers (raof) wrote :

This bug is missing some required information for an SRU; particularly, the description should have an Impact, Test Case, and Regression potential section as documented in https://wiki.ubuntu.com/StableReleaseUpdates

Also, is this fixed in Quantal?

Clint Byrum (clint-fewbar) wrote :

Daniel, I've assigned you the precise task since you are listed on the upload's changelog. The description of this bug is not sufficient for us to review the package. Please edit the description and document the testing procedures necessary to verify this bug with "Test Case" and discuss any areas the code affects that might need to be tested in "Regression Potential".

Changed in boinc (Ubuntu Precise):
assignee: nobody → Daniel Hahler (blueyed)
Daniel Hahler (blueyed) wrote :

1. It would have helped if you had linked to "the upload" (https://launchpad.net/ubuntu/precise/+queue?queue_state=1&queue_text=boinc)
2. The main SRU bug is bug 1009536, which has been edited by Dave to fit the SRU requirements.
3. I have only looked at other bugs this update would fix; where feedback from PPA users indicated a fix, I added the bug to the changelog.
4. This is fixed in Quantal, which already has 7.0.27.

If an updated description is still required, I will have to ask LeForgeron to add it.

Given the slowness of this particular SRU process I could imagine that it would be better to ask/wait for 7.0.32+ anyway.

LeForgeron (jgrimbert) wrote :

Impact: when opened (Oneiric), the bug was/is as described: a long-running boinc system eventually fails to compute any more work units, because every computed work unit leaks one file descriptor in the main boinc daemon (regardless of the kind of project subscribed). The faster the work units are processed, and the lower the system's file descriptor limit, the sooner the bug strikes (usually within one week of uninterrupted uptime, but it might take months on slower systems).
This bug affects all users from Oneiric (boinc 6.12.33+dfsg-1.1ubuntu0.1) up to, but not including, 7.0.27 (that one is fine).
(Well, 7.0.23, 7.0.24 & 7.0.25 have another issue: computation errors, but no more leak; 7.0.26 was not tested.)

Test case: easy, but very long: run boinc for at least one complete work unit (depending on the project, a unit can take from 5 minutes to many hours), then run "lsof" on the boinc daemon and check the end of the listing. After more units have been processed, the list reported by "lsof" should not be longer than before. Computation of each unit must succeed.

Regression Potential: I do not know what the change is, so I cannot discuss its impact. But boinc must be able to run unattended for months without such a problem, and without a reboot, especially on an LTS.

description: updated
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello LeForgeron, or anyone else affected,

Accepted boinc into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/boinc/7.0.27+dfsg-5ubuntu0.12.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in boinc (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed
tags: added: verification-done
removed: verification-needed
LeForgeron (jgrimbert) wrote :

Greetings, happy people,

I tried to "downgrade" from the costamagnagianfranco PPA to precise-proposed by forcing the version, but libboinc is not available under that tag. I wonder if it matters.

Installation in progress.

LeForgeron (jgrimbert) wrote :

Hello again

Verification done with:
 boinc 7.0.27+dfsg-5ubuntu0.12.04.1 (precise-proposed)
 boinc-manager ditto
 boinc-client ditto
 libboinc 7.0.31+dfsg-0-875-precise1 (now) (only available in that version)

It's ok.

Gianfranco Costamagna (costamagnagianfranco) wrote :

Hi LeForgeron, libboinc was introduced after the .27 release for various reasons, so it is normal that the downgrade shows libboinc as missing. It will (I think) be introduced in the next update.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package boinc - 7.0.27+dfsg-5ubuntu0.12.04.1

---------------
boinc (7.0.27+dfsg-5ubuntu0.12.04.1) precise-proposed; urgency=low

  * Stable Release Update to fix multiple bugs:
    - fixes regression with Seti, Docking@Home and maybe others
      (LP: #1009536)
    - fixes crash when adding projects (LP: #1001168)
    - fixes file descriptor leak (LP: #968021)
  * debian/control: enable building of boinc-amd-opencl again (disabled in
    7.0.27+dfsg-4 for Debian specific reasons)

boinc (7.0.27+dfsg-5) unstable; urgency=low

  * debian/rules: use dpkg-buildflags to get default flags.
  * debian/control: build-depend on dpkg-dev >= 1.16.1.1, because
    buildflags.mk and DEB_BUILD_MAINT_OPTIONS are needed.
  * debian/patches/add_hardening_flags.patch: pass CXXFLAGS, CPPFLAGS and
    LDFLAGS to hand-written Makefiles in samples/.

boinc (7.0.27+dfsg-4) unstable; urgency=low

  [ Steffen Moeller ]
  * Not building boinc-amd-opencl, for fglrx-driver is not in testing.
  * Adding hardening flags.
  * Addressed FTBFS on hurd (Closes: #672725)
  * debian/boinc-client.init: increasing priority of boinc client, change nice
    value from 19 to 10.

  [ Guo Yixuan ]
  * Add patch from Rom that fix the project category crash. (Closes: #641593)
  * debian/rules: fix to set sched/transitioner_catchup.php in
    boinc-server-maker executable.
  * Cherry-picking above commits from master to sid branch.
  * Add myself to uploaders.

boinc (7.0.27+dfsg-3) unstable; urgency=low

  * Now truly reconstituting compatibility with SETI
    (Closes: #672328, lp: #991179).
  * builds with gcc 4.7 (Closes: #671999)
    Thanks to Matthias Klose for his patch, had sadly only seen it after
    I patched it myself :o/ The NMU of his should not kick in because
    of the SETI incompatibility.

boinc (7.0.27+dfsg-2) unstable; urgency=low

  * Removed a couple of patches to help preventing crash in some
    scientific apps

boinc (7.0.27+dfsg-1) unstable; urgency=low

  * New upstream release
    - NVidia issue settled, presumably.
    - Some patches adopted by upstream.

boinc (7.0.26+dfsg-1) unstable; urgency=low

  * New upstream release
    - Comes with NVidia patches
    - Other incompatibilities with various projects fixed

boinc (7.0.25+dfsg-2) UNRELEASED; urgency=low

  * Added patch by Bernd Machenschalk to improve the uniqueness
    of clients when multiple instances are run.
  * Helping man page of stripchart.

boinc (7.0.25+dfsg-1) UNRELEASED; urgency=low

  [ Steffen ]
  * New upstream version (not yet with NVidia API improvements).

  [ Thorsten ]
  * Helping warnings in init scripts (Closes: #651278,#651303).
  * Fixing Python dir problem for boinc-server-maker (Closes: #657036)
  * Fixed warning for man page to stripchart.

boinc (7.0.24+dfsg-4) UNRELEASED; urgency=low

  * Improved support of NVidia cards - API incompatibility

boinc (7.0.24+dfsg-3) UNRELEASED; urgency=low

  * Support of clang through DEB_BUILD_OPTIONS

boinc (7.0.24+dfsg-2) UNRELEASED; urgency=low

  * More adjustments of patches.
 -- Daniel Hahler <email address hidden> Thu, 28 Jun 2012 01:17:04 +0200

Changed in boinc (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in boinc (Ubuntu):
status: Triaged → Fix Released
Steffen Möller (moeller-debian) wrote :

The original md5 error reported in this thread I have just seen again with 7.0.34. The error seems to be associated with tests for the amount of disk space. I could well imagine some concurrency issue playing a role, since both the original reporter and I seem to be running many cores in parallel (12 or 24 here). Rosetta is best at causing the issue over here. Steffen

LeForgeron (jgrimbert) wrote :

Now running 7.0.34; it does not seem to leak file descriptors so far (already 10 tasks done).
A collision on disk full (well, allocated space) is probably another issue (best tracked in its own bug report?)

7.0.33 was fine too.

My check is :
 $ sudo lsof -p `pidof boinc`

The last file descriptor does not increase as tasks are done (it can rise a bit while sending back the result, but later it returns to something consistently low, like 20 or 30).

As an illustration:

boinc 8538 boinc 0u CHR 1,3 0t0 1029 /dev/null
boinc 8538 boinc 1w REG 8,118 1967980 23 /var/lib/boinc-client/stdoutdae.txt
boinc 8538 boinc 2w REG 8,118 19666 22 /var/lib/boinc-client/stderrdae.txt
boinc 8538 boinc 3wW REG 8,118 0 19 /var/lib/boinc-client/lockfile
boinc 8538 boinc 4w REG 8,118 176550 24 /var/lib/boinc-client/time_stats_log
boinc 8538 boinc 5u CHR 195,255 0t0 38091 /dev/nvidiactl
boinc 8538 boinc 6u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 7u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 8u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 9u IPv4 95350 0t0 TCP *:31416 (LISTEN)
boinc 8538 boinc 10u IPv4 92260 0t0 TCP localhost:31416->localhost:59555 (ESTABLISHED)
boinc 8538 boinc 11u CHR 195,255 0t0 38091 /dev/nvidiactl
boinc 8538 boinc 12u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 13u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 14u CHR 195,0 0t0 38092 /dev/nvidia0
boinc 8538 boinc 15r REG 0,15 4224 36018 /run/utmp
boinc 8538 boinc 17u IPv4 89730 0t0 TCP krynn.local:60990->pd2.cs.nctu.edu.tw:http (CLOSE_WAIT)
boinc 8538 boinc 19r REG 0,3 0 4026532036 /proc/interrupts
boinc 8538 boinc 20u IPv4 91746 0t0 TCP krynn.local:60993->pd2.cs.nctu.edu.tw:http (CLOSE_WAIT)
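
The "last file descriptor" check can also be read off mechanically. The helper below is a sketch with a made-up name; it assumes the default lsof column layout shown above, where the FD column is the 4th field.

```shell
# Read `lsof` output on stdin and print the highest numeric descriptor in the
# FD column (4th field). Non-numeric entries such as cwd, txt or mem are
# skipped; prints -1 if no numeric descriptor is found.
highest_fd() {
    awk 'BEGIN { max = -1 }
         $4 ~ /^[0-9]+/ { fd = $4 + 0; if (fd > max) max = fd }
         END { print max }'
}

# Example use (commented out; needs a running boinc daemon):
#   sudo lsof -p "$(pidof boinc)" | highest_fd
```

On the listing above this gives 20; in the excerpt from the leaking Oneiric daemon in the original description, the same column had reached 636.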
