autofs5-ldap regularly crashes after long periods of time

Bug #1332203 reported by Paul Kilgo
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
autofs5 (Ubuntu)
New
Medium
Unassigned

Bug Description

Over the past few months we noticed our web server would no longer serve requests out of directories normally mounted by autofs, which is configured to be backed by LDAP. We knew a reboot would fix the problem temporarily but it would recur after several weeks. We just discovered the problem in the logs:

Jun 19 04:59:07 www1 kernel: [732120.739266] do_general_protection: 9 callbacks suppressed
Jun 19 04:59:07 www1 kernel: [732120.739272] automount[5810] general protection ip:7f9816a663b1 sp:7f9809fab760 error:0 in libc-2.15.so[7f9816a1b000+1b5000]
Jun 19 04:59:10 www1 kernel: [732123.235552] init: autofs main process (931) killed by SEGV signal
Jun 19 04:59:10 www1 kernel: [732123.235602] init: autofs main process ended, respawning

After the crash, autofs does not respond to restarts via the init scripts. It may after cleaning up the appropriate pid files and such, but I'm not sure what those are.

I examined the .crash file and from a backtrace I assume it must be something wrong with the LDAP portion of autofs:

$> apport-retrace -g automount.crash
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
...<snip>...
Reading symbols from /usr/sbin/automount...(no debugging symbols found)...done.
[New LWP 5810]
...<snip>...
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/automount -DNFS=fstype=nfs4,rsize=32768,wsize=32768 -DNFSv3=fstype=nf'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f9816a663b1 in _IO_vfprintf_internal (s=<optimized out>, format=<optimized out>, ap=<optimized out>) at vfprintf.c:1630
1630 vfprintf.c: No such file or directory.
(gdb) bt
#0 0x00007f9816a663b1 in _IO_vfprintf_internal (s=<optimized out>, format=<optimized out>, ap=<optimized out>) at vfprintf.c:1630
#1 0x00007f9816a691a4 in buffered_vfprintf (s=0x7f9816dd5180, format=<optimized out>, args=<optimized out>) at vfprintf.c:2313
#2 0x00007f9816a63bde in _IO_vfprintf_internal (s=0x7f9816dd5180, format=0x7f9815af9ab8 "%s: lookup(ldap): query failed for search dn %s: %s", ap=0x7f9809fae5b8)
    at vfprintf.c:1316
#3 0x00007f9816b24b08 in ___vfprintf_chk (fp=0x7f9816dd5180, flag=1, format=<optimized out>, ap=<optimized out>) at vfprintf_chk.c:35
#4 0x00007f9815af0996 in log_error () from /usr/lib/autofs/lookup_ldap.so
#5 0x00007f9815ae6276 in ?? () from /usr/lib/autofs/lookup_ldap.so
#6 0x00007f9815ae7141 in ?? () from /usr/lib/autofs/lookup_ldap.so
#7 0x00007f9815ae73fe in ?? () from /usr/lib/autofs/lookup_ldap.so
#8 0x00007f9815ae77eb in ?? () from /usr/lib/autofs/lookup_ldap.so
#9 0x00007f9815aea641 in lookup_mount () from /usr/lib/autofs/lookup_ldap.so
#10 0x00007f98177935f5 in lookup_nss_mount ()
#11 0x00007f9817789c8f in ?? ()
#12 0x00007f981713ee9a in start_thread (arg=0x7f9809fb2700) at pthread_create.c:308
#13 0x00007f9816b0f3fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

I've attached the .crash file. I'll note that I added the "Package:" line myself to be able to use it with apport-retrace.

I'm still not sure what reproduces the bug if it's something in our LDAP database or not. I was hoping maybe it's some small problem in log_error() that maybe was meaning to tell me something that's wrong with LDAP.

Versions:

Ubuntu: 12.04
autofs5: 5.0.6-0ubuntu5.1
autofs5-ldap: 5.0.6-0ubuntu5.1

Revision history for this message
Paul Kilgo (paulkilgo) wrote :
Revision history for this message
Paul Kilgo (paulkilgo) wrote :

This started happening almost daily so we are looking for a fix.

I'm starting to wonder if this is actually a duplicate of Bug #593603

In any case I'm working on figuring out what in our site's LDAP database is causing the crash, if anything.

Paul Kilgo (paulkilgo)
information type: Public → Private Security
information type: Private Security → Public
Robie Basak (racb)
Changed in autofs5 (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Paul Kilgo (paulkilgo) wrote :

Just to make sure all the information is together, I noticed earlier there was a piece of a log message in that stack trace I posted. I tried to see what I could do to piece together part of it:

  "%s: lookup(ldap): query failed for search dn %s: %s"

became

  "Invalid DN syntax: lookup(ldap): query failed for search dn %s: %s"

I was unable to get the other two substitutions since it's difficult to pull information out of the va_list. I have a suspicion that one of them may be an empty string or NULL. That may all be site-specific problems, but I grepped through the autofs5 code and saw that message corresponds to line 396 of `modules/lookup_ldap.c`:

 395 error(logopt,
 396 MODPREFIX "query failed for search dn %s: %s",
 397 this->basedn, ldap_err2string(rv));

which is the error handling side of an ldap_search_s call:

 374 rv = ldap_search_s(ldap, this->basedn,
 375 scope, query, attrs, 0, &result);

I don't have the debug symbols for lookup_ldap.so so I'm unable to go much further than this.

In the meantime we are also running automount through strace (per bug Bug #593603 but with `strace -f`) to see if we can figure out what that log message is.

Revision history for this message
Paul Kilgo (paulkilgo) wrote :

This problem has continued over the past few months. We slogged through the strace output a few times and just today found this:

[pid 935] sendto(11, "<27>Sep 22 13:06:30 automount[92"..., 90, MSG_NOSIGNAL, NULL, 0) = -1 ECONNREFUSED (Connection refused)
[pid 935] close(11) = 0
[pid 935] socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
[pid 935] connect(4, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 ECONNREFUSED (Connection refused)
[pid 935] close(4) = 0
[pid 935] ioctl(3, 0xc0189379, 0x7f22fecabb20) = 0
[pid 935] ioctl(3, 0xc0189375, 0x7f22fecabb20) = 0
[pid 935] sched_yield() = 0
[pid 935] umount("/home", 0) = -1 EBUSY (Device or resource busy)
[pid 935] nanosleep({0, 200000000}, NULL) = 0
[pid 935] umount("/home", 0) = -1 EBUSY (Device or resource busy)
[pid 935] nanosleep({0, 200000000}, NULL) = 0

Our syslog sends some events to another loghost and the top call to sendto(), which appears to fail, and automount tries to go down gracefully. Checking the .crash file again, the backtrace is identical. The backtrace does seem to support this story since there is a call to vsyslog() inside of log_error().

So this bug may be filed under the wrong package, depending on whose responsibility it is to handle this kind of error. Reading the man page for vsyslog(), I don't see any way the caller can catch an error so this could in fact be a problem with it.

We are using rsyslog 5.8.6-1ubuntu8.9 currently.

Revision history for this message
Paul Kilgo (paulkilgo) wrote :

Just a note that Bug #1176020 is also describing a side effect of this same problem (same web server in question here).

We are disabling our remote loghost and seeing if that theory holds.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.