Segfaults in nsUrlClassifierDBService.cpp when homedir is inaccessible

Bug #585061 reported by Daniel Richard G.
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mozilla Firefox
Fix Released
Medium
firefox (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

Binary package hint: firefox

This concerns firefox 3.6.3+nobinonly-0ubuntu4 in Ubuntu Lucid.

I use Firefox on a system in which home directories are served from an AFS file server. When a user's Kerberos authentication expires, access to the entire home directory is cut off, and all attempts to read or write to it fail with EACCES.

This occurs regularly on my workstation: I stay logged in for weeks at a time, with a more-or-less permanent instance of Firefox running, and every night after I leave work my authentication expires and Firefox sits for a few hours without any access to its configuration files under ~/.mozilla/. When I return in the morning, I renew my authentication, and my home directory becomes accessible again, but by this point Firefox has usually vanished without a trace.

I debugged this problem using a local build of the Firefox deb source, and have found where the segfaults are occurring: All of them appear to be in nsUrlClassifierDBService.cpp. The code assumes in a couple of places that mConnection is non-NULL, but every time the browser crashes on me, that assumption did not hold.

I am attaching a preliminary patch that, so far, has eliminated the crashes. I now find Firefox still hanging around every morning, with a number of "A script has become unresponsive..." dialogs that are easily cleared away.

Revision history for this message
Daniel Richard G. (skunk) wrote :

This patch can be added directly to firefox-3.6.3+nobinonly/debian/patches/ for testing purposes.

Revision history for this message
David Stansby (dstansby-deactivatedaccount) wrote :

In future, please could you use the patch tag instead of putting patch in the title.

tags: added: patch
summary: - [PATCH] Segfaults in nsUrlClassifierDBService.cpp when homedir is
- inaccessible
+ Segfaults in nsUrlClassifierDBService.cpp when homedir is inaccessible
Revision history for this message
In , Daniel Richard G. (skunk) wrote :

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.7) Gecko/20100716 Ubuntu/10.04 (lucid) Firefox/3.6.7
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.7) Gecko/20100716 Ubuntu/10.04 (lucid) Firefox/3.6.7

This issue was originally reported in Ubuntu's Launchpad tracker. I am filing it here since it reaches fairly deep into the code, and is likely not an artifact of Ubuntu's modifications.

    https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/585061

I use Firefox on a system in which home directories are served from an AFS file server. When a user's Kerberos authentication expires, access to the entire home directory is cut off, and all attempts to read or write to it fail with EACCES.

This occurs regularly on my employee workstation: I stay logged in for weeks at a time, with a more-or-less permanent instance of Firefox running, and every night after I leave work my authentication expires and Firefox sits for a few hours without any access to its configuration files under ~/.mozilla/. When I return in the morning, I renew my authentication, and my home directory becomes accessible again, but by this point Firefox has usually vanished without a trace.

I debugged this problem using a local build of (Ubuntu-patched) Firefox, and found where the segfaults were occurring: nsUrlClassifierDBService.cpp. The code assumes in a couple of places that mConnection is non-NULL, but every time the browser crashed on me, that assumption had been broken.

I am attaching a preliminary patch that, for the past few months, has pretty conclusively eliminated the crashes for me. Firefox now always makes it till the morning when I get in, sometimes with a couple "A script has become unresponsive..." dialogs, but otherwise ready to go once I renew my authentication.

This patch isn't meant to be committed as-is, but it should point out the problems with the current code.

Reproducible: Sometimes

Steps to Reproduce:
1. Make heavy use of Firefox (lots of open tabs, lots of background JS running, etc.)

2. Cut off access to $HOME for a few hours, cold-turkey

3. Wait for segfault

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Created attachment 459993
Preliminary patch

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

Over to necko for lack of a better option... Benjamin, dcamp, do you know whether anyone maintains URLClassifier at this point?

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

Jesse says maybe ddahl is planning to look at this code? This could be a nice limited place to start. ;)

Changed in firefox:
status: Unknown → Confirmed
Revision history for this message
In , Benjamin Smedberg (Mozilla) [:bs] (benjamin-smedbergs) wrote :

dwitte and sdwilsh have both proposed rewriting it, although they are both busy folk. I think the original component was correct, though.

Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

I can probably review this, but it's going to be at least a week...

Revision history for this message
Micah Gersten (micahg) wrote :

Thank you for reporting this to Ubuntu. I'm going to mark this Triaged since this has been upstreamed already. Thank you for the patch. Please report any other issues you may find.

Changed in firefox (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Nigel Babu (nigelbabu)
tags: added: patch-forwarded-upstream
removed: patch
Revision history for this message
In , Timeless-bemail (timeless-bemail) wrote :

daniel: for the future, please try to use hg diff with --git -p, see https://developer.mozilla.org/en/Mercurial_FAQ

and don't add // foo after }s

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Timeless, as I said, this patch is not intended to be committed as-is. It only serves to indicate the specific locations where the segfaults are occurring (in addition to providing a simple workaround that proves the point).

Why did you assign this bug to me? I am not a Mozilla developer.

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

I think timeless was under the impression that the patch was something to aim to check in (in which case the patch author would normally be the assignee; anyone who posts a patch is sorta a "Mozilla developer" if you think about it... ;) ).

Assigning back to default.

Revision history for this message
In , daviddahl (ddahl) wrote :

(In reply to comment #5)
> I can probably review this, but it's going to be at least a week...

sdwilsh: If you can review that would be awesome, we are in heavy blockerland with DevTools features right now.

Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

does this mean I don't need to review this?

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

Well, we need someone to fix the bug. If the attached patch is acceptable, then we should check it in. If not, why not? We should at least get _that_ info in the bug so someone can pick it up as needed.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

There may be a deeper problem going on ("why is mConnection NULL?", "why are these methods being called when mConnection is NULL?"), but for now, just sidestepping the NULL-dereferences would be great.

I would just eyeball the patch, and make appropriate changes manually. It's pretty trivial.

Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

(In reply to comment #12)
> There may be a deeper problem going on ("why is mConnection NULL?", "why are
> these methods being called when mConnection is NULL?"), but for now, just
> sidestepping the NULL-dereferences would be great.
For what it's worth, I think we need to figure out those questions first. Blindly wallpapering over the issue is not the way to go. Clearly some invariants held by the code are being broken here, and it could mean that other things are broken as a result.

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

> why is mConnection NULL?

Probably because nsUrlClassifierDBServiceWorker::OpenDb failed?

> why are these methods being called when mConnection is NULL?

This is a good question. BeginUpdate propagates the error out, but is someone eating it later on or something? nsUrlClassifierStreamUpdater::DownloadUpdates seems to check the return value...

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :

On the other hand, nsUrlClassifierDBService::BeginUpdate uses an async proxy. Wouldn't that lose the error?

Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

Ugh, yes.

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

I don't think anyone is against tracking down the "real" problem in this bug. But given that everyone here has plenty on their plates, and that debugging this issue appears non-trivial, putting in a workaround till a fix is found would ease the practical impact of this bug.

(You can have it print a GLib warning or somesuch to the terminal so that devs don't forget it's there. Not like it'll come up terribly often, anyway.)

For my part, my patched Firefox has worked very well---and my use case is pretty punishing. I still get some long-running-instance wonkiness at times, and maybe my changes don't help with that. But compared to a straight-up SIGSEGV, I'll take the wonking any day.

Changed in firefox:
importance: Unknown → Medium
Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

Comment on attachment 459993
Preliminary patch

We should just take this as a stopgap. Not sure when someone will have cycles to look into how we could ever get into this state.

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :
Changed in firefox:
status: Confirmed → Fix Released
Micah Gersten (micahg)
Changed in firefox:
milestone: none → 4.0b8
Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Can the patch be pushed to 3.6.x? (Just noticed it's not in 3.6.13.)

Revision history for this message
In , Dveditz (dveditz) wrote :

Comment on attachment 459993
Preliminary patch

Approved for 1.9.2.14, a=dveditz for release-drivers

Revision history for this message
In , Bzbarsky (bzbarsky) wrote :
Revision history for this message
In , Abillings (abillings) wrote :

From the context, there seems to be no real way for me, as QA, to verify this fix. I don't have the correct setup to reproduce the problem. Daniel, can you try using a nightly 1.9.2 firefox build from http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-1.9.2/ and seeing if the build fixes the issue?

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Al: I'm presuming the ideal is to reproduce a crash with 3.6.13, and observe the lack of one with 3.6.14pre. However, whichever of the two I install, the auto-update installs the new 3.6.14pre nightly---which may be affecting the behavior of the bug. How should I go about testing this?

Revision history for this message
In , Abillings (abillings) wrote :

The autoupdate shouldn't be triggered for many hours. If you download a clean 3.6.13 released build from the website and run it, it shouldn't autoupdate very soon. You can then download the current 3.6.14pre build and try it there.

The other option is to use the profilemanager to set up a different profile for this but you shouldn't need to do so.

Revision history for this message
In , daviddahl (ddahl) wrote :

(In reply to comment #24)
> Al: I'm presuming the ideal is to reproduce a crash with 3.6.13, and observe
> the lack of one with 3.6.14pre. However, whichever of the two I install, the
> auto-update installs the new 3.6.14pre nightly---which may be affecting the
> behavior of the bug. How should I go about testing this?

You can disable auto-update in about:config. set 'app.update.auto' to false

Revision history for this message
In , Daniel Richard G. (skunk) wrote :

Okay, I've been testing Namoroka 3.6.13 versus 3.6.14pre (post-dating this fix), and I'm afraid to say the results are less than conclusive.

* I get crashes with 3.6.13, although not as frequently as with Ubuntu's 3.6.13;

* I get crashes with 3.6.14pre, also not as frequently, though these are due to SIGBUS (seemingly stemming from Flash), which is a different failure mode from this bug;

* Ubuntu 3.6.13 plus my patch is the only build that has proven uncrashable so far;

* Because the Namoroka builds are optimized, there's not much I can do in the way of post-mortem to see what's going on.

All of these are with the same ~/.mozilla directory, and thus the same runtime configuration.

Wish I could report a before-vs.-after smoking gun, but the reality is more of a muddle.

Revision history for this message
In , Shawn Wilsher (sdwilsh) wrote :

(In reply to comment #27)
> * Because the Namoroka builds are optimized, there's not much I can do in the
> way of post-mortem to see what's going on.
We have symbols available, but I'm not really sure how to get them and use them on linux...

Revision history for this message
In , Ted Mielczarek (ted-mielczarek) wrote :
Revision history for this message
In , Daniel Richard G. (skunk) wrote :

It's not just symbol names (which were easily obtainable for the Ubuntu build), but also the stack trace. In debugging the original bug, I got to nsUrlClassifierDBServer only with a debug/unoptimized build; the optimized build + debug symbols yielded nothing.

Revision history for this message
Daniel Richard G. (skunk) wrote :

Laying this one to rest. Thank you Mozilla dev team.

Changed in firefox (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.