named crashes on REQUIRE((disp->attributes assert

Bug #1833400 reported by Heikki Hannikainen
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
bind9 (Ubuntu)   Fix Released   Undecided    Unassigned
  Xenial         Fix Released   High         Unassigned

Bug Description

[Impact]

 * A race in the handling of the dispatcher can trigger a crash.
   The cause is an assertion on a condition that can actually occur
   (rarely, but it can).

 * The fix is very small and essentially converts the assert into an early
   return. To quote the added comment:
     If the attribute DNS_DISPATCHATTR_NOLISTEN is not set, then
     the dispatch is already handling a recv; return immediately.

[Test Case]

 * That is the hardest part of this SRU: this is a race, and neither in the
   upstream bug [1] nor here has anyone been able to come up with clear
   repro steps. I'm afraid we might just have to review the code and
   perhaps keep it in proposed for some extra time?

[Regression Potential]

 * The change is minimal and has been upstream (as well as in newer Ubuntu
   releases) for quite some time now, so I'm confident it isn't entirely
   broken.
   The old code prevented an odd condition from proceeding, and the new
   code still does; only instead of an aborting assert it is now an early
   return.
   The regressions I could think of are only theoretical - like someone
   having a test for this and now wondering why it works - not really an
   issue. The only real issue I can think of is that the early return
   triggers a bug further along the return path, e.g. because a caller
   can't handle the returned null properly. But TBH that would replace
   one crash (the current one) with another one, so it isn't that bad.

[Other Info]

 * This crash isn't very frequent, at least according to the crash DB [2]
   (others are :-/), but at least this one has a clearly outlined solution.

[1]: https://bugs.isc.org/Public/Bug/Display.html?id=43822
[2]: https://errors.ubuntu.com/?release=Ubuntu%2016.04&package=bind9&from=2016-01-01&to=2019-07-31

---

Ubuntu xenial 16.04, bind9 1:9.10.3.dfsg.P4-8ubuntu1.14

Yesterday the named process started crashing frequently: 49 crashes so far on 49 different servers around the world (one crash each!). We did run OS upgrades yesterday, but the bind9 packages were not updated at that time. This particular bind9 package version was mostly rolled out last month. Given the sudden surge of crashes and their distribution, I suspect this might be triggered remotely by an incoming packet.

Backtrace from the assert:

2019-06-18T21:42:16.801421+00:00 hostname named[888]: general: critical: ../../../lib/dns/dispatch.c:3691: REQUIRE((disp->attributes & 0x00000020U) != 0) failed, back trace
2019-06-18T21:42:16.801890+00:00 hostname named[888]: general: critical: #0 0x555c41aeeaf0 in ??
2019-06-18T21:42:16.802118+00:00 hostname named[888]: general: critical: #1 0x7f475bd66eaa in ??
2019-06-18T21:42:16.802315+00:00 hostname named[888]: general: critical: #2 0x7f475ca9f7da in ??
2019-06-18T21:42:16.802496+00:00 hostname named[888]: general: critical: #3 0x555c41ae3195 in ??
2019-06-18T21:42:16.802684+00:00 hostname named[888]: general: critical: #4 0x7f475bd8b420 in ??
2019-06-18T21:42:16.802875+00:00 hostname named[888]: general: critical: #5 0x7f475b7346ba in ??
2019-06-18T21:42:16.803056+00:00 hostname named[888]: general: critical: #6 0x7f475ae7e41d in ??
2019-06-18T21:42:16.803245+00:00 hostname named[888]: general: critical: exiting (due to assertion failure)

Paride Legovini (paride) wrote :

Thanks for your report. Did named crash exactly once per server, without any further crashes after the service was restarted?

Do you have a list of the packages that got updated in the upgrade you performed before the crashes happened? Is there a difference in the upgraded package set between the servers where named crashed and those where it didn't?

At the moment I don't really have enough information to tell anything, but I'd try to understand whether one of the upgraded packages is something named depends on (e.g. a shared library), and whether a restart of the service should have been triggered by the upgrade.

I'm marking this report as Incomplete for now, which is our way to mark bugs for which we asked for more information. Once you have provided it, please change the status back to New, and we'll look at it again. Thank you!

Changed in bind9 (Ubuntu):
status: New → Incomplete
Douglas Hall (tescruni) wrote :

One of my Ubuntu DNS servers had a seemingly identical issue last week. Ubuntu 16.04.3 LTS, BIND 9.10.3-P4-Ubuntu <id:ebd72b3>, but no recent upgrades or changes were made on the server.

Jul 25 10:39:01 cassini named[1378]: ../../../lib/dns/dispatch.c:3691: REQUIRE((disp->attributes & 0x00000020U) != 0) failed, back trace
Jul 25 10:39:01 cassini named[1378]: #0 0x55c1c55f18b0 in ??
Jul 25 10:39:01 cassini named[1378]: #1 0x7fbd8fbc0e7a in ??
Jul 25 10:39:01 cassini named[1378]: #2 0x7fbd908f87aa in ??
Jul 25 10:39:01 cassini named[1378]: #3 0x55c1c55e5fd5 in ??
Jul 25 10:39:01 cassini named[1378]: #4 0x7fbd8fbe5360 in ??
Jul 25 10:39:01 cassini named[1378]: #5 0x7fbd8f58e6ba in ??
Jul 25 10:39:01 cassini named[1378]: #6 0x7fbd8ecd841d in ??
Jul 25 10:39:01 cassini named[1378]: exiting (due to assertion failure)

Christian Ehrhardt  (paelzer) wrote :

Odd, I've seen really (I mean really) old reports with the same signature [1]
The attribute [2] seems to be DNS_DISPATCHATTR_NOLISTEN - anything in that regard in your configs?
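The log shows the numeric mask 0x00000020U rather than a macro name, which is why the attribute has to be looked up. A likely explanation (an assumption about how the assertion macro is built, not something confirmed in this report) is that the condition is stringified only after the preprocessor has expanded it. The sketch below shows that C preprocessor behaviour with illustrative macro names; ATTR_NOLISTEN, STRINGIFY_RAW and STRINGIFY_EXPANDED are hypothetical and not taken from the BIND sources.

```c
#include <string.h>

/* Value as printed in the assertion message; the macro names in this
 * sketch are illustrative and not taken from the BIND sources. */
#define ATTR_NOLISTEN 0x00000020U

/* '#' stringifies its operand exactly as written, without expanding it. */
#define STRINGIFY_RAW(cond) #cond
/* Passing through one more macro level expands the argument first. */
#define STRINGIFY_EXPANDED(cond) STRINGIFY_RAW(cond)

/* Text of the condition with the macro name left intact. */
const char *raw_text(void)
{
    return STRINGIFY_RAW((disp->attributes & ATTR_NOLISTEN) != 0);
}

/* Text of the condition after macro expansion. */
const char *expanded_text(void)
{
    return STRINGIFY_EXPANDED((disp->attributes & ATTR_NOLISTEN) != 0);
}

/* Returns 1 if the expanded text equals the condition shown in the log. */
int matches_log_text(void)
{
    return strcmp(expanded_text(),
                  "(disp->attributes & 0x00000020U) != 0") == 0;
}
```

With the extra macro level the name is already replaced by its value when the string is built, so a log line produced this way only carries the number.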

Code seems to be this:
3685 void
3686 dns_dispatch_importrecv(dns_dispatch_t *disp, isc_event_t *event) {
3687     void *buf;
3688     isc_socketevent_t *sevent, *newsevent;
3689
3690     REQUIRE(VALID_DISPATCH(disp));
3691     REQUIRE((disp->attributes & DNS_DISPATCHATTR_NOLISTEN) != 0);
3692     REQUIRE(event != NULL);

18.04 is quite different and less fatal for the same condition:
3718 void
3719 dns_dispatch_importrecv(dns_dispatch_t *disp, isc_event_t *event) {
3720     void *buf;
3721     isc_socketevent_t *sevent, *newsevent;
3722
3723     REQUIRE(VALID_DISPATCH(disp));
3724     REQUIRE(event != NULL);
3725
3726     if ((disp->attributes & DNS_DISPATCHATTR_NOLISTEN) == 0)
3727         return;
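
The difference between the two excerpts can be modelled in a few lines. This is a simplified sketch, not BIND code: struct dispatch_model, MODEL_ATTR_NOLISTEN and both function names are hypothetical stand-ins, assert() takes the place of REQUIRE(), and the second function returns a bool (the real function returns void) only so the outcome can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit value as printed in the assertion message; everything else here is a
 * simplified stand-in, not BIND's real data structures. */
#define MODEL_ATTR_NOLISTEN 0x00000020U

struct dispatch_model {
    unsigned int attributes;
    unsigned int imported;   /* number of events accepted so far */
};

/* Xenial-style behaviour: a cleared bit is treated as a programming error.
 * assert() stands in for REQUIRE() and aborts the process. */
void importrecv_assert(struct dispatch_model *disp)
{
    assert((disp->attributes & MODEL_ATTR_NOLISTEN) != 0);
    disp->imported++;
}

/* Bionic-style behaviour: a cleared bit means there is nothing to do. */
bool importrecv_early_return(struct dispatch_model *disp)
{
    if ((disp->attributes & MODEL_ATTR_NOLISTEN) == 0)
        return false;   /* already handling a recv; leave the event alone */
    disp->imported++;
    return true;
}
```

Calling importrecv_assert() on a model whose attributes lack the bit aborts the whole process, which corresponds to the "exiting (due to assertion failure)" line in the logs. importrecv_early_return() leaves the model untouched in that case and lets the caller carry on.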

Thanks to git I was able to find changes [3][4] which seem to fix this issue.
This led to the issue [5], which I can't read due to permissions in their bug system :-/

Since the expected fix is in Bionic (since 9.10.6 [6], to be specific), I'll mark only Xenial as affected for now.

I had hoped that the bug might have instructions to recreate the issue.

@Douglas / Heikki - do you have any means to trigger this bug, so that we could verify a potential fix backport via a PPA?

@Douglas / Heikki - it mentions that this is a race on shutdown, was your server restarted around that time?

[1]: https://sourceforge.net/p/bind-dlz/mailman/message/6537634/
[2]: http://users.isc.org/~each/doxygen/bind9/dispatch_8h.html#73469a6ec10db29033bb0da2d8acb31c
[3]: https://gitlab.isc.org/isc-projects/bind9/commit/019132b70c368bc9abca0034d07b324bb7cb6eb2
[4]: https://gitlab.isc.org/isc-projects/bind9/commit/a94d68ce432b9e11c4ae91d48ee257b1675f86d7
[5]: https://bugs.isc.org/Public/Bug/Display.html?id=43822
[6]: https://abi-laboratory.pro/?view=changelog&l=bind&v=9.10.6

Changed in bind9 (Ubuntu):
status: Incomplete → Fix Released
Changed in bind9 (Ubuntu Xenial):
status: New → Incomplete
Christian Ehrhardt  (paelzer) wrote :

I have asked around but so far found no one with access to the bug (I was hoping for some repro steps there). I'll create a PPA for testing, but without reasonably clear repro steps it will be hard to process the SRU [1] for an update.

Hence I'm setting this to Incomplete until more information on how to verify/trigger this comes up - which I hope the reporter or other affected people might contribute here.

And since all we have from the original description (until we have further info) is "race on shutdown", I think the severity of this is rather low.

If anyone has a way to trigger this and wants to try it, the PPA [2] should contain what is assumed to be the fix.

To make sure anyone can continue on this (with all the changes available), I opened an MP [3], which allows others to use the branch to pick this up.

[1]: https://wiki.ubuntu.com/StableReleaseUpdates
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1833400-bind-crash
[3]: https://code.launchpad.net/~paelzer/ubuntu/+source/bind9/+git/bind9/+merge/370942

Changed in bind9 (Ubuntu Xenial):
importance: Undecided → Low
Heikki Hannikainen (hessu) wrote :

Hello, as I wrote in my original ticket description above, bind9 packages were not updated at this time; the crash did *not* happen on shutdown.

However, the named process on each server crashed exactly once. There were 0 crashes in July or August. So, I'm afraid, I can't reproduce this.

Christian Ehrhardt  (paelzer) wrote :

Thanks Heikki,
the upstream bug is now public as well; no repro steps there either.

But I have rethought this - while this is a case without clear repro steps, I think it is a bug that we can fix to help users. The severity rating of the upstream bug (now that we can see it) also points that way.
On the good side, the change is really small and therefore reviewable - so I think we can and should go on with the SRU for this.

Christian Ehrhardt  (paelzer) wrote :

Andreas also reviewed the MP and I'll prep the SRU here now.

@Heikki - I realized the shutdown that was mentioned is not the whole bind9, but a dispatcher. So it can happen on "lesser" restarts in the bind9 lifecycle.

Changed in bind9 (Ubuntu Xenial):
importance: Low → High
description: updated
description: updated
Changed in bind9 (Ubuntu Xenial):
status: Incomplete → Triaged
Christian Ehrhardt  (paelzer) wrote :

Uploaded to -unapproved for SRU Team review

Brian Murray (brian-murray) wrote : Please test proposed package

Hello Heikki, or anyone else affected,

Accepted bind9 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.10.3.dfsg.P4-8ubuntu1.15 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in bind9 (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-xenial
Christian Ehrhardt  (paelzer) wrote :

I tested the package on Xenial in general terms, since, as outlined in the SRU template, we have no explicit steps to trigger the crash. It worked fine for me through an upgrade and with some simple name resolutions using the bind9 tools like bind9-host through the local named.

Setting verified, but given that we lack an explicit test I'd not mind if we keep this in -proposed a bit longer than usual before releasing it, just to give any problems an extra chance to be spotted.

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9 - 1:9.10.3.dfsg.P4-8ubuntu1.15

---------------
bind9 (1:9.10.3.dfsg.P4-8ubuntu1.15) xenial; urgency=medium

  * d/p/ubuntu//lp-1833400*: fix race on shutdown (LP: #1833400)
  * d/p/fix-shutdown-race.diff: dig/host/nslookup could crash when interrupted
    close to a query timeout (LP: #1797926)

 -- Christian Ehrhardt <email address hidden> Mon, 05 Aug 2019 07:30:49 +0200

Changed in bind9 (Ubuntu Xenial):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for bind9 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
