Mir

Mir clients (including Unity8 itself) crash in XGetXCBConnection() if multiple versions of mir-client-platform-mesa are installed.

Bug #1526658 reported by Andreas Pokorny
This bug affects 11 people
Affects        Status         Importance   Assigned to       Milestone
Mir            Fix Released   High         Daniel van Vugt
mir (Ubuntu)   Fix Released   High         Unassigned

Bug Description

The mir egl platform fails to load under certain circumstances producing the following stack trace:

#0 0x00007ffff3a996e7 in XGetXCBConnection () from /usr/lib/x86_64-linux-gnu/libX11-xcb.so.1
#1 0x00007ffff534dc74 in dri2_initialize_x11_dri2 (drv=<optimized out>, disp=0x6d1b90) at ../../../../src/egl/drivers/dri2/platform_x11.c:1268
#2 dri2_initialize_x11 (drv=<optimized out>, disp=0x6d1b90) at ../../../../src/egl/drivers/dri2/platform_x11.c:1357
#3 0x00007ffff5347adf in _eglMatchAndInitialize (dpy=0x6d1b90) at ../../../../src/egl/main/egldriver.c:261
#4 0x00007ffff5347b99 in _eglMatchDriver (dpy=dpy@entry=0x6d1b90, test_only=test_only@entry=0) at ../../../../src/egl/main/egldriver.c:292
#5 0x00007ffff5343b32 in eglInitialize (dpy=0x6d1b90, major=0x7fffffffdac8, minor=0x7fffffffdacc) at ../../../../src/egl/main/eglapi.c:482
#6 0x00007ffff686cc1f in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#7 0x00007ffff686cd7b in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#8 0x00007ffff686d420 in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#9 0x00007ffff59dd82c in mir::graphics::OverlappingOutputGrouping::for_each_group(std::function<void (mir::graphics::OverlappingOutputGroup const&)> const&) () from /usr/lib/x86_64-linux-gnu/libmirplatform.so.11
#10 0x00007ffff686de6b in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#11 0x00007ffff686e450 in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#12 0x00007ffff67fef3c in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#13 0x00007ffff67ff87d in ?? () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36
#14 0x00007ffff67fde31 in mir::DefaultServerConfiguration::the_display() () from /usr/lib/x86_64-linux-gnu/libmirserver.so.36

In this case a 0.17.1 mirclient9 was combined with a 0.18 mirserver36. mirserver36 is compatible with mirclient9, so that combination works; the significant difference lies in the client and server platforms. The native display used for EGL initialization is created by the server. Mesa then uses a function from the client platform to validate the native display. For a reason not yet known, the 0.17.1 libmirclient9 selected mesa.so.2 (from mir-client-platform-mesa2 0.14), so the validation failed.
It probably did so because that version was a leftover of the 0.14 release with a partially bumped ABI. So the remaining issue is that there is an ABI we have to care about between the native EGL display given to Mesa and the client platform that validates it.

description: updated
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

What parameter is being passed to XGetXCBConnection() ?

It's not clear from the man page whether NULL is allowed (like XOpenDisplay), or what the error handling of XGetXCBConnection is meant to do in the absence of an X display.

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

OK, part of the trouble seems to be caused by a different version of the client platform.
eglGetDisplay calls the display validation function in mesa.so.2 instead of mesa.so.3:
Breakpoint 1, 0x00007ffff0a22e60 in mir_client_mesa_egl_native_display_is_valid ()
   from /usr/lib/x86_64-linux-gnu/mir/client-platform/mesa.so.2

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

Why do we even load mesa.so.2? It has the wrong ABI stanza.

summary: - eglInitialize crashes on kvm qxl with mir
+ mir may use incompatible client platform to validate server display
description: updated
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: mir may use incompatible client platform to validate server display

I think many of us will avoid this bug just because we don't have old packages installed:

$ find /usr -name mesa.so.\?
/usr/lib/x86_64-linux-gnu/mir/client-platform/mesa.so.3

Yeah dlvsym seems to be telling us it has found the right function and that function is version 3 in your mesa.so.2.

Exactly what binary package/version does your mesa.so.2 come from?
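
A question like this can also be checked from inside the process by asking the dynamic linker which file a resolved symbol actually lives in. The following is a minimal sketch, assuming glibc on Linux; the helper names object_containing and object_providing are made up for illustration and are not part of Mir or Mesa.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Return the path of the shared object that contains `addr`, or NULL if
 * the dynamic linker cannot attribute the address to a loaded object.
 * The returned string is owned by the linker and is only valid while
 * that object stays loaded. */
static const char *object_containing(const void *addr)
{
    Dl_info info;

    if (addr == NULL || dladdr(addr, &info) == 0)
        return NULL;
    return info.dli_fname;
}

/* Resolve `symbol` through the global scope, the same way an unversioned
 * dlsym() on dlopen(NULL, ...) would, and report which file provided it. */
static const char *object_providing(const char *symbol)
{
    void *self;
    const char *path;

    assert(symbol != NULL);

    self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL)
        return NULL;

    path = object_containing(dlsym(self, symbol));
    dlclose(self);
    return path;
}
```

Calling object_providing("mir_client_mesa_egl_native_display_is_valid") in an affected client would be expected to name whichever module the lookup reaches first, which is the information the breakpoint above shows.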

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Oooh, I see the problem:

Mir does not care what the client-platform file names are. It loads and searches them all, including your mesa.so.2

While the Mir code internally correctly uses dlvsym to restrict the symbol versions it's willing to use, the Mesa code does not check the symbol versions:

+#ifdef HAVE_MIR_PLATFORM
+static EGLBoolean
+_mir_display_is_valid(EGLNativeDisplayType nativeDisplay)
+{
+ typedef int (*MirEGLNativeDisplayIsValidFunc)(MirMesaEGLNativeDisplay*);
+
+ void *lib;
+ MirEGLNativeDisplayIsValidFunc general_check;
+ MirEGLNativeDisplayIsValidFunc client_check;
+ MirEGLNativeDisplayIsValidFunc server_check;
+ EGLBoolean is_valid = EGL_FALSE;
+
+ lib = dlopen(NULL, RTLD_LAZY);
+ if (lib == NULL)
+ return EGL_FALSE;
+
+ general_check = (MirEGLNativeDisplayIsValidFunc) dlsym(lib, "mir_egl_mesa_display_is_valid");
+ client_check = (MirEGLNativeDisplayIsValidFunc) dlsym(lib, "mir_client_mesa_egl_native_display_is_valid");
+ server_check = (MirEGLNativeDisplayIsValidFunc) dlsym(lib, "mir_server_mesa_egl_native_display_is_valid");
+

Mesa will find the first one in your address space. And that may well be mesa.so.2 it finds before mesa.so.3.

So we either need to:
  * avoid loading old modules according to file name, or
  * enhance egl-platform-mir.patch to use dlvsym instead of dlsym, or
  * aggressively unload modules, ensuring they are not present in memory except while they're being probed.

I suspect the third option is best. I noticed a problem like that the other day anyway -- client processes seem to have all client modules loaded. We're leaking them I think.
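
To illustrate the second option, here is a small sketch of the difference between the two lookups, assuming glibc (dlvsym is a GNU extension). The wrapper names are hypothetical, and the version string a real caller would pass depends on the module's linker version script, which is not shown in this report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Unversioned lookup: returns the first definition of `name` that the
 * search order reaches, whatever symbol version it carries. */
static void *lookup_unversioned(void *handle, const char *name)
{
    assert(name != NULL);
    return dlsym(handle, name);
}

/* Versioned lookup: returns a definition of `name` only if it is tagged
 * with exactly the symbol version `version`; otherwise NULL. */
static void *lookup_versioned(void *handle, const char *name,
                              const char *version)
{
    assert(name != NULL && version != NULL);
    return dlvsym(handle, name, version);
}
```

With the versioned form, a definition tagged with a different version is skipped instead of being returned just because it was found first. It would not help if an old module exported the symbol under the same version tag as the new one.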

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

If we ensure mesa.so.2 is unloaded (since we should have correctly rejected it already) then Mesa won't accidentally use it via dlsym().
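
The probe-then-unload idea can be sketched roughly as follows. This is an illustration only, not Mir's actual module loader: probe_module and its return codes are invented here, and a real loader would keep the accepted module open instead of closing every one.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Load the module at `path` only long enough to see whether it exports
 * `required_symbol`, then drop our reference again.
 * Returns 1 if the symbol is present, 0 if it is not, and -1 if the
 * module could not be loaded at all. */
static int probe_module(const char *path, const char *required_symbol)
{
    void *handle;
    int found;

    assert(path != NULL && required_symbol != NULL);

    /* RTLD_LOCAL keeps the module's symbols out of the global scope
     * while it is being examined. */
    handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return -1;

    found = dlsym(handle, required_symbol) != NULL;

    /* Release our reference. The object is only unmapped if nothing
     * else still holds a reference to it. */
    dlclose(handle);
    return found;
}
```

Note that dlclose only drops one reference, so a module that takes an extra reference on itself while loading would stay resident regardless.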

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

It's possible we recently started leaking a handle to unused client modules somewhere in Mir. So Mesa will call the wrong one.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Confirmed. Mir clients incorrectly still have all of the client-platform modules loaded while running. So of course Mesa is likely to pick the wrong function when it uses dlsym and calls into Mir.

Changed in mir:
importance: Undecided → High
status: New → Triaged
milestone: none → 0.19.0
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

OK, so this is partly due to the (fixed in 0.18) behaviour of the mesa module.

Prior to 0.18 it did a "dlopen(..., RTLD_NOW | RTLD_NOLOAD | RTLD_GLOBAL);" *when loaded* (not when selected).

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Also, I think it would be better if mesa tried to detect the mir library in process (something like dlopen(libmirclient, ...RTLD_NOLOAD)), instead of looking for a global symbol.
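
A rough sketch of that kind of in-process detection, assuming glibc's RTLD_NOLOAD flag; the function name library_is_resident is hypothetical and the soname to test for is left to the caller:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Return 1 if a library with the given soname is already loaded in this
 * process, 0 otherwise. RTLD_NOLOAD asks dlopen() not to load anything
 * new: it yields a handle only when the library is already resident. */
static int library_is_resident(const char *soname)
{
    void *handle;

    assert(soname != NULL);

    handle = dlopen(soname, RTLD_LAZY | RTLD_NOLOAD);
    if (handle == NULL)
        return 0;

    /* A successful RTLD_NOLOAD call still takes a reference, so give it
     * back to avoid pinning the library. */
    dlclose(handle);
    return 1;
}
```

The dlclose matters here: leaving it out would keep an extra reference on the library for the rest of the process lifetime.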

Revision history for this message
Andreas Pokorny (andreas-pokorny) wrote :

Added an explanation of why such an old client platform was used. Yes, we should really make the Mesa EGL Mir platform better.

description: updated
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

AFAICT, we only need to fix a module leak. Then there's only one module left (the correct one) to call.

It's obviously wrong that a client has all the client modules loaded for its lifetime...

I think Alan's comment #10 is a reasonable suggestion, but flawed, because Mesa needs to load "mesa.so.3" to find the global symbol in question, not "libmirclient.so.9". That would create a dependency in Mesa on the specific Mir driver version "mesa.so.3", and Mesa itself would have to change every time we broke the Mir client module ABI there. It's possible, but rebuilding Mesa and getting it into the distro in time for a Mir release is a headache we don't need.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Leak now logged as bug 1527449. Fixing that, I suspect, will fix this one too.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Bug confirmed. Just install the mir-client-platform-mesa2 deb from:
https://launchpad.net/ubuntu/+source/mir/0.14.0+15.10.20150723.1-0ubuntu1

And if you're working in a source tree, copy the mesa.so.2 to your client modules directory. Crash.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Bad news: This bug may be unfixable for systems with mesa.so.2 (Mir 0.14) and earlier packages.

Because they all contain this leak:
extern "C" int __attribute__((constructor))
ensure_loaded_with_rtld_global_mesa_client()
{
    Dl_info info;

    // Cast dladdr itself to work around g++-4.8 warnings (LP: #1366134)
    typedef int (safe_dladdr_t)(int(*func)(), Dl_info *info);
    safe_dladdr_t *safe_dladdr = (safe_dladdr_t*)&dladdr;
    safe_dladdr(&ensure_loaded_with_rtld_global_mesa_client, &info);
    dlopen(info.dli_fname, RTLD_NOW | RTLD_NOLOAD | RTLD_GLOBAL);
    return 0;
}

This means that merely dlopening mesa.so.2 to probe it leaks it permanently: it stays resident because the constructor's dlopen has no matching dlclose. I've fixed that leak in:
   https://code.launchpad.net/~vanvugt/mir/fix-1527449/+merge/281969
but we would need:
  * the same fix in all prior Mir releases retrospectively; or
  * Mir to never even try to dlopen *.so.2 ; or
  * Users to use the workaround of simply removing all old mir-client-platform-mesa* packages from their systems.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, I've re-tested with a hacked version of mesa.so.2 where the above leak is fixed and confirmed lp:~vanvugt/mir/fix-1527449 is indeed the fix for this bug.

Unfortunately it's only fixable for future Mir versions because we're not about to release patches for Mir 0.14 etc.

So if you do hit this bug in the meantime, just use the workaround: remove all old versions of the package "mir-client-platform-mesa*" from your system. In future, when Mir 0.19 is the "old" version lingering on your machine alongside newer versions, the bug won't exist any more.

Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
status: Triaged → In Progress
summary: - mir may use incompatible client platform to validate server display
+ Mir may use incompatible client platform to validate server display and
+ crash in XGetXCBConnection()
Changed in mir:
milestone: 0.19.0 → 0.20.0
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: Mir may use incompatible client platform to validate server display and crash in XGetXCBConnection()

Workaround:

$ sudo apt-get remove mir-client-platform-mesa2 mir-client-platform-mesa1

And restart.
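
To check whether a system is exposed before removing anything, one option is to count how many client platform modules are present. The sketch below uses POSIX glob(); the function name count_matching_paths is made up for illustration, and the directory in the usage note is the one quoted earlier in this report, which may differ on other architectures.

```c
#include <assert.h>
#include <glob.h>
#include <stddef.h>

/* Count the filesystem paths matching a shell-style pattern.
 * Returns the number of matches (0 if there are none), or -1 if glob()
 * reports a read error or runs out of memory. */
static long count_matching_paths(const char *pattern)
{
    glob_t matches;
    long count;
    int rc;

    assert(pattern != NULL);

    rc = glob(pattern, 0, NULL, &matches);
    if (rc == 0)
        count = (long)matches.gl_pathc;
    else if (rc == GLOB_NOMATCH)
        count = 0;
    else
        count = -1;

    globfree(&matches);
    return count;
}
```

For example, a result greater than 1 from count_matching_paths("/usr/lib/x86_64-linux-gnu/mir/client-platform/mesa.so.?") would suggest that more than one version of the module is installed.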

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.20.0

Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
milestone: 0.20.0 → 0.19.0
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mir (Ubuntu):
status: New → Confirmed
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

We now have a /future/ fix landed. But that only works for the cases where the older Mir version is 0.19 and you have something newer than that installed on top of it. Still lacking a retrospective fix, which may just need to be a script or deb rules to force out the old broken packages like: sudo apt-get remove mir-client-platform-mesa2 mir-client-platform-mesa1

summary: - Mir may use incompatible client platform to validate server display and
- crash in XGetXCBConnection()
+ Mir clients (including Unity8 itself) crash in XGetXCBConnection() if
+ multiple versions of mir-client-platform-mesa are installed.
Changed in mir (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

mir (0.19.0+16.04.20160128-0ubuntu1) xenial; urgency=medium

Changed in mir:
status: Fix Committed → Fix Released
Changed in mir (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Updated workaround for xenial users who now have Mir 0.19.0:

$ sudo apt-get remove mir-client-platform-mesa3 mir-client-platform-mesa2 mir-client-platform-mesa1

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.20.0

Changed in mir:
status: Fix Released → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alan's enhanced fix has now landed for Mir 0.20.0, which should eliminate the need for the above workaround. It has also been backported in preparation for Mir 0.19.1.

no longer affects: mir/0.19