Bug #1951126 “10.0 is breaking tgt autopkgtest” : Bugs : glusterfs package : Ubuntu

Bug Watch Updater (bug-watch-updater) on 2021-11-16

Changed in glusterfs (Debian):
status:	Unknown → New

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#1

Tests:

#1 rebuild of tgt against the new glusterfs
Still failing:
$ sudo /usr/sbin/tgtd -f
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x55f3f061eac0
Aborted

#2 Rebuild of glusterfs with tcmalloc disabled on x86

This makes it work again.

ubuntu@autopkgtest:~$ systemctl status tgt
● tgt.service - (i)SCSI target daemon
     Loaded: loaded (/lib/systemd/system/tgt.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-11-17 08:13:54 CET; 2s ago
       Docs: man:tgtd(8)
    Process: 4595 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys --name State -v offline (code=exited, status=0/SUCCESS)
    Process: 4596 ExecStartPost=/usr/sbin/tgt-admin -e -c /etc/tgt/targets.conf (code=exited, status=0/SUCCESS)
    Process: 4601 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys --name State -v ready (code=exited, status=0/SUCCESS)
   Main PID: 4594 (tgtd)
     Status: "Starting event loop..."
      Tasks: 1
     Memory: 3.9M
        CPU: 69ms
     CGroup: /system.slice/tgt.service
             └─4594 /usr/sbin/tgtd -f

Nov 17 08:13:54 autopkgtest systemd[1]: Starting (i)SCSI target daemon...
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: work_timer_start(146) use timer_fd based scheduler
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: bs_init(387) use signalfd notification
Nov 17 08:13:54 autopkgtest systemd[1]: Started (i)SCSI target daemon.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#2

Interestingly, after fixing it once (e.g. by installing glusterfs without tcmalloc) it does continue to work even when going back to the old version of glusterfs.

I've tried to purge and reinstall but it still works. As if only some init-once function is the trigger that makes it enter the bad path and no more happening now.

I've found the condition eventually, only when multiple backing tgt stores are installed.
For example tgt-rbd + tgt-glusterfs then the crash happens.
-rw-r--r-- 1 root root 18560 Feb 24 2021 bs_glfs.so
-rw-r--r-- 1 root root 22656 Feb 24 2021 bs_rbd.so

Quite likely those backing store modules trigger some init phase that - when run twice, but with tcmalloc present (on one of them) - crash. tgt-glusterfs alone does not fail "as hard" but still doesn't work.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#3

Turns out tcmalloc does overwrite symbols and doing so from anything but the initial load on the main app will cause issues. Here tgt loads things normally, to then (once glusterfs is loaded) get partially re-mapped.

I've found many similar issues:
- https://github.com/gperftools/gperftools/issues/1066
- https://github.com/kcat/openal-soft/issues/134
- https://tracker.ceph.com/issues/23653
- https://bugzilla.redhat.com/show_bug.cgi?id=1569391

If I LD_PRELOAD tcmalloc right at the beginning of tgt

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#4

Debugging confirms this, tgt uses dlopen to load its optional storage modules and from there into glusterfs the it crashes

(gdb) frame 20
#20 0x0000555555596d3c in bs_init_signalfd () at bs.c:286
(gdb) p soname
$1 = 0x555555dbe150 "/usr/lib/tgt/backing-store/bs_glfs.so"
(gdb) l

This is slightly more tricky in this case, as there are two conditions that can break it:
1. loading tcmalloc late conflicting with existing code (that was LP: #1950777 Debian #999700
2. But on x86 in the cases we had #1 didn't happen, yet there is another problem and that is when multiple libs want to insert tcmalloc (which happens here)

Ceph (via tgt-rbd) already can use tcmalloc and tgt is ready to tolerate that.
But not ALSO glusterfs uses tcmalloc.
So effectively tgt does dlopen on its libs which now both pull in tcmalloc.

We can see that in GDB
(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_glfs.so", mode=mode@entry=2) at dlopen.c:77
77 dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77 in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

And since this is name-based ordering we can turn the order around:

$ mv /usr/lib/tgt/backing-store/bs_glfs.so /usr/lib/tgt/backing-store/bs_xxglfs.so

And now:

(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_xxglfs.so", mode=mode@entry=2) at dlopen.c:77
77 dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77 in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

Debugging confirms this, tgt uses dlopen to load its optional storage modules and from there into glusterfs the it crashes

(gdb) frame 20
#20 0x0000555555596d3c in bs_init_signalfd () at bs.c:286
(gdb) p soname
$1 = 0x555555dbe150 "/usr/lib/tgt/backing-store/bs_glfs.so"
(gdb) l

This is slightly more tricky in this case, as there are two conditions that can break it:
1. loading tcmalloc late conflicting with existing code (that was LP: #1950777 Debian #999700
2. But on x86 in the cases we had #1 didn't happen, yet there is another problem and that is when multiple libs want to insert tcmalloc (which happens here)

Ceph (via tgt-rbd) already can use tcmalloc and tgt is ready to tolerate that.
But not ALSO glusterfs uses tcmalloc.
So effectively tgt does dlopen on its libs which now both pull in tcmalloc.

We can see that in GDB
(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_glfs.so", mode=mode@entry=2) at dlopen.c:77
77	dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77	in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

And since this is name-based ordering we can turn the order around:

$ mv /usr/lib/tgt/backing-store/bs_glfs.so /usr/lib/tgt/backing-store/bs_xxglfs.so

And now:

(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_xxglfs.so", mode=mode@entry=2) at dlopen.c:77
77	dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77	in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#5

We could say "make tgt-rbd and tgt-glusterfs mutually exclusive" and that would fix the immediate issue, but TBH there might be many other cases which just do not happen to have an autopkgtest triggering it.
For now - until someone has a better idea how to prevent this kind of issues - all we can do is disable tcmalloc in glusterfs.
That will make 10.0 less of an interrupting upload and we can enable it later once we have a better approach to prevent such issues.

TL;DR - using tcmalloc in a lib (like libgf* of glusterfs) is calling for issues as anyone loading it late or multiple of them will likely crash the program.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2021-11-17:

#6

FYI: Updated the Debian bug and uploaded a fix there, this will sync to Jammy automatically.

Bug Watch Updater (bug-watch-updater) on 2021-11-17

Changed in glusterfs (Debian):
status:	New → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2021-12-06:

#7

This bug was fixed in the package glusterfs - 10.0-1.2ubuntu1

---------------
glusterfs (10.0-1.2ubuntu1) jammy; urgency=medium

* d/rules: add -DUATOMIC_NO_LINK_ERROR to CFLAGS to fix armhf build
(LP: #1951408)

-- Andreas Hasenack <email address hidden> Thu, 25 Nov 2021 10:38:14 -0300

Changed in glusterfs (Ubuntu):
status:	New → Fix Released

Christian Ehrhardt  (paelzer) on 2023-03-24

Changed in tgt (Ubuntu):
status:	New → Fix Released

	Status	Importance	Assigned to
glusterfs (Debian)	Fix Released	Unknown	debbugs #999619
glusterfs (Ubuntu)	Fix Released	Undecided	Unassigned
tgt (Ubuntu)	Fix Released	Undecided	Unassigned

Ubuntu
glusterfs package

10.0 is breaking tgt autopkgtest

Bug Description

Other bug subscribers

Remote bug watches

Ubuntuglusterfs package

10.0 is breaking tgt autopkgtest

Bug Description

Other bug subscribers

Remote bug watches

Ubuntu
glusterfs package