10.0 is breaking tgt autopkgtest

Bug #1951126 reported by Christian Ehrhardt 
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
glusterfs (Debian)
Fix Released
Unknown
glusterfs (Ubuntu)
Fix Released
Undecided
Unassigned
tgt (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

As seen here
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/amd64/t/tgt/20211116_095213_6e5fe@/log.gz

The log messages are misleading at best, when reproducing in autopkgtest
in a local VM I've found:

$ sudo /usr/sbin/tgtd -f
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x55abdf028b00
Aborted

It is only glusterfs, as this resolves the issue:
$ sudo apt remove tgt-glusterfs
$ sudo /usr/sbin/tgtd -f
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler
tgtd: bs_init(387) use signalfd notification

This sounds very much like
  https://bugs.launchpad.net/ubuntu/+source/glusterfs/+bug/1950777
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=999700

TODO:
- #1 does a rebuild of tgt against the new glusterfs fix the issue?
- #2 does a rebuild of glusterfs with tcmalloc disabled on x86 fix the issue?

Changed in glusterfs (Debian):
status: Unknown → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Tests:

#1 rebuild of tgt against the new glusterfs
Still failing:
$ sudo /usr/sbin/tgtd -f
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x55f3f061eac0
Aborted

#2 Rebuild of glusterfs with tcmalloc disabled on x86

This makes it work again.

ubuntu@autopkgtest:~$ systemctl status tgt
● tgt.service - (i)SCSI target daemon
     Loaded: loaded (/lib/systemd/system/tgt.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2021-11-17 08:13:54 CET; 2s ago
       Docs: man:tgtd(8)
    Process: 4595 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys --name State -v offline (code=exited, status=0/SUCCESS)
    Process: 4596 ExecStartPost=/usr/sbin/tgt-admin -e -c /etc/tgt/targets.conf (code=exited, status=0/SUCCESS)
    Process: 4601 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys --name State -v ready (code=exited, status=0/SUCCESS)
   Main PID: 4594 (tgtd)
     Status: "Starting event loop..."
      Tasks: 1
     Memory: 3.9M
        CPU: 69ms
     CGroup: /system.slice/tgt.service
             └─4594 /usr/sbin/tgtd -f

Nov 17 08:13:54 autopkgtest systemd[1]: Starting (i)SCSI target daemon...
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: work_timer_start(146) use timer_fd based scheduler
Nov 17 08:13:54 autopkgtest tgtd[4594]: tgtd: bs_init(387) use signalfd notification
Nov 17 08:13:54 autopkgtest systemd[1]: Started (i)SCSI target daemon.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Interestingly, after fixing it once (e.g. by installing glusterfs without tcmalloc) it does continue to work even when going back to the old version of glusterfs.

I've tried to purge and reinstall but it still works. As if only some init-once function is the trigger that makes it enter the bad path and no more happening now.

I've found the condition eventually, only when multiple backing tgt stores are installed.
For example tgt-rbd + tgt-glusterfs then the crash happens.
-rw-r--r-- 1 root root 18560 Feb 24 2021 bs_glfs.so
-rw-r--r-- 1 root root 22656 Feb 24 2021 bs_rbd.so

Quite likely those backing store modules trigger some init phase that - when run twice, but with tcmalloc present (on one of them) - crash. tgt-glusterfs alone does not fail "as hard" but still doesn't work.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Turns out tcmalloc does overwrite symbols and doing so from anything but the initial load on the main app will cause issues. Here tgt loads things normally, to then (once glusterfs is loaded) get partially re-mapped.

I've found many similar issues:
- https://github.com/gperftools/gperftools/issues/1066
- https://github.com/kcat/openal-soft/issues/134
- https://tracker.ceph.com/issues/23653
- https://bugzilla.redhat.com/show_bug.cgi?id=1569391

If I LD_PRELOAD tcmalloc right at the beginning of tgt

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Debugging confirms this, tgt uses dlopen to load its optional storage modules and from there into glusterfs the it crashes

(gdb) frame 20
#20 0x0000555555596d3c in bs_init_signalfd () at bs.c:286
(gdb) p soname
$1 = 0x555555dbe150 "/usr/lib/tgt/backing-store/bs_glfs.so"
(gdb) l

This is slightly more tricky in this case, as there are two conditions that can break it:
1. loading tcmalloc late conflicting with existing code (that was LP: #1950777 Debian #999700
2. But on x86 in the cases we had #1 didn't happen, yet there is another problem and that is when multiple libs want to insert tcmalloc (which happens here)

Ceph (via tgt-rbd) already can use tcmalloc and tgt is ready to tolerate that.
But not ALSO glusterfs uses tcmalloc.
So effectively tgt does dlopen on its libs which now both pull in tcmalloc.

We can see that in GDB
(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_glfs.so", mode=mode@entry=2) at dlopen.c:77
77 dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77 in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

And since this is name-based ordering we can turn the order around:

$ mv /usr/lib/tgt/backing-store/bs_glfs.so /usr/lib/tgt/backing-store/bs_xxglfs.so

And now:

(gdb) b dlopen
Breakpoint 1 at 0x14620
(gdb) run -f
Starting program: /usr/sbin/tgtd -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
tgtd: iser_ib_init(3431) Failed to initialize RDMA; load kernel modules?
tgtd: work_timer_start(146) use timer_fd based scheduler

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_xxglfs.so", mode=mode@entry=2) at dlopen.c:77
77 dlopen.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 1, ___dlopen (file=0x5555555be4a0 "/usr/lib/tgt/backing-store/bs_rbd.so", mode=mode@entry=2) at dlopen.c:77
77 in dlopen.c
(gdb) c
Continuing.
src/tcmalloc.cc:333] Attempt to free invalid pointer 0x5555555ecae0

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We could say "make tgt-rbd and tgt-glusterfs mutually exclusive" and that would fix the immediate issue, but TBH there might be many other cases which just do not happen to have an autopkgtest triggering it.
For now - until someone has a better idea how to prevent this kind of issues - all we can do is disable tcmalloc in glusterfs.
That will make 10.0 less of an interrupting upload and we can enable it later once we have a better approach to prevent such issues.

TL;DR - using tcmalloc in a lib (like libgf* of glusterfs) is calling for issues as anyone loading it late or multiple of them will likely crash the program.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: Updated the Debian bug and uploaded a fix there, this will sync to Jammy automatically.

Changed in glusterfs (Debian):
status: New → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glusterfs - 10.0-1.2ubuntu1

---------------
glusterfs (10.0-1.2ubuntu1) jammy; urgency=medium

  * d/rules: add -DUATOMIC_NO_LINK_ERROR to CFLAGS to fix armhf build
    (LP: #1951408)

 -- Andreas Hasenack <email address hidden> Thu, 25 Nov 2021 10:38:14 -0300

Changed in glusterfs (Ubuntu):
status: New → Fix Released
Changed in tgt (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.