Reproducible hang in generic/430 with xfstest from upstream

Bug #1755999 reported by Po-Hsu Lin
This bug affects 1 person
Affects                Status     Importance  Assigned to  Milestone
ubuntu-kernel-tests    Confirmed  Undecided   Unassigned
linux (Ubuntu)         Triaged    Medium      Unassigned
  Bionic               New        Undecided   Unassigned
xfsprogs (Ubuntu)      New        Undecided   Unassigned
  Bionic               New        Undecided   Unassigned

Bug Description

While testing the latest xfstest from upstream, the generic/430 test hangs regardless of the filesystem used (ext4, xfs, or btrfs).

It looks like this issue is caused by the following command, which copies a range extending beyond the end of the file:
/usr/sbin/xfs_io -i -f -c "copy_range -s 4000 -l 2000 /home/ubuntu/test/test-430/file" "/home/ubuntu/test/test-430/beyond"

The copied file will have a correct MD5 as expected.
e68d4a150c4e42f4f9ea3ffe4c9cf4ed beyond

But the command will never return.
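
Because the command never returns, a scripted reproducer is easier to work with when it is wrapped in a timeout. The following is a minimal sketch, not part of the original test suite: the helper name `run_with_timeout` and the 30 second limit are assumptions made for illustration, while the command line is the one quoted above.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit status, or None if it had to be killed."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising the exception.
        return None


# Command line taken from the report above.
XFS_IO_CMD = [
    "/usr/sbin/xfs_io", "-i", "-f", "-c",
    "copy_range -s 4000 -l 2000 /home/ubuntu/test/test-430/file",
    "/home/ubuntu/test/test-430/beyond",
]

# Example use (assumed 30 second limit; None means the command hung):
#     status = run_with_timeout(XFS_IO_CMD, 30)
```

A return value of `None` would indicate that the command was still running when the limit expired, which is the behaviour described in this report.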

The size of test-430/file is 5000 bytes, so a copy_range call with source offset 4000 and length 1000 works, but any length greater than 1000 does not.
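
The arithmetic behind these numbers can be checked with a short sketch. It only models how many bytes lie between the source offset and the end of the file; it is not the kernel's or xfs_io's implementation, and the function name `bytes_copyable` is made up for this example.

```python
def bytes_copyable(file_size, offset, length):
    """Bytes a range copy can deliver before it reaches the end of the file."""
    return max(0, min(length, file_size - offset))


FILE_SIZE = 5000  # size of test-430/file, as stated in the report

# The request ends exactly at the end of the file.
print(bytes_copyable(FILE_SIZE, 4000, 1000))
# The request runs 1000 bytes past the end of the file.
print(bytes_copyable(FILE_SIZE, 4000, 2000))
```

Both calls yield 1000, so a request for 2000 bytes can never be satisfied in full; only the first 1000 bytes exist to be copied.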

Steps:
 1. Deploy a node with Bionic (should have a /dev/sdb available for the test)
 2. Run:
    sudo apt-get install git python-minimal -y
    git clone --depth=1 https://github.com/Cypresslin/autotest-client-tests.git -b kteam-xfstest-upstream
    git clone --depth=1 git://kernel.ubuntu.com/ubuntu/autotest
    rm -fr autotest/client/tests
    ln -sf ~/autotest-client-tests autotest/client/tests
 3. Run the test with the following command:
    AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/

(The test suite can be built manually, but it's easier to do this with the autotest framework.)

To run only this test after the test partition has been created on /dev/sdb:
    mkdir /home/ubuntu/test
    cd autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev
    sudo su
    export TEST_DIR=/home/ubuntu/test
    export TEST_DEV=/dev/sdb1
    ./check generic/430

Tested with the latest mainline kernel, 4.16.0-041600rc5-generic; the bug still exists.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-10-generic 4.15.0-10.11 [modified: boot/vmlinuz-4.15.0-10-generic]
ProcVersionSignature: User Name 4.15.0-10.11-generic 4.15.3
Uname: Linux 4.15.0-10-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1148 F.... pulseaudio
 /dev/snd/controlC1: ubuntu 1148 F.... pulseaudio
Date: Thu Mar 15 14:24:46 2018
InstallationDate: Installed on 2018-03-15 (0 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180228)
MachineType: Dell Inc. Dell Precision M3800
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic.efi.signed root=UUID=d1980d27-9063-4d92-aa10-1fb240453d8d ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-10-generic N/A
 linux-backports-modules-4.15.0-10-generic N/A
 linux-firmware 1.172
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/14/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A07
dmi.board.name: Dell Precision M3800
dmi.board.vendor: Dell Inc.
dmi.board.version: A07
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: Not Specified
dmi.modalias: dmi:bvnDellInc.:bvrA07:bd10/14/2014:svnDellInc.:pnDellPrecisionM3800:pvrA07:rvnDellInc.:rnDellPrecisionM3800:rvrA07:cvnDellInc.:ct8:cvrNotSpecified:
dmi.product.name: Dell Precision M3800
dmi.product.version: A07
dmi.sys.vendor: Dell Inc.

Po-Hsu Lin (cypressyew) wrote :
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote : Re: Reproducible hang in ext4 generic/430 with xfstest from upstream

A quick and dirty gdb session indicates that it is looping on the copy_file_range syscall (326).

$ grep copy /usr/include/asm/unistd_64.h
#define __NR_copy_file_range 326

gdb --args /usr/sbin/xfs_io -i -f -c "copy_range -s 4000 -l 2000 /home/ubuntu/test/test-430/file" "/home/ubuntu/test/test-430/beyond"

(gdb) catch syscall 326
Catchpoint 1 (syscall 326)
(gdb) run
Starting program: /usr/sbin/xfs_io -i -f -c copy_range\ -s\ 4000\ -l\ 2000\ /home/ubuntu/test/test-430/file /home/ubuntu/test/test-430/beyond
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7036700 (LWP 10450)]

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) continue
Continuing.

Thread 1 "xfs_io" hit Catchpoint 1 (returned from syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue
Continuing.

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue
Continuing.

Thread 1 "xfs_io" hit Catchpoint 1 (returned from syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue
Continuing.

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue
Continuing.
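
The trace above shows the syscall being entered and returning over and over. One pattern that would produce this, stated here as an assumption rather than as a confirmed reading of the xfs_io source, is a copy loop that keeps requesting the remaining bytes while the kernel reports zero bytes copied at end of file. The sketch below shows a loop that treats a zero return as end of file and stops. The function name `copy_range` and the `copy_fn` parameter are hypothetical; `os.copy_file_range` is the Python wrapper for the same syscall (Linux only, Python 3.8 or later).

```python
import os


def copy_range(src_fd, dst_fd, offset, length, copy_fn=None):
    """Copy up to `length` bytes starting at `offset` in the source.

    Returns the number of bytes actually copied, which is smaller than
    `length` when the requested range extends past the end of the source.
    """
    if copy_fn is None:
        copy_fn = os.copy_file_range  # Linux only, Python 3.8+
    copied = 0
    while copied < length:
        n = copy_fn(src_fd, dst_fd, length - copied, offset + copied, copied)
        if n == 0:
            # Zero bytes copied means end of file. Without this check the
            # loop would repeat the same call forever.
            break
        copied += n
    return copied
```

With a 5000-byte source, `copy_range(src_fd, dst_fd, 4000, 2000)` would be expected to return 1000 after two calls: the first copies the 1000 bytes that exist, and the second returns zero and ends the loop.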

Po-Hsu Lin (cypressyew) wrote :

There is a similar bug report [1] for NFS, and the fix has already landed in the Bionic kernel [2].

It is not the same issue, though: here copy_range works in every case except the "beyond" copy test.

[1] https://www.spinics.net/lists/linux-nfs/msg63817.html
[2] 6d3b5d8d8dd1c14f991ccab84b40f8425f1ae91b in Bionic tree

Po-Hsu Lin (cypressyew)
summary: - Reproducible hang in ext4 generic/430 with xfstest from upstream
+ Reproducible hang in generic/430 with xfstest from upstream
description: updated
Po-Hsu Lin (cypressyew) wrote :

A mainline kernel bisect shows that this hang was introduced between 4.9.87 and 4.10rc1.

With 4.9.87 this test fails to copy the file, but it does not get stuck.

# export TEST_DIR=/home/ubuntu/test ;export TEST_DEV=/dev/sdb1 ; ./check generic/430
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 M3800 4.9.87-040987-generic

generic/430 - output mismatch (see /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad)
    --- tests/generic/430.out 2018-03-15 12:26:40.285762490 +0800
    +++ /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad 2018-03-15 19:13:36.691401239 +0800
    @@ -4,22 +4,27 @@
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/file
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/copy
     Copy beginning of original file
    +cmp: EOF on /home/ubuntu/test/test-430/beginning which is empty
     md5sums after copying beginning:
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/file
    -cabe45dcc9ae5b66ba86600cca6b8ba8 TEST_DIR/test-430/beginning
    ...
    (Run 'diff -u tests/generic/430.out /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad' to see the entire diff)
Ran: generic/430
Failures: generic/430
Failed 1 of 1 tests

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: kernel-da-key
Po-Hsu Lin (cypressyew) wrote :

This issue is gone with xfsprogs 4.15.1.

Po-Hsu Lin (cypressyew) wrote :
Changed in ubuntu-kernel-tests:
status: New → Confirmed
Po-Hsu Lin (cypressyew)
tags: added: xfstests
Po-Hsu Lin (cypressyew) wrote :

This issue does not exist in D AMD64
    generic/430 2s

Po-Hsu Lin (cypressyew) wrote :

BTW this hang can be reproduced on the Bionic 5.0 kernel, on ext4 / btrfs / xfs.
So it might have something to do with the userspace tools as well.

Po-Hsu Lin (cypressyew)
tags: added: ubuntu-xfstests-btrfs ubuntu-xfstests-ext4 ubuntu-xfstests-xfs
Sean Feole (sfeole)
tags: added: sru-20200127
tags: added: 4.15 5.0
Po-Hsu Lin (cypressyew) wrote :

Passed with Focal 5.4 (5.4.0-31.35), with just 1 second to run:
generic/430 1s

lilideng (lilideng) wrote :

generic/430 still hangs on Ubuntu 18.04 on Azure; the kernel version is 5.4.0-1055-azure and the filesystem type is xfs.
