packet storm with linux NFSv4 client when calling ftruncate()

Bug #480444 reported by Steve Langasek
This bug affects 10 people
Affects          Status     Importance  Assigned to  Milestone
bzr (Ubuntu)     Invalid    Undecided   Unassigned
  Lucid          Invalid    Undecided   Unassigned
  Oneiric        Invalid    Undecided   Unassigned
linux (Ubuntu)   Won't Fix  High        Unassigned
  Lucid          Invalid    Undecided   Unassigned
  Oneiric        Invalid    High        Unassigned

Bug Description

linux-image-2.6.31-15-generic 2.6.31-15.49 on amd64 - bug filed manually through LP instead of using apport, because the bug in question seems to have caused apport to hang while trying to collect information.

I've recently been experiencing a regression when using karmic as an NFSv4 client with GSSAPI authentication. I don't believe this problem was present at beta time, but I don't know whether it is the result of a kernel change or a change in other userspace software on the client side. The server hasn't changed recently; it has an uptime of 54 days and is running a Linux 2.6.30 kernel.

Share is mounted with options 'sec=krb5i,proto=tcp'. I've tried modifying the mount options by adding 'soft,rsize=1024,wsize=1024'; this doesn't appear to have any effect except to make network traces more readable.

When using bzr over NFS, I'm getting random uninterruptible hangs on the client side. Debugging appears to show that this happens when calling ftruncate() (sample size == 1 - reproducing this is very time-consuming since I have to reboot the client each time, and the client is my laptop). Network traces show a crazy packet storm - 13 Mbps of constant traffic over the wireless, pushing data at the NFS server.

Network trace to follow.
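
For reference, a minimal sketch of the kind of operation that triggers the hang - an extending ftruncate() on a file under the NFSv4 mount. The path and sizes here are placeholders, not taken from the report; the mount options in the comment match those described above:

    import os

    # Hypothetical path on the NFSv4 share (mounted with sec=krb5i,proto=tcp);
    # adjust to a file inside a bzr working tree to match the report.
    path = "/mnt/nfs4/testfile"

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        size = os.fstat(fd).st_size
        # Extending truncate: ask for more bytes than the file currently has.
        # This is the call under which the client was observed to hang.
        os.ftruncate(fd, size + 4096)
    finally:
        os.close(fd)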

Revision history for this message
Steve Langasek (vorlon) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
status: New → Triaged
Andy Whitcroft (apw)
tags: added: kernel-series-unknown
Andy Whitcroft (apw)
tags: added: kernel-karmic regression-release
removed: kernel-series-unknown
tags: added: karmic
removed: kernel-karmic
Revision history for this message
Steve Langasek (vorlon) wrote :

The problem is also reproducible with NFSv4 and 'krb5' auth instead of 'krb5i'. It is not reproducible with NFSv3, either with or without krb5 auth.

Revision history for this message
Steve Langasek (vorlon) wrote :

After upgrading to 2.6.32-13-generic in lucid, a test suggests that this problem has been resolved. At least, 'bzr commit' no longer blows up over NFSv4.

Changed in linux (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Steve Langasek (vorlon) wrote :

Hmm, no - I just hadn't run the right commands yet. A 'bzr status' just reproduced the problem again in lucid: packet storm on ftruncate.

Changed in linux (Ubuntu):
status: Fix Released → Triaged
Revision history for this message
Steve Langasek (vorlon) wrote :

(opening a task on the bzr package so this is findable via bug searches)

Changed in bzr (Ubuntu):
status: New → Invalid
summary: - packet storm with 2.6.31-15 NFSv4 client when calling ftruncate()
+ packet storm with linux NFSv4 client when calling ftruncate()
Revision history for this message
gpk (gpk-kochanski) wrote :

I looked at the packet storm (captured above), and it seems to be a rapid repetition of a single command and a single response. The command is a COMPOUND of PUTFH, WRITE, GETATTR, and the server responds with NFS4ERR_OPENMODE, which is in response to a failure of the WRITE.

Now, it is odd that there is a WRITE command as part of an ftruncate() syscall. The WRITE carries 1024 bytes of data each time. The offset (the place where it is supposed to be written) is always fairly small, but different each time.

There sometimes seem to be bursts of many of the COMPOUNDs.

My packet storm is attached (Wireshark .pcap file); it is not quite the same, but similar.

I get COMPOUND call <EMPTY tag> PUTFH; LOCK
... PUTFH; WRITE; GETATTR
... RENEW
... PUTFH; LOCK
         (tries to set a WRITE_LT lock)
... PUTFH; WRITE; GETATTR
         The write is always 543 bytes;
          The getattr asks for FATTR4_CHANGE and FATTR4_SIZE
                            and FATTR4_TIME_METADATA and FATTR4_TIME_MODIFY
... RENEW
... PUTFH; LOCK
... PUTFH; WRITE; GETATTR
... RENEW
et cetera
The "renew" calls succeed with NFS4_OK
PUTFH; LOCK succeeds with NFS4_OK (twice).
PUTFH; WRITE; GETATTR fails on the WRITE with NFS4ERR_OPENMODE

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 480444] Re: packet storm with linux NFSv4 client when calling ftruncate()

On Sat, Oct 23, 2010 at 09:42:27PM -0000, gpk wrote:
> I looked at the packet storm (captured above), and it seems to be a rapid repetition of a single
> command and a single response.
> The command is a COMPOUND of PUTFH, WRITE, GETATTR,
> and the server responds with NFS4ERR_OPENMODE which is
> in response to a failure of the WRITE.

> Now, it is odd that there is a WRITE command as part of a ftruncate()
> syscall. The WRITE carries 1024 bytes of data each time. The offset
> (place where it is supposed to be written) is always fairly small, but
> different each time.

If NFSv4 doesn't implement an ftruncate primitive, I don't think it's
strange at all to implement this on top of a write - this is probably an
extending ftruncate() call, asking the filesystem to reserve x more bytes
for the file in a block (for performance reasons) past the current end of
the file.

So I guess we have two separate client bugs here: one is that the client
fails to handle the write failure in a graceful manner (instead it tries the
request again... "insanity is doing the same thing over and over again and
expecting different results"); the other is that the write request is
somehow buggy and fails to have the desired effect of extending the file.

Given the semantics of ftruncate() listed in the manpage:

   If the file previously was shorter, it is extended, and the extended
   part reads as null bytes ('\0').

and given that there doesn't appear to be an ftruncate primitive in NFSv4
that we can rely on, it would seem that we need to ask the NFS server to
write all zeroes, in the range filesize+1..length.

Cheers,
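
A small illustration of the extending-ftruncate() semantics quoted above, as a sketch with an assumed local scratch path: after truncating a file to a larger size, the extended region reads back as null bytes.

    import os

    # Sketch only; the path is an assumed local scratch file.
    path = "/tmp/ftruncate-demo"

    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"hello")                  # file is now 5 bytes long
        os.ftruncate(fd, 16)                    # extend it to 16 bytes
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 16)
        assert data == b"hello" + b"\0" * 11    # extended part reads as '\0'
    finally:
        os.close(fd)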

Revision history for this message
gpk (gpk-kochanski) wrote : Re: [Bug 480444] Re: packet storm with linux NFSv4 client when calling ftruncate()

On 24/10/10 00:32, Steve Langasek wrote:
> On Sat, Oct 23, 2010 at 09:42:27PM -0000, gpk wrote:
>> I looked at the packet storm (captured above), and it seems to be a rapid repetition of a single
>> command and a single response.
>> The command is a COMPOUND of PUTFH, WRITE, GETATTR,
>> and the server responds with NFS4ERR_OPENMODE which is
>> in response to a failure of the WRITE.
> ...
> So I guess we have two separate client bugs here: one is that the client
> fails to handle the write failure in a graceful manner (instead it tries the
> request again... "insanity is doing the same thing over and over again and
> expecting different results"); the other is that the write request is
> somehow buggy and fails to have the desired effect of extending the file.

The first is clear.
The second is not obvious: there might be legitimate reasons for the write request to fail, such as an open with no write permissions, or maybe file locking. It suggests a bug somewhere, but the bug might not be in the write code itself; it could be something that happens at the very beginning of the storm. Possibly, it might even be in bzr.

There's something else that's slightly weird, though. In the other packet storm (mine), there is a RENEW call. RFC3010 says there is an implicit lease renewal caused by LOCK and WRITE calls, so why is the RENEW call needed at all?

I get COMPOUND call <EMPTY tag> PUTFH; LOCK
... PUTFH; WRITE; GETATTR
... RENEW
... PUTFH; LOCK
          (tries to set a WRITE_LT lock)
... PUTFH; WRITE; GETATTR
          The write is always 543 bytes;
           The getattr asks for FATTR4_CHANGE and FATTR4_SIZE
                             and FATTR4_TIME_METADATA and FATTR4_TIME_MODIFY
... RENEW
... PUTFH; LOCK
... PUTFH; WRITE; GETATTR
... RENEW
et cetera
The "renew" calls succeed with NFS4_OK
PUTFH; LOCK succeeds with NFS4_OK (twice).
PUTFH; WRITE; GETATTR fails on the WRITE with NFS4ERR_OPENMODE

Revision history for this message
Harry Flink (to-harryflink-from-launchpad) wrote :

I think we may have experienced the same bug as noted here because syslog shows sys_ftruncate().

I already posted a bug report here, so this is a crosspost, but I think this could be the more proper bug report for our issue:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/578866/comments/4

I tried to run "bzr status" on our couple-hundred-megabyte development branch on our new NFSv4 share. This causes Bazaar on the client to hang and produces errors on the latest Ubuntu 10.10 AMD64 kernel, 2.6.35-27-generic.

This message appears repeatedly (every 120 seconds) in the syslog:
Mar 14 17:25:46 hf kernel: [ 1080.078609] INFO: task bzr:2348 blocked for more than 120 seconds.
Mar 14 17:25:46 hf kernel: [ 1080.078611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 17:25:46 hf kernel: [ 1080.078614] bzr D 00000000ffffa30f 0 2348 2347 0x00000000
Mar 14 17:25:46 hf kernel: [ 1080.078618] ffff880341823c08 0000000000000086 ffff880300000000 0000000000015980
Mar 14 17:25:46 hf kernel: [ 1080.078623] ffff880341823fd8 0000000000015980 ffff880341823fd8 ffff880344e3db80
Mar 14 17:25:46 hf kernel: [ 1080.078626] 0000000000015980 0000000000015980 ffff880341823fd8 0000000000015980
Mar 14 17:25:46 hf kernel: [ 1080.078630] Call Trace:
Mar 14 17:25:46 hf kernel: [ 1080.078637] [<ffffffff81101c50>] ? sync_page+0x0/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078641] [<ffffffff81589093>] io_schedule+0x73/0xc0
Mar 14 17:25:46 hf kernel: [ 1080.078644] [<ffffffff81101c8d>] sync_page+0x3d/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078647] [<ffffffff8158996f>] __wait_on_bit+0x5f/0x90
Mar 14 17:25:46 hf kernel: [ 1080.078650] [<ffffffff81101e43>] wait_on_page_bit+0x73/0x80
Mar 14 17:25:46 hf kernel: [ 1080.078655] [<ffffffff8107faf0>] ? wake_bit_function+0x0/0x40
Mar 14 17:25:46 hf kernel: [ 1080.078659] [<ffffffff8110c605>] ? pagevec_lookup_tag+0x25/0x40
Mar 14 17:25:46 hf kernel: [ 1080.078662] [<ffffffff8110230d>] filemap_fdatawait_range+0x10d/0x1a0
Mar 14 17:25:46 hf kernel: [ 1080.078666] [<ffffffff811023cb>] filemap_fdatawait+0x2b/0x30
Mar 14 17:25:46 hf kernel: [ 1080.078669] [<ffffffff811026d4>] filemap_write_and_wait+0x44/0x50
Mar 14 17:25:46 hf kernel: [ 1080.078684] [<ffffffffa1802fec>] nfs_setattr+0x14c/0x160 [nfs]
Mar 14 17:25:46 hf kernel: [ 1080.078688] [<ffffffff8116c69b>] notify_change+0x16b/0x310
Mar 14 17:25:46 hf kernel: [ 1080.078692] [<ffffffff81152694>] do_truncate+0x64/0xa0
Mar 14 17:25:46 hf kernel: [ 1080.078695] [<ffffffff8115292b>] T.784+0xeb/0x110
Mar 14 17:25:46 hf kernel: [ 1080.078698] [<ffffffff8115295e>] sys_ftruncate+0xe/0x10
Mar 14 17:25:46 hf kernel: [ 1080.078703] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

Revision history for this message
Martin Pool (mbp) wrote :

To help us work out whether a workaround in bzr is possible, can someone please interrupt bzr at the point where this is happening, and post the Python traceback that should be written into ~/.bzr.log.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 480444] Re: packet storm with linux NFSv4 client when calling ftruncate()

On Mon, Mar 14, 2011 at 09:42:02PM -0000, Martin Pool wrote:
> To help us work out whether a workaround in bzr is possible, can someone
> please interrupt bzr at the point where this is happening, and post the
> Python traceback that should be written into ~/.bzr.log.

No, the process is uninterruptible when this happens... :)


Revision history for this message
gpk (gpk-kochanski) wrote :

Can't do that. It's non-interruptible even with kill -9.

There is output from strace at https://bugs.launchpad.net/bugs/665082

On 14/03/11 21:42, Martin Pool wrote:
> To help us work out whether a workaround in bzr is possible, can someone
> please interrupt bzr at the point where this is happening, and post the
> Python traceback that should be written into ~/.bzr.log.
>

Revision history for this message
Robert Collins (lifeless) wrote :

Can you get a core, perhaps?

Revision history for this message
gpk (gpk-kochanski) wrote : Re: [Bug 480444] Re: packet storm with linux NFSv4 client when calling ftruncate()

The only way I can think of would be by taking some kind of memory image via the kernel. Unless you know a way, I think you'd best be happy with the strace output.

On 15/03/11 01:13, Robert Collins wrote:
> can you get a core perhaps?
>

Revision history for this message
Pierre-Louis Bonicoli (pierre-louis-bonicoli) wrote :

Using bzr + NFSv4, I can reproduce this systematically (the "bzr status" command is not interruptible even with kill -9; running "umount -f ..." once or twice terminates the bzr command).
I added log statements (logs added with 'mutter' are stored in ~/.bzr.log) in "show_tree_status" from status.py. It appears that the last statement of show_tree_status is the one that blocks: "wt.unlock()" (http://bazaar.launchpad.net/~bzr-pqm/bzr/2.1/view/head:/bzrlib/status.py#L165).

Backtrace attached (done after "umount -f ..."). See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/665082/+attachment/1708220/+files/nfsbug2.txt too.
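
For anyone repeating this instrumentation, a rough sketch of the kind of logging described above, using bzrlib's mutter(); the function below is illustrative only, not the actual bzrlib code:

    from bzrlib.trace import mutter

    # Illustrative sketch: wrap the suspect unlock with mutter() calls so the
    # last message in ~/.bzr.log shows how far show_tree_status got.
    def show_tree_status_instrumented(wt):
        wt.lock_read()
        try:
            mutter("show_tree_status: tree locked, gathering status")
            # ... the real status-gathering code from bzrlib/status.py ...
        finally:
            mutter("show_tree_status: about to call wt.unlock()")
            wt.unlock()          # the call observed to block on NFSv4
            mutter("show_tree_status: wt.unlock() returned")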

Revision history for this message
Pierre-Louis Bonicoli (pierre-louis-bonicoli) wrote :

After updating some packages, I can no longer reproduce it:
 - server & client: nfs-common 1.2.3-1, linux-image-2.6.38-1
 - server: nfs-kernel-server 1.2.3, libgssglue 0.1-4
 - client: libgssglue 0.2-2

Revision history for this message
Steve Langasek (vorlon) wrote :

It's still reproducible for me with a 2.6.38 client and the current version of nfs-common in natty, which is 1.2.2. I am not at liberty to upgrade the server (and this is a bug with a misbehaving client, in any case).

Revision history for this message
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Steve Langasek (vorlon) wrote :

This bug is still present in every release since karmic, up to and including oneiric's 3.0 kernel.

tags: removed: karmic
Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Robert Collins (lifeless) wrote :

Reopening per slangasek's comment.

Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Revision history for this message
Brad Figg (brad-figg) wrote :

There was a bug in the bot that marked the bug watch task "Won't Fix". That has been addressed. I apologize for any inconvenience.

Revision history for this message
Brad Figg (brad-figg) wrote :

Ignore my last comment, sigh.

tags: added: rls-mgr-o-tracking
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in bzr (Ubuntu Lucid):
status: New → Confirmed
Changed in linux (Ubuntu Lucid):
status: New → Confirmed
Revision history for this message
Harry Flink (n-from-launchpad) wrote :

I'm not an NFS/Bazaar developer, so I'm not familiar with the details, but this bug is pretty serious in everyday use of NFS shares.

User "gpk" above mentioned that this has something to do with the write/locking requests being sent. Could there be an easy workaround for this bug, such as modifying some write sequences? I haven't hit this bug with any software other than Bazaar, so other applications apparently access the NFS shares differently and don't trigger it.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Steve Langasek (vorlon) wrote :

This remains a serious bug in the NFS implementation in Linux, that has persisted into the current releases. Reopening.

Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds. Please test the latest v3.7 kernel[0] (not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7-rc6-raring/

tags: added: kernel-da-key
dino99 (9d9)
Changed in linux (Ubuntu Oneiric):
status: Triaged → Invalid
Changed in linux (Ubuntu Lucid):
status: Confirmed → Invalid
Changed in bzr (Ubuntu Lucid):
status: Confirmed → Invalid