st and client.py's reads aren't checking for EOF condition

Bug #704015 reported by gholt
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Expired
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

If a file's size changes during a read (or the EOF is reached earlier than expected for some other reason), st and client.py enter an endless CPU-eating loop.

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

This script fails to reproduce the problem: the download of the object always succeeds.

How would one go about reducing the size of the file?

Revision history for this message
John Dickinson (notmyname) wrote :

A guess on how I would test it (a rough code sketch follows the steps):

1) Upload file A with length X (this writes it to replicas I, J, and K)

2) fail replica K

3) Upload file B with length Y (where X>Y) to the same object name (this writes to replicas I and J)

4) restore replica K

5) begin downloading the object from replica K

6) have replication copy from replica I or J to replica K while the client is downloading the object
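
For concreteness, here is a rough driver for steps 1, 3 and 5 above, written against client.py's Connection class; steps 2, 4 and 6 are node-level operations and appear only as comments. The auth URL, credentials, container and object names are SAIO-style placeholders, and the Connection method signatures are assumed from the client.py of that era, so treat this purely as a sketch.

    # Hypothetical repro driver; nothing below comes from the bug report itself.
    from swift.common import client

    conn = client.Connection('http://127.0.0.1:8080/auth/v1.0',
                             'test:tester', 'testing')

    # 1) upload file A of length X (written to replicas I, J and K)
    with open('fileA', 'rb') as f:
        conn.put_object('repro', 'victim', f)

    # 2) fail replica K (e.g. stop the object server on that node) -- by hand

    # 3) upload a shorter file B under the same object name (only I and J get it)
    with open('fileB', 'rb') as f:
        conn.put_object('repro', 'victim', f)

    # 4) restore replica K -- by hand

    # 5) download the object, hoping the proxy serves this GET from replica K
    headers, body = conn.get_object('repro', 'victim', resp_chunk_size=65536)
    with open('downloaded', 'wb') as out:
        for chunk in body:
            # 6) while this loop runs, let replication overwrite K's stale copy
            out.write(chunk)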

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

Here's an update. I think I understand now why neither gholt nor notmyname fixed this bug in 10 minutes.

There is no "loop". It seems like the client is burning CPU, but that is only because the reaper thread is doing sleep(0.01). The actual problem is a hang in the worker thread. It hangs here:

        def _object_body():
            buf = resp.read(resp_chunk_size)
            while buf:
                yield buf
                buf = resp.read(resp_chunk_size)
        object_body = _object_body()

It is invoked from here, much later:
                for chunk in body:
                    fp.write(chunk)
                    read_length += len(chunk)
                    if md5sum:
                        md5sum.update(chunk)
                fp.close()

The worker thread goes into resp.read(65536) and hangs there instead of getting an empty buf back. The resp here is a plain httplib.HTTPResponse, not subclassed or wrapped in any way.
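
For what it's worth, here is a minimal client-side sketch of the two guards that are missing, assuming the Python 2-era httplib used above: a socket timeout on the connection so a stalled read raises socket.timeout instead of blocking forever, and a Content-Length check so a short body is reported instead of silently accepted. This is an illustration, not the eventual client.py fix.

    import httplib

    def read_object(host, port, path, chunk_size=65536, timeout=60):
        # timeout= puts a deadline on every socket operation, so a read that
        # never receives more data raises socket.timeout instead of hanging
        conn = httplib.HTTPConnection(host, port, timeout=timeout)
        conn.request('GET', path)
        resp = conn.getresponse()
        expected = int(resp.getheader('content-length', '-1'))
        received = 0
        buf = resp.read(chunk_size)
        while buf:
            received += len(buf)
            yield buf
            buf = resp.read(chunk_size)
        # if the server ended the body early, complain instead of staying quiet
        if expected >= 0 and received != expected:
            raise IOError('premature EOF: got %d of %d bytes'
                          % (received, expected))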

Here's what the strace looks like. First, a usual cycle, including the -EAGAIN path:

7969 <... select resumed> ) = 0 (Timeout)
7972 <... write resumed> ) = 1
7969 select(0, NULL, NULL, NULL, {0, 4000} <unfinished ...>
7972 write(19, "\276\321S;\262\371Sbz(S\35g\276\230\242t\263\213i\374\316\20\335\253~\301\240v\343\245x"..., 4096) = 4096
7972 write(19, "7P\306kG\357\206V\241\245\303\321\34\373\337\217F\262,\224\240\237\260\265\220\370\207\r\264\364>D"..., 61440) = 61440
7972 recvfrom(18, 0x7f8db0025164, 65536, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
7972 epoll_ctl(5, EPOLL_CTL_ADD, 18, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=18, u64=14381426149290934290}}) = 0
7972 epoll_wait(5, <unfinished ...>
7969 <... select resumed> ) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2614}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
.............
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2000} <unfinished ...>
7972 <... epoll_wait resumed> {{EPOLLIN, {u32=18, u64=14381426149290934290}}}, 1023, 60000) = 1
7972 epoll_ctl(5, EPOLL_CTL_DEL, 18, {EPOLLWRBAND|EPOLLMSG|EPOLLERR|EPOLLHUP|0x19cb020, {u32=0, u64=14381426149290934272}}) = 0
7972 recvfrom(18, "\246R@y\2\201\373,\356\3559Zt\25FT\262q\250\205pvm\333\351J\215E\260\250N\360"..., 65536, 0, NULL, NULL) = 27016
............
7972 recvfrom(18, "\263f\3604\325\307\333\355\352Z|?\6m\303%\214\244\260\341\n <\236x\330\3374.z?a"..., 3768, 0, NULL, NULL) = 3768
7969 <... select resumed> ) = 0 (Timeout)
7972 madvise(0x7f8db0056000, 32768, MADV_DONTNEED <unfinished ...>

So, the library maintains the fiction of a blocking socket: when recvfrom() returns -EAGAIN it registers the fd for polling, waits, and deregisters it once EPOLLIN arrives.
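
In other words, what the trace shows is the standard way to emulate a blocking read on a non-blocking socket. A generic sketch of the pattern (not eventlet's actual code) looks like this:

    import errno
    import select
    import socket

    def blocking_recv(sock, nbytes, poll_timeout=60.0):
        # sock is non-blocking; emulate a blocking recv() as in the trace above
        ep = select.epoll()
        try:
            while True:
                try:
                    return sock.recv(nbytes)            # recvfrom(...) in the trace
                except socket.error as e:
                    if e.errno != errno.EAGAIN:         # anything but -EAGAIN is fatal
                        raise
                ep.register(sock.fileno(),
                            select.EPOLLIN | select.EPOLLPRI)   # EPOLL_CTL_ADD
                ep.poll(poll_timeout)                           # epoll_wait(..., 60000)
                ep.unregister(sock.fileno())                    # EPOLL_CTL_DEL
        finally:
            ep.close()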

The hang occurs like this:

7972 recvfrom(18, "\v\245\375\227\v\306E\357\366/.N\t\317\366\333\340\3768\307\314\315\333\314j\373v\236\v\360\237R"..., 65536, 0, NULL, NULL) = 65536
7972 write(19, "\v\245\375\227\v\306E\357\366/.N\t\317\366\333\340\3768\307\314\315\333\314j\373v\236\v\360\237R"..., 4096) = 4096
7972 write(19, "\300\22J\2458\252\331\315;\302O\vd\343\366\232\277\374\242wpPD\25^\241\0%L\357\215\213"..., 61440) = 61440
7972 r...


Revision history for this message
Pete Zaitcev (zaitcev) wrote :

The problem that I reproduced (or "reproduced") using truncate(1) turned out not to have anything to do with the client at all. If the file is truncated, the DiskFile in the object server reaches EOF and quits. The proxy server does nothing about it either. However, the client does not know that there was an EOF, so it just sits there. Meanwhile, the servers keep their connections open and cached, and so we simply hang. In theory, there should be a timeout on an idle connection, but it does not happen. No matter: the object server has to signal the premature EOF, and the proxy server has to pass it on to the client. And the only way to do that in HTTP, once the headers have gone out, is to break the TCP connection.
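
For illustration, a hedged sketch of what that means on the object-server side, written as a plain WSGI-style body iterator (this is not the actual Swift code): once the status line and Content-Length have gone out, the only signal left is to abort the iterator, which typically ends with the server closing the TCP connection.

    def body_iter(data_file, content_length, chunk_size=65536):
        # Stream the object body, but refuse to end cleanly if the backing
        # file comes up short (e.g. it was truncated underneath us).
        sent = 0
        while sent < content_length:
            chunk = data_file.read(min(chunk_size, content_length - sent))
            if not chunk:
                # Premature EOF: raising here aborts the response mid-body,
                # which typically leaves the server no option but to close
                # the TCP connection.
                raise IOError('short read: sent %d of %d bytes'
                              % (sent, content_length))
            sent += len(chunk)
            yield chunk

Paired with a client that checks received bytes against Content-Length (as in the earlier sketch), the broken connection then surfaces as an error instead of a silent short download or a hang.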

Pete Zaitcev (zaitcev)
Changed in swift:
status: New → In Progress
assignee: nobody → Pete Zaitcev (zaitcev)
Revision history for this message
gholt (gholt) wrote :

I was just re-alerted to this bug report; sorry I haven't been paying attention.

I'm pretty sure the original bug was due to the file size changing for a file being uploaded.

It reads like someone mentioned it to me in a conversation and I dropped it in here real quick.

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

Very well. Since I am unable to make uploads and downloads race each other, we need a reproducer (either a script or a set of instructions); otherwise I may be chasing an imaginary bug, because I only "reproduce" it by tinkering with the backing store directly.

I'm going to put this away for now.

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

Saving WIP for future reference.

Changed in swift:
importance: Undecided → Low
Revision history for this message
Pete Zaitcev (zaitcev) wrote :

See also bug 736485: apparently Mike Barton had clients hanging forever too.

Revision history for this message
Tom Fifield (fifieldt) wrote :

This bug has been idle for a long time. Can it be reproduced in a supported version?

Changed in swift:
status: In Progress → Incomplete
assignee: Pete Zaitcev (zaitcev) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Object Storage (swift) because there has been no activity for 60 days.]

Changed in swift:
status: Incomplete → Expired