Comment 3 for bug 704015

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

Here's an update. I think I understand now why gholt or notmyname didn't fix this bug in 10 minutes.

There is no "loop". It seems like the client is on CPU, but that's because the reaper thread is doing sleep(0.01). The actual problem is a hang in the worker thread. It hangs here:

        def _object_body():
            buf = resp.read(resp_chunk_size)
            while buf:
                yield buf
                buf = resp.read(resp_chunk_size)
        object_body = _object_body()

It is invoked from here, much later:
                for chunk in body:
                    fp.write(chunk)
                    read_length += len(chunk)
                    if md5sum:
                        md5sum.update(chunk)
                fp.close()

The worker thread goes into resp.read(65536) and hangs in there instead of getting an empty buf back. The resp here is a normal httplib.HTTPResponse, not overloaded or whatever.

Here's what strace looks like. First, here's a usual cycle, including -EAGAIN:

7969 <... select resumed> ) = 0 (Timeout)
7972 <... write resumed> ) = 1
7969 select(0, NULL, NULL, NULL, {0, 4000} <unfinished ...>
7972 write(19, "\276\321S;\262\371Sbz(S\35g\276\230\242t\263\213i\374\316\20\335\253~\301\240v\343\245x"..., 4096) = 4096
7972 write(19, "7P\306kG\357\206V\241\245\303\321\34\373\337\217F\262,\224\240\237\260\265\220\370\207\r\264\364>D"..., 61440) = 61440
7972 recvfrom(18, 0x7f8db0025164, 65536, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
7972 epoll_ctl(5, EPOLL_CTL_ADD, 18, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=18, u64=14381426149290934290}}) = 0
7972 epoll_wait(5, <unfinished ...>
7969 <... select resumed> ) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2614}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
.............
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2000} <unfinished ...>
7972 <... epoll_wait resumed> {{EPOLLIN, {u32=18, u64=14381426149290934290}}}, 1023, 60000) = 1
7972 epoll_ctl(5, EPOLL_CTL_DEL, 18, {EPOLLWRBAND|EPOLLMSG|EPOLLERR|EPOLLHUP|0x19cb020, {u32=0, u64=14381426149290934272}}) = 0
7972 recvfrom(18, "\246R@y\2\201\373,\356\3559Zt\25FT\262q\250\205pvm\333\351J\215E\260\250N\360"..., 65536, 0, NULL, NULL) = 27016
............
7972 recvfrom(18, "\263f\3604\325\307\333\355\352Z|?\6m\303%\214\244\260\341\n <\236x\330\3374.z?a"..., 3768, 0, NULL, NULL) = 3768
7969 <... select resumed> ) = 0 (Timeout)
7972 madvise(0x7f8db0056000, 32768, MADV_DONTNEED <unfinished ...>

So, the library maintains a fiction of a blocking socket, by adding a polling block in case of -EAGAIN, then removing it when EPOLLIN occurs.

The hang occurs like this:

7972 recvfrom(18, "\v\245\375\227\v\306E\357\366/.N\t\317\366\333\340\3768\307\314\315\333\314j\373v\236\v\360\237R"..., 65536, 0, NULL, NULL) = 65536
7972 write(19, "\v\245\375\227\v\306E\357\366/.N\t\317\366\333\340\3768\307\314\315\333\314j\373v\236\v\360\237R"..., 4096) = 4096
7972 write(19, "\300\22J\2458\252\331\315;\302O\vd\343\366\232\277\374\242wpPD\25^\241\0%L\357\215\213"..., 61440) = 61440
7972 recvfrom(18, 0x7f8db0025164, 65536, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
7972 epoll_ctl(5, EPOLL_CTL_ADD, 18, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=18, u64=14381426149290934290}}) = 0
7972 epoll_wait(5, <unfinished ...>
7969 <... select resumed> ) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2456}) = 0 (Timeout)
....................... hangs forever

The socket closed on server side does not trigger the EPOLLIN -or- EPOLLHUP. Why? No idea. It may be just the way things are in Linux.

I think the right solution is to get the library to use a normal, blocking socket. I know from experience that that one works fine and returns the EOF as expected. We already have a thread, although a green one. I am open to suggestions what methods or classes to overload, or any other suggestions.