Here's an update. I think I understand now why gholt or notmyname didn't fix this bug in 10 minutes.
There is no "loop". It seems like the client is on CPU, but that's because the reaper thread is doing sleep(0.01). The actual problem is a hang in the worker thread. It hangs here:
It is invoked from here, much later:
for chunk in body: fp.write(chunk) read_length += len(chunk) if md5sum: md5sum.update(chunk) fp.close()
The worker thread goes into resp.read(65536) and hangs in there instead of getting an empty buf back. The resp here is a normal httplib.HTTPResponse, not overloaded or whatever.
Here's what strace looks like. First, here's a usual cycle, including -EAGAIN:
The socket closed on server side does not trigger the EPOLLIN -or- EPOLLHUP. Why? No idea. It may be just the way things are in Linux.
I think the right solution is to get the library to use a normal, blocking socket. I know from experience that that one works fine and returns the EOF as expected. We already have a thread, although a green one. I am open to suggestions what methods or classes to overload, or any other suggestions.
Here's an update. I think I understand now why gholt or notmyname didn't fix this bug in 10 minutes.
There is no "loop". It seems like the client is on CPU, but that's because the reaper thread is doing sleep(0.01). The actual problem is a hang in the worker thread. It hangs here:
def _object_body(): resp_chunk_ size)
yield buf resp_chunk_ size)
buf = resp.read(
while buf:
buf = resp.read(
object_body = _object_body()
It is invoked from here, much later:
fp. write(chunk)
read_ length += len(chunk)
if md5sum:
md5sum. update( chunk)
fp.close( )
for chunk in body:
The worker thread goes into resp.read(65536) and hangs in there instead of getting an empty buf back. The resp here is a normal httplib. HTTPResponse, not overloaded or whatever.
Here's what strace looks like. First, here's a usual cycle, including -EAGAIN:
7969 <... select resumed> ) = 0 (Timeout) \262\371Sbz( S\35g\276\ 230\242t\ 263\213i\ 374\316\ 20\335\ 253~\301\ 240v\343\ 245x".. ., 4096) = 4096 357\206V\ 241\245\ 303\321\ 34\373\ 337\217F\ 262,\224\ 240\237\ 260\265\ 220\370\ 207\r\264\ 364>D". .., 61440) = 61440 EPOLLPRI| EPOLLERR| EPOLLHUP, {u32=18, u64=14381426149 290934290} }) = 0 290934290} }}, 1023, 60000) = 1 EPOLLMSG| EPOLLERR| EPOLLHUP| 0x19cb020, {u32=0, u64=14381426149 290934272} }) = 0 y\2\201\ 373,\356\ 3559Zt\ 25FT\262q\ 250\205pvm\ 333\351J\ 215E\260\ 250N\360" ..., 65536, 0, NULL, NULL) = 27016 3604\325\ 307\333\ 355\352Z| ?\6m\303% \214\244\ 260\341\ n <\236x\ 330\3374. z?a"... , 3768, 0, NULL, NULL) = 3768 0x7f8db0056000, 32768, MADV_DONTNEED <unfinished ...>
7972 <... write resumed> ) = 1
7969 select(0, NULL, NULL, NULL, {0, 4000} <unfinished ...>
7972 write(19, "\276\321S;
7972 write(19, "7P\306kG\
7972 recvfrom(18, 0x7f8db0025164, 65536, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
7972 epoll_ctl(5, EPOLL_CTL_ADD, 18, {EPOLLIN|
7972 epoll_wait(5, <unfinished ...>
7969 <... select resumed> ) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2614}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
.............
7969 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2000} <unfinished ...>
7972 <... epoll_wait resumed> {{EPOLLIN, {u32=18, u64=14381426149
7972 epoll_ctl(5, EPOLL_CTL_DEL, 18, {EPOLLWRBAND|
7972 recvfrom(18, "\246R@
............
7972 recvfrom(18, "\263f\
7969 <... select resumed> ) = 0 (Timeout)
7972 madvise(
So, the library maintains a fiction of a blocking socket, by adding a polling block in case of -EAGAIN, then removing it when EPOLLIN occurs.
The hang occurs like this:
7972 recvfrom(18, "\v\245\ 375\227\ v\306E\ 357\366/ .N\t\317\ 366\333\ 340\3768\ 307\314\ 315\333\ 314j\373v\ 236\v\360\ 237R".. ., 65536, 0, NULL, NULL) = 65536 375\227\ v\306E\ 357\366/ .N\t\317\ 366\333\ 340\3768\ 307\314\ 315\333\ 314j\373v\ 236\v\360\ 237R".. ., 4096) = 4096 2458\252\ 331\315; \302O\vd\ 343\366\ 232\277\ 374\242wpPD\ 25^\241\ 0%L\357\ 215\213" ..., 61440) = 61440 EPOLLPRI| EPOLLERR| EPOLLHUP, {u32=18, u64=14381426149 290934290} }) = 0 ....... ....... .. hangs forever
7972 write(19, "\v\245\
7972 write(19, "\300\22J\
7972 recvfrom(18, 0x7f8db0025164, 65536, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
7972 epoll_ctl(5, EPOLL_CTL_ADD, 18, {EPOLLIN|
7972 epoll_wait(5, <unfinished ...>
7969 <... select resumed> ) = 0 (Timeout)
7969 select(0, NULL, NULL, NULL, {0, 2456}) = 0 (Timeout)
.......
The socket closed on server side does not trigger the EPOLLIN -or- EPOLLHUP. Why? No idea. It may be just the way things are in Linux.
I think the right solution is to get the library to use a normal, blocking socket. I know from experience that that one works fine and returns the EOF as expected. We already have a thread, although a green one. I am open to suggestions what methods or classes to overload, or any other suggestions.