iperf3 hangs with -R and -Z flags #129

bmah888 · 2014-02-28T15:31:48Z

From [email protected] on December 20, 2013 14:51:23

When running the new test script (test_commands.sh), the iperf3 client hangs on 2 of the tests:

./src/iperf3 -c $host -P 2 -t 5 -R
and
./src/iperf3 -c $host -Z -t 5

And when you ^C the client, the server dies.

Original issue: http://code.google.com/p/iperf/issues/detail?id=129

bmah888 · 2014-02-28T15:31:49Z

From [email protected] on December 20, 2013 15:20:49

This happened on OSX, but Linux seems OK.

Labels: Milestone-3.0-Release

bmah888 · 2014-02-28T15:31:50Z

From [email protected] on December 22, 2013 07:09:16

This seems to reliably reproduce the problem on linux:

#!/bin/sh
set -x
while [ 1 ]
do
./src/iperf3 -P 2 -c localhost -t 5
./src/iperf3 -P 2 -c localhost -t 5 -R
done

It works for 3-6 loops and then locks up. (Once, the server crashed.)

Hopefully that will help track it down.

Owner: [email protected]
Labels: -Priority-Medium Priority-High

bmah888 · 2014-02-28T15:31:51Z

From [email protected] on December 24, 2013 08:15:42

Running the server in gdb shows that the server is crashing on this line:

Program received signal SIGSEGV, Segmentation fault.
0x000000305784812c in vfprintf () from /lib64/libc.so.6

Which is called from here:

1808 iprintf(test, report_sum_bw_retrans_format, start_time, end_time, ubuf, nbuf, retransmits, irp->omitted?report_omitted:"");

Maybe Sasant's new patch will fix this?

bmah888 · 2014-02-28T15:31:51Z

From [email protected] on December 24, 2013 09:26:55

I am also able to reproduce this. With the reverse (-R) option, the server crashes:

getsockopt(5, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\0(\21\3\0@\234\0\0\270\377\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
getsockopt(7, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\0(\21\3\0@\234\0\0\270\377\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
write(1, "- - - - - - - - - - - - - - - - "..., 50- - - - - - - - - - - - - - - - - - - - - - - - -
) = 50
write(1, "[  5]   8.02-9.00   sec   382 MB"..., 67[  5]   8.02-9.00   sec   382 MBytes  3.27 Gbits/sec    5         
) = 67
write(1, "[  7]   8.02-9.00   sec   381 MB"..., 67[  7]   8.02-9.00   sec   381 MBytes  3.26 Gbits/sec    0         
) = 67
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x5} ---
    killed by SIGSEGV (core dumped)    
Segmentation fault (core dumped)

(gdb) bt
#0 0x000000399144908f in vfprintf () from /lib64/libc.so.6
#1 0x000000000040542a in vprintf (__arg=0x7fffffffda08,
__fmt=0x4110e0 <report_sum_bw_retrans_format> "\340SUM] %6.2f-%-6.2f sec %ss %ss/sec", ' ' <repeats 14 times>, "%s\n") at /usr/include/bits/stdio.h:38
#2 iprintf (test=test@entry=0x617010, format=0x4110e0 <report_sum_bw_retrans_format> "\340SUM] %6.2f-%-6.2f sec %ss %ss/sec", ' ' <repeats 14 times>, "%s\n")
at iperf_api.c:2405
#3 0x000000000040618b in iperf_print_intermediate (test=test@entry=0x617010) at iperf_api.c:1808
#4 0x0000000000406468 in iperf_reporter_callback (test=0x617010) at iperf_api.c:2008
#5 0x000000000040c9ac in tmr_run (nowP=nowP@entry=0x7fffffffdd10) at timer.c:189
#6 0x0000000000409f43 in iperf_run_server (test=test@entry=0x617010) at iperf_server_api.c:586
#7 0x0000000000401e92 in run (test=0x617010) at main.c:116
#8 main (argc=, argv=0x7fffffffdf68) at main.c:91

(gdb) f 0
#0 0x000000399144908f in vfprintf () from /lib64/libc.so.6
(gdb) list
43 __STDIO_INLINE int
44 getchar (void)
45 {
46 return _IO_getc (stdin);
47 }
48
49
50 # ifdef __USE_MISC
51 /* Faster version when locking is not necessary. */
52 __STDIO_INLINE int

It looks like the stack is getting corrupted somewhere, which leads to the crash.
I need to dig further to find what is really causing it.

bmah888 · 2014-02-28T15:31:52Z

From [email protected] on December 24, 2013 11:12:44

I've been doing some digging into this. The hang and the crash might have two different causes, or might be two different manifestations of the same problem. Below are notes from a private email on this subject, in which I described what I saw with FreeBSD 10.0 and -R. There's a hang but no crash.

A slightly lower level symptom of this problem is that at the end of the
test, the client tries to send a TEST_END state change message to the
server over the control connection. When in -R mode, the server doesn't
seem to get it or read it reliably. However if I kill the client
(because it seems hung) the server immediately gets the TEST_END and
tries to do the end-of-test processing (it can't do this successfully
because at this point the client has died and closed its side of the
control connection).

In non -R mode this part all works as expected (I see the client send
the TEST_END and the server receives it immediately, as we would expect).

This is all on FreeBSD 10.0, client and server on the same machine (so
far it looks like the configuration where client and server are on the
same machine is particularly vulnerable to this problem).

bmah888 · 2014-02-28T15:31:52Z

From [email protected] on January 03, 2014 10:09:27

Partial fix committed in c499d0008f7d. There was basically a deadlock between the client and server in -R mode, see commit log for more details.

Not closing this yet...need to do some more tests to get a warm fuzzy feeling about the fix first. Also note that this doesn't address the server-side crashes that have been reported (but which I have not personally witnessed).

bmah888 · 2014-02-28T15:31:53Z

From [email protected] on January 03, 2014 10:38:48

Fixed the -P and -R server-side crash reported via Comments 2, 3, and 4, in 423166a54849. This only affected Linux; it was a mangled printf format string that only got used on that platform (it would have been used on any other platform with retransmit statistics, but there aren't currently any).

It's clear to me now that there were multiple issues being reported in this one bug. :-p

bmah888 · 2014-02-28T15:31:53Z

From AaronMatthewBrown on January 03, 2014 10:43:53

If gcc isn't spitting out warnings on format strings as const char variables, it'd probably make sense to turn the format strings into typedefs or something to ensure that gcc spits out a warning if this kind of mismatch happens.

bmah888 · 2014-02-28T15:31:54Z

From [email protected] on January 03, 2014 11:04:17

Good point. I don't see any warning messages for the format string mismatch (on a working copy rolled back to before my fix), but gcc isn't compiling with any warnings enabled either, as far as I can tell:

gcc -DHAVE_CONFIG_H -I. -g -O2 -MT iperf_api.o -MD -MP -MF .deps/iperf_api.Tpo -c -o iperf_api.o iperf_api.c

I'm not sure why this is...I'm used to living under -Wall and -Werror. Yet another thing to investigate.

bmah888 · 2014-02-28T15:31:54Z

From [email protected] on January 03, 2014 14:52:55

Update: Just one sub-issue remaining from this bug report...that's the hang with -Z. I've been able to observe this on Mac OS, as reported in the initial bug report. It doesn't happen every time, at least not on my MacBook; sometimes the -Z test works just fine.

So far I have not been able to reproduce this problem on my other two development platforms (FreeBSD 10 and CentOS 6).

It's not clear to me if there's something platform-specific lurking about or not, although the sendfile(2) call used by the -Z option is slightly different on the three platforms I've been using (therefore there are slightly different codepaths being used).

bmah888 · 2014-02-28T15:31:55Z

From bltierney on January 04, 2014 07:21:54

In my tests, OSX hangs every time. Linux is now working fine.

bmah888 · 2014-02-28T15:31:55Z

From [email protected] on January 21, 2014 13:08:21

Update: I'm still seeing this issue (but not consistently) on MacOS 10.8 and MacOS 10.9.

bmah888 · 2015-01-02T17:40:48Z

Somewhat prompted by issue #231, I retested this (MacOS, -Z flag TCP tests, mainline code) on MacOS 10.10.1. I did twelve 10-second tests and didn't see a single failure. I'm now running a bunch of 5-second tests in a tight loop; haven't seen anything yet. That doesn't mean the bug is gone, although it's doing much better than I remember ever seeing before.

bmah888 · 2015-01-05T22:30:01Z

By mutual agreement, @bltierney and I decided we should just close this bug, since it can't be reproduced (see previous comment).

bmah888 added this to the 3.0 milestone Feb 28, 2014

bmah888 removed the Milestone-3.0-Release label Feb 28, 2014

bmah888 added a commit that referenced this issue May 1, 2014

Update known issues for #55, #125, and #129.

b957be4

bmah888 removed the Priority-High label May 12, 2014

bmah888 self-assigned this May 12, 2014

bmah888 removed this from the 3.0 milestone Jun 10, 2014

bmah888 mentioned this issue Dec 22, 2014

Fix calculation of sendfile throughput on OSX #231

Merged

bmah888 closed this as completed Jan 5, 2015
