Bug 248 - Newer version of tcp-2way that exhibits an 'excess bytes' problem
Newer version of tcp-2way that exhibits an 'excess bytes' problem
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: internet
pre-release
PC Linux
: P1 normal
Assigned To: ns-bugs
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-07-07 20:09 UTC by evensky
Modified: 2008-07-29 15:51 UTC (History)
1 user (show)

See Also:


Attachments
test case that shows the bug. (32.11 KB, text/plain)
2008-07-07 20:20 UTC, evensky
Details
output with the 'bad timings' showing weird number of bytes. (10.65 KB, text/plain)
2008-07-15 01:17 UTC, evensky
Details
output with the 'good/regular timings' that runs correctly. (14.04 KB, text/plain)
2008-07-15 01:18 UTC, evensky
Details
corrected tcp-2way-2.cc (32.12 KB, text/plain)
2008-07-18 16:08 UTC, evensky
Details
even more correct version of tcp-2way-2.cc (32.46 KB, text/plain)
2008-07-21 14:45 UTC, evensky
Details
Add some asserts that show the logical inconsitency (1.19 KB, patch)
2008-07-28 16:17 UTC, Rajib Bhattacharjea
Details | Diff
fixes the issue (2.71 KB, patch)
2008-07-28 17:11 UTC, Rajib Bhattacharjea
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description evensky 2008-07-07 20:09:50 UTC
My real code exhibits a problem that during a run it send 10000 bytes of payload and on the receiving side seems to have 10043 bytes of payload. This causes an error when I try to do a RemoveHeader, but there isn't a header there, but some left over payload. I modified tcp-2way to have two options. The significant option that shows the bug is:
   --useBadTimings=1

which is false by default.

What this does is change the way packets are scheduled. In the old tcp-2way packets were all pre-scheduled before Simulator::Run() is called. The times that the packets were to be sent was by incrementing the time regularly. Using the --useBadTimings=1 option causes the timings to be the same that exists in my real app (at least until it dies).

By changing useBadTimings from 0 to 1 you can cause the program to fail.

program and differences from expected output to follow.
Comment 1 evensky 2008-07-07 20:20:27 UTC
Created attachment 194 [details]
test case that shows the bug.

Run as:
./tcp-2way-2 --useDummyBytes=0 --recvSize=600 --preSchedule=1 --useBadTimings=0

to have correct results. It has output like:
...
 536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|381>
  381:I{381} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|381|0>
  Finished Packet.  from handleread: cmpl_recv(0.328662,0,0,10000,1) Read 10000 of <I>
simulator::cmpl_recv(0.328662,0,0,10000,1)
....

where all 10000 bytes were read, and there is no left over bytes (<0|) in the packet, so processing on the packet ends.

In the bad case:
./tcp-2way-2 --useDummyBytes=0 --preSchedule=1 --recvSize=600 --useBadTimings=1
...
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|338>
  381:I{381} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <43|338|0>
  Finished Packet.  from handleread: cmpl_recv(0.0221596,0,0,10000,1) Read 10000 of <I>
simulator::cmpl_recv(0.0221596,0,0,10000,1)
   43:I{43}  bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] Removing unexpected header.
....

Notice that there appears to be 43bytes left over, and in the rle of the PeekData in the packet there are 43 bytes of payload 'I'. What is confusing is that the program send 10000 bytes of 'I' as payload but there appears to be 10043 bytes of 'I' as payload.

Also note with the 'bad' timing we are seeing both directions being interleaved here.

Thanks,
\dae
Comment 2 evensky 2008-07-08 01:31:14 UTC
A few more comments: It looks like differences arise on the unfortunate payload ('I' filled) at the start. For some reason the initial packet is larger than the others. There is nothing wrong with that, the recvSize is 600, but it is odd:

The without the useBadTimings we have at the start of the payload:
Before Marking|Payload (size=10000)
After Header Marking|ns3::ArchHeader ((ArchHeader=I:10000:0:0:0)) Payload (size=10000)
my_sim->cmpl_send(0.32,0,0,10000,0)
  536:.'.{26}I{508} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|507(29)|9493>
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|8957>
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|8421>
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|7885>
....

showing the send of 10000 bytes and the first few chunks to be read in. Note that in this case we get 536 bytes of which 507 are payload. (the rle has an 'I' as part of the header which is why there are 508 shown in the rle dump).

If we turn on the useBadTimings, we get:
Before Marking|Payload (size=10000)
After Header Marking|ns3::ArchHeader ((ArchHeader=I:10000:0:0:0)) Payload (size=10000)
my_sim->cmpl_send(0.00296172,0,0,10000,0)
  493:H{493} *bytes from 10.1.0.2 [02-06-0a:01:00:02:20:00] <0|493|0>
  Finished Packet.  from handleread: cmpl_recv(0.00296172,0,0,1000,0) Read 1000 of <H>
simulator::cmpl_recv(0.00296172,0,0,1000,0)
  579:.'.{26}I{551} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|550(29)|9450>
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|8914>
  536:I{536} *bytes from 10.1.0.1 [02-06-0a:01:00:01:01:c0] <0|536|8378>

note here we have the last bit of the 'H' filled payload interleaved between the send of the 1000 'I' filled bytes, and the recv of those bytes. Also, we have 43 more bytes in that first chunk, again still less that the 600 we allowed on the command line, but odd that it is different. Also odd that 43 is the number of bytes we seem to have in excess.
Comment 3 evensky 2008-07-08 01:47:33 UTC
one last note tonight. I have if def'ed out the use of the SendCallback during my attempts to debug. However, the results don't seem to have when I define USE_HANDLESEND to put that code back in.
Comment 4 Mathieu Lacage 2008-07-08 11:27:34 UTC
ccing raj
Comment 5 Rajib Bhattacharjea 2008-07-08 11:28:39 UTC
I believe I'm not even getting to the erroneous point, but getting crasher
before even getting to this point (using today's ns-3-dev):
./tcp-2way-2 --useDummyBytes=0 --recvSize=600 --preSchedule=1 --useBadTimings=0
tcpip_configure_pt2pt_channel(0,600,)
Prescheduling Packets.
ScheduleNexttime=0,tag=A,size=0
ScheduleNexttime=0.04,tag=B,size=0
ScheduleNexttime=0.08,tag=C,size=10
ScheduleNexttime=0.12,tag=D,size=10
ScheduleNexttime=0.16,tag=E,size=100
ScheduleNexttime=0.2,tag=F,size=100
ScheduleNexttime=0.24,tag=G,size=1000
ScheduleNexttime=0.28,tag=H,size=1000
ScheduleNexttime=0.32,tag=I,size=10000
ScheduleNexttime=0.36,tag=J,size=10000
subsequent ScheduleNext() REUSING time 0.36
ScheduleNext() returns false
Before Marking|
assert failed. file=/home/raj/code.nsnam.org/ns-3-dev/build/debug/ns3/buffer.h,
line=546, cond="Check (m_current)"
Trace/breakpoint trap

In gdb:
Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0xb6cc4720 (LWP 23707)]
ns3::Buffer::Iterator::WriteU8 (this=0xbf891e54, data=0 '\0') at
/home/raj/code.nsnam.org/ns-3-dev/build/debug/ns3/buffer.h:548
548       if (m_current < m_zeroStart)
(gdb) bt
#0  ns3::Buffer::Iterator::WriteU8 (this=0xbf891e54, data=0 '\0') at
/home/raj/code.nsnam.org/ns-3-dev/build/debug/ns3/buffer.h:548
#1  0xb7b93ed7 in ns3::Buffer::Iterator::WriteU64 (this=0xbf891e54, data=0) at
../src/common/buffer.cc:802
#2  0x0805e7bb in ns3::ArchHeader::Serialize (this=0xbf892068, i=
      {m_zeroStart = 17, m_zeroEnd = 17, m_dataStart = 0, m_dataEnd = 17,
m_current = 18, m_data = 0x808a9b0 ""}) at tcp-2way-2.cc:330
#3  0xb7bc6348 in ns3::Packet::AddHeader (this=0x808a968, header=@0xbf892068)
at ../src/common/packet.cc:190
#4  0x0805bc88 in ns3::TalkerApp::DoSendPacket (this=0x8089240,
sock=@0xbf892178, fill=65 'A', payloadsize=0, msgid=0, src=0, dest=0)
    at tcp-2way-2.cc:509
#5  0x0805c1e5 in ns3::TxTalkerApp::SendPacket (this=0x8089240, fill=65 'A',
payloadsize=0, msgid=0, src=0, dest=0)
    at tcp-2way-2.cc:964
#6  0x08056d0b in Notify (this=0x80894c0) at
/home/raj/code.nsnam.org/ns-3-dev/build/debug/ns3/simulator.h:813
#7  0xb7b67186 in ns3::EventImpl::Invoke (this=0x80894c0) at
../src/simulator/event-impl.cc:39
#8  0xb7b6e909 in ns3::SimulatorPrivate::ProcessOneEvent (this=0x8087098) at
../src/simulator/simulator.cc:207
#9  0xb7b6e95b in ns3::SimulatorPrivate::Run (this=0x8087098) at
../src/simulator/simulator.cc:235
#10 0xb7b71286 in ns3::Simulator::Run () at ../src/simulator/simulator.cc:480
#11 0x0805f89c in main (argc=5, argv=0xbf892484) at tcp-2way-2.cc:1305

This looks to be a bug in the ArchHeader code that's blocking investigation of
the actual issue.  Can you specify a revision number of ns-3-dev which exhibits
the issue?
Comment 6 evensky 2008-07-08 11:34:31 UTC
raj, how do I find out what version/rev of ns-3-dev I'm using?
Comment 7 evensky 2008-07-08 11:40:47 UTC
raj, is this what you need:

evensky:~/ns-3-dev$ hg head
changeset:   3383:f21d3d5926b9
tag:         tip
user:        Craig Dowell <craigdo@ee.washington.edu>
date:        Thu Jul 03 20:03:03 2008 -0700
summary:     some slips of the keyboard in tutorial

Comment 8 Rajib Bhattacharjea 2008-07-08 11:50:30 UTC
okay, i still get the assertion failure on x86 ubuntu, g++ (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7).

looks like i'll have to debug this issue before we get to the erroneous amount of data issue.
Comment 9 evensky 2008-07-08 12:32:51 UTC
Raj, I did a hg pull/update to get the current version (can't sit still too long with ns-3-dev!) and I still see the same behavior and no crash re ArchHeader that you are seeing. I'm using a current version of the day Debian Sid but with a hand rolled kernel 2.6.24.7 and g++: g++ (GCC) 4.2.4 (Debian 4.2.4-3) on a Lenovo X61s
(core2 duo).


Comment 10 evensky 2008-07-08 19:52:19 UTC
FYI, for kicks I recompiled current ns-3-dev (3397:cad55f67f3fc) and my test code with g++ (Debian 4.3.1-5) 4.3.1, but it had no effect. I was seeing maybe I could duplicate the bug that Raj is seeing on his setup.
Comment 11 Rajib Bhattacharjea 2008-07-14 14:44:11 UTC
Okay, so I guess we need more context, because I'm on x86 OS X 10.4, g++ 4.0.1 (Apple Computer, Inc. build 5363), and I still get the same crasher.  I dropped the script into the scratch directory, and had to remove an unused local variable.  I also tried building with:

g++ -DNS3_ASSERT_ENABLE -DNS3_LOG_ENABLE -DDEBUG -g -O0 -I/Users/Raj/code.nsnam.org/ns-3-dev/build/debug -I. -L/Users/Raj/code.nsnam.org/ns-3-dev/build/debug -lns3 -o tcp-2way bug248tcp-2way.cc

Both give me the same crasher.

Can someone download this attachment fresh and compile to get the actual error we are after?
Comment 12 evensky 2008-07-15 01:15:31 UTC
I may not be the person you are looking at to reproduce, but:

evensky@waltz$ md5sum tcp-2way-2.cc 
2ab7aa373022bc4d232d027fef23df11  tcp-2way-2.cc
evensky@waltz$ g++ -o tcp-2way-2 tcp-2way-2.cc -g -O0 -Wall -DNS3_ASSERT_ENABLE -DNS3_LOG_ENABLE -DDEBUG -I/home/evensky/ns-3-dev/build/debug  -L/home/evensky/ns-3-dev/build/debug -Wl,-rpath=/home/evensky/ns-3-dev/build/debug -lns3
tcp-2way-2.cc: In function 'void setup_pt2pt(bool, bool, bool, uint32_t, uint32_t, ns3::ApplicationContainer*)':
tcp-2way-2.cc:1151: warning: unused variable 'count'

evensky@waltz$ ./tcp-2way-2 --useDummyBytes=0 --preSchedule=1 --useBadTimings=1 --doLogging=1 > bad.out 2>&1
Trace/breakpoint trap

evensky@waltz$ ./tcp-2way-2 --useDummyBytes=0 --preSchedule=1 --useBadTimings=0 --doLogging=1 > good.out 2>&1
Comment 13 evensky 2008-07-15 01:17:21 UTC
Created attachment 200 [details]
output with the 'bad timings' showing weird number of bytes.
Comment 14 evensky 2008-07-15 01:18:27 UTC
Created attachment 201 [details]
output with the 'good/regular timings' that runs correctly.
Comment 15 evensky 2008-07-18 13:34:54 UTC
Raj, I found a live-cd with g++ (KNOPPIX_V5.3.1DVD-2008-03-26-EN.iso) and have confirmed the bug you are seeing in in ArchHeader. I don't know why my Debian sid environment doesn't show this bug. I'll rework tcp-2way-c.cc.

Comment 16 Rajib Bhattacharjea 2008-07-18 13:38:00 UTC
Thank goodness!  I really want to get to the bottom of your issue, BUT it is also very troublesome that we are getting this platform dependence issue.
Comment 17 evensky 2008-07-18 15:31:34 UTC
I'm wondering what is going on here...

I just tried: Red Hat Enterprise Linux Server release 5.1 (Tikanga) system using
g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-38), and it works like my Debian sid system. I've also tried a 32bit Debian sid system, and that does show the behavior
you've seen (ArchHeader bug). I'll do my development on that, but I wonder if the
platform dependency is also a bug....
Comment 18 Rajib Bhattacharjea 2008-07-18 15:53:42 UTC
(In reply to comment #17)
> I'm wondering what is going on here...
> 
> I just tried: Red Hat Enterprise Linux Server release 5.1 (Tikanga) system
> using
> g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-38), and it works like my Debian sid
> system. I've also tried a 32bit Debian sid system, and that does show the
> behavior
> you've seen (ArchHeader bug). I'll do my development on that, but I wonder if
> the
> platform dependency is also a bug....
> 

Yes, that is precisely what I am thinking, and its really worrisome to have this kind of platform dependent behavior.  Does this seem like a 32/64 bit issue?  Were the machines it works on all using e.g. x86-64?
Comment 19 evensky 2008-07-18 16:08:22 UTC
Created attachment 203 [details]
corrected tcp-2way-2.cc

I had some sizeof(long) rather than sizeof(uint64_t) corresponding to WriteU64. It unluckily worked on 64 versions of linux, but failed on the Mac's gcc, and 32 bit versions of Knoppix and Debian sid.
Comment 20 Mathieu Lacage 2008-07-21 11:43:12 UTC
(In reply to comment #19)
> Created an attachment (id=203) [details]
> corrected tcp-2way-2.cc
> 
> I had some sizeof(long) rather than sizeof(uint64_t) corresponding to WriteU64.
> It unluckily worked on 64 versions of linux, but failed on the Mac's gcc, and
> 32 bit versions of Knoppix and Debian sid.


I should point out that you could also write this:
 uint32_t
  ArchHeader::Deserialize (Buffer::Iterator i)
  {
    Buffer::Iterator start = i;
    m_size    = i.ReadU32 ();
    m_msgid   = i.ReadU64 ();
    m_src     = i.ReadU64 ();
    m_dest    = i.ReadU64 ();
    m_tagchar = i.ReadU8 ();
    return i.GetDistanceFrom (start);
  }

Comment 21 evensky 2008-07-21 11:51:11 UTC
that's much better for the # bytes! I'll roll that in.... Thx.
Comment 22 Rajib Bhattacharjea 2008-07-21 12:40:54 UTC
Can you generate traces for this example?  I.e. does the following work?

./tcp-2way-2 --useDummyBytes=0 --recvSize=600 --preSchedule=1 --useBadTimings=1 --createTraces=1
Comment 23 evensky 2008-07-21 13:18:34 UTC
to document here what was reported on irc: I get a segv if createTraces is turned on.
Comment 24 evensky 2008-07-21 14:45:58 UTC
Created attachment 205 [details]
even more correct version of tcp-2way-2.cc

I fixed a few bugs:

1. made the ofstream ascii static so that it lives long enough for the run.
2. fixed bad usage of int/uint32_t that needed to do the right thing if negative

3. following lacage, made ArchHeader::Deserialize less fragile, but ArchHeader::GetSerializedSize still fragile...
Comment 25 Rajib Bhattacharjea 2008-07-28 16:17:19 UTC
Created attachment 214 [details]
Add some asserts that show the logical inconsitency

These asserts fail when useBadTimings is true, and not otherwise.  So the issue is the management of the tcp rx buffer, namely, the code doesn't handle some cases of overlapping received sequences, it doesn't fragment correctly to assure reliable delivery to the application, it can incorrectly put things into the buffer without enforcing the logic of the sequence numbers of the buffer.

Fix is on the way.
Comment 26 Rajib Bhattacharjea 2008-07-28 17:11:29 UTC
Created attachment 215 [details]
fixes the issue

Added the additional logic to make this particular case work.  I'm not going to push yet because the fix needs to be implemented elsewhere in the tcp code to make sure this doesn't happen again, namely these checks need to be done whenever we insert into the rx buffer.  The patch is for David's immediate use though, for his patience ;-)
Comment 27 evensky 2008-07-28 18:52:11 UTC
I've applied and tested Raj's patch and it works for me. Thx.
Comment 28 Rajib Bhattacharjea 2008-07-29 15:51:42 UTC
Fixed in changeset 3493:53ec8893863d