Bug 347

Summary: Traces differ in 32 and 64 bit modes
Product:   nsc                Reporter: Sam Jansen <sam.jansen>
Component: Linux              Assignee: Sam Jansen <sam.jansen>
Status:    RESOLVED WONTFIX
Severity:  normal             CC: gjcarneiro
Priority:  P3
Version:   unspecified
Hardware:  All
OS:        All

Description Sam Jansen 2008-09-13 22:08:22 UTC
Traces for 32-bit and 64-bit Linux differ, whether the simulated stack is 2.6.18 or 2.6.26.

For 2.6.18, port and sequence numbers differ. For both versions, the traces diverge after some amount of simulation.

Additional information from Florian:

"./waf --run 'tcp-nsc-lfn --ns3::OnOffApplication::DataRate=11000'
creates the same trace files on x86 and x86_64.
bumping DataRate to 12000 causes traces to diverge (lfn example uses
the linux 2.6.26 stack)."
Comment 1 Sam Jansen 2008-09-13 22:09:31 UTC
The port/sequence number problem for 2.6.18 is solved; see https://secure.wand.net.nz/mercurial/nsc/rev/d551aea44bf4 -- this has not yet been merged onto the head branch but presumably will be before long.

Investigation continues into the other differences.
Comment 2 Sam Jansen 2008-09-14 01:25:30 UTC
I am not sure whether we will be able to resolve this bug. My debugging so far has found the following:

In the function __alloc_skb (net/core/skbuff.c), the total size of the data allocated for the skbuff (skb->truesize) depends on sizeof(struct sk_buff). On 32-bit systems this is 152 bytes; on 64-bit systems it is 216 bytes. So I can see in gdb that on a 32-bit system an sk_buff is allocated with skb->truesize=1944, whereas on a 64-bit system skb->truesize=2008. In both cases __alloc_skb is called with the same arguments (size=1792).

It is this value, skb->truesize, that is used to account for the amount of memory used by that socket. For example, sk_charge_skb does: "sk->sk_wmem_queued += skb->truesize;"

sk->sk_wmem_queued is then used in important decisions taken in tcp_sendmsg. For example, sk_stream_memory_free is implemented as "return sk->sk_wmem_queued < sk->sk_sndbuf;". When sk_stream_memory_free returns false in tcp_sendmsg, tcp_push() may be called, which naturally sets the PSH flag. This difference in the PSH flag is the first divergence we see in the traces.

This basic difference in memory used per sk_buff will have other repercussions that cause the traces to diverge as well. Basically, in 64-bit mode you fill up your buffers more quickly than in 32-bit mode.
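The accounting described above can be sketched in Python (purely illustrative, not kernel code; the sk_buff sizes and the size=1792 argument are the figures observed in gdb above, while SNDBUF is a hypothetical sk->sk_sndbuf value chosen for illustration):

```python
SIZE = 1792        # payload size passed to __alloc_skb in both modes
SK_BUFF_32 = 152   # sizeof(struct sk_buff) observed on 32-bit
SK_BUFF_64 = 216   # sizeof(struct sk_buff) observed on 64-bit

def skbs_until_full(sk_sndbuf, sk_buff_size):
    """Count skbs queued before sk_stream_memory_free() would return
    false, i.e. before sk_wmem_queued >= sk_sndbuf."""
    truesize = SIZE + sk_buff_size       # skb->truesize, as seen in gdb
    sk_wmem_queued = 0
    count = 0
    while sk_wmem_queued < sk_sndbuf:    # sk_stream_memory_free() test
        sk_wmem_queued += truesize       # sk_charge_skb()
        count += 1
    return count

SNDBUF = 65536  # hypothetical sk->sk_sndbuf
print(skbs_until_full(SNDBUF, SK_BUFF_32))  # 32-bit: truesize = 1944
print(skbs_until_full(SNDBUF, SK_BUFF_64))  # 64-bit: truesize = 2008
```

With these example numbers the 32-bit socket queues 34 skbs before sk_stream_memory_free returns false, while the 64-bit socket queues only 33, so tcp_sendmsg reaches the PSH-setting path at a different point in the stream.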

Looking at include/linux/skbuff.h, this is due to the sk_buff structure containing many pointers; these are all naturally twice the size on 64-bit systems.

I don't see any easy solution to this, though ideas are welcome. At this stage I believe 64-bit NSC to be working correctly. This gives us a bit of a headache for regression testing.

Comment 3 Gustavo J. A. M. Carneiro 2008-09-15 09:39:31 UTC
It sounds like the Linux kernel itself behaves differently on x86 and x86_64. If so, this sounds unsolvable with the current regression testing framework.
Comment 4 Craig Dowell 2008-09-16 12:18:06 UTC
There is no rule that the regression test itself needs to call back into the helper to execute the test and compare the results. We don't need a 1:1 relationship between trace files and tests (at least that used to be the case). Couldn't there be two trace directories, one for 32-bit and one for 64-bit, with the regression test selecting which one to compare against based on the result of:

  import platform
  platform.architecture()

which returns something like:

  ('32bit', 'WindowsPE')
  ('64bit', '')
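A minimal sketch of that selection logic (the directory names here are hypothetical, not part of the actual framework):

```python
import platform

def reference_trace_dir():
    """Pick the reference-trace directory matching the host word size."""
    bits, _linkage = platform.architecture()  # e.g. ('64bit', 'ELF')
    return "traces-64bit" if bits == "64bit" else "traces-32bit"

print(reference_trace_dir())
```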


Comment 5 Sam Jansen 2008-09-17 17:42:55 UTC
So, the end result of this is that traces are meant to differ in 32-bit and 64-bit modes.

See bug 345 for what this means to the ns-3 regression testing of NSC.