Bug 556

Summary: DcfManager does not handle AckTimeout properly when there are multiple queues
Product: ns-3 Reporter: Mathieu Lacage <mathieu.lacage>
Component: wifiAssignee: ns-bugs <ns-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: kirillano
Priority: P5    
Version: ns-3-dev   
Hardware: All   
OS: All   
Attachments: the first version of patch
Second version of patch: changed names
Final version of patch
Full patch: bugfix + test-cases
Infinite loop in shedule method example

Description Mathieu Lacage 2009-04-24 08:43:16 UTC
see http://mailman.isi.edu/pipermail/ns-developers/2009-April/005808.html

To quote the original report:

---------------------------------
MacLow sends a packet while ACK timeout has not expired (In 802.11 standard in chapter 9.2.8 was said, that we must wait ack timeout before next transmission).
I have added NS_ASSERT(false) after line, which cancels ACK timeout in CancelAllEvents () method in MacLow, which is called from StartTransmission () and my simulation has failed with tis assert.
When I have looked through output more detailed - I have seen that a packet from another queue (beacon queue) has accessed the medium before ACK timeout of data packet has expired after crash of ACK. I think, that the problem may be resolved by adding a moment of time when AckTimeout expires in DcfManager. Now DcfManager knows nothing about AckTimeout and can grant access before ACK timeout has expired.
---------------------------------

my reply:
---------------------------------
I assume that you reference 9.9.1.3-c.1/2 from 802.11-2007: in the
future, when you are arguing against invalid implementation behavior, it
would be nice to point me to the spec sections rather than make me dig
out the relevant items: it would save you a lot of time to convince me
that what you are describing is a real bug.

To answer your point, yes, I think that you are right and that I was
wrong: the DcfManager appears to be in need of some extra knowledge
about pending transmissions waiting for an ack: I suspect that you will
have to rename MacLowNavListener to MacLowDcfListener and extend it to
report WaitAck and GotAck events. If you do this, you will have to
provide first a new testcase in dcf-manager-test.cc to show that it
fails with the current implementation, and, then, that the new
implementation passes that new testcase.
---------------------------------

the referenced sections of the standard:
---------------------------------
On specific slot boundaries, each EDCAF shall make a determination to perform one and only one of the following functions:
   —    Initiate the transmission of a frame exchange sequence for that access function.

[...]

The specific slot boundaries at which exactly one of these operations shall be performed are defined as follows,
for each EDCAF:

[...]

When any other EDCAF at this STA transmitted a frame requiring acknowledgment, the earlier of
1) The end of the ACK-Timeout interval timed from the PHY_TXEND.confirm primitive, followed by AIFSN[AC] x aSlotTime + aSIFSTime – aRxTxTurnaroundTime of idle medium,    and
2) The end of the first AIFSN[AC] × aSlotTime – aRxTxTurnaroundTime of idle medium after SIFS (not necessarily medium idle during the SIFS duration, the start of the SIFS duration implied by the length in the PLCP header of the previous frame) when a PHY-RXEND.indication primitive occurs as specified in 9.2.8.
---------------------------------
Comment 1 Kirill V. Andreev 2009-04-24 15:22:57 UTC
Created attachment 429 [details]
the first version of patch

I have fixed this bug, but have not written a test for it. I would like to You to make somecomments about it. As soon as I modify dcf-manager-test, I will post another patch
Comment 2 Mathieu Lacage 2009-04-27 03:18:54 UTC
Comment on attachment 429 [details]
the first version of patch

I don't know if the basic idea of adding a couple of extra 'end' variables is correct: if the regression tests pass, then, it probably is fine.

Small details:

This should change should be completely unneeded, right ?

>-          if (tmp > Simulator::Now ())
>+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())

NS_LOG_UNCOND should not be used in the final patch:

>@@ -283,6 +283,7 @@
>   bool oneRunning = false;
>   if (m_normalAckTimeoutEvent.IsRunning ()) 
>     {
>+      NS_LOG_UNCOND(Simulator::Now ());
>       m_normalAckTimeoutEvent.Cancel ();
>       oneRunning = true;
>     }

How about prefixing these method names with "Notify" ? Don't forget to change the name of the type NavListeners+CI too.

>+void
>+MacLow::AckTimeoutStartNow (Time duration)
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->AckTimeoutStart (duration);
>+    }
>+}
>+void
>+MacLow::AckTimeoutResetNow ()
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->AckTimeoutReset ();
>+    }
>+}
>+void
>+MacLow::CtsTimeoutStartNow (Time duration)
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->CtsTimeoutStart (duration);
>+    }
>+}
>+void
>+MacLow::CtsTimeoutResetNow ()
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->CtsTimeoutReset ();
>+    }
>+}

Here: s/NavListeners/DcfListeners/

>-  typedef std::vector<MacLowNavListener *>::const_iterator NavListenersCI;
>-  typedef std::vector<MacLowNavListener *> NavListeners;
>+  typedef std::vector<MacLowDcfListener *>::const_iterator NavListenersCI;
>+  typedef std::vector<MacLowDcfListener *> NavListeners;
>   NavListeners m_navListeners;
Comment 3 Kirill V. Andreev 2009-04-27 06:35:21 UTC
I would like to tell some words about yhe following changes:
>-          if (tmp > Simulator::Now ())
>+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())
When I have not added this, the tmp was grater than Simulator::Now (), but when I have printed the moments of time, the nanoseconds value turned out to be equal. I have observed it because while (1) caused. That was strange a bit(I did not look at the code of Time reailzation and how operator > was implemented).


OK, all variables I will fix and attach another version of patch.
Comment 4 Mathieu Lacage 2009-04-27 06:49:11 UTC
(In reply to comment #3)
> I would like to tell some words about yhe following changes:
> >-          if (tmp > Simulator::Now ())
> >+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())
> When I have not added this, the tmp was grater than Simulator::Now (), but when
> I have printed the moments of time, the nanoseconds value turned out to be
> equal. I have observed it because while (1) caused. That was strange a bit(I

Time stores values with 64.64 precision and GetNanoSeconds throws away the fractional part. It would be nice to know exactly why that behavior is problematic in your use-case.
Comment 5 Kirill V. Andreev 2009-04-28 11:55:52 UTC
Created attachment 433 [details]
Second version of patch: changed names

I have still now way to solve the problem with comparsion between Time. I still have to compare nanoseconds values rather than compare Time. I will think about this problem.
Comment 6 Kirill V. Andreev 2009-04-29 05:50:28 UTC
Well, it seems to me that I know why it does not work without GetNanoSeconds () incomparsion.
When we compare Time and Time - we use the whole 128 bits of data, and when we call Schedule (...), it uses only 64 bits of data (it uses GetTimeStep ()), when it inserts an event.
So. it occurs that we schedule event immediately and catch while(1), but the moments of time we compared before were different, so the following code: (in DcfManager::DoRestartAccessTimeoutIfNeeded)
m_accessTimeout = Simulator::Schedule (expectedBackoffDelay,
                                     &DcfManager::AccessTimeout, this);

shall be executed all the time.
So, as I think, replacing GetNanoSeconds () with GetTimeStap() is true solution
Comment 7 Kirill V. Andreev 2009-04-29 06:04:48 UTC
Created attachment 435 [details]
Final version of patch
Comment 8 Mathieu Lacage 2009-05-02 04:46:28 UTC
(In reply to comment #7)
> Created an attachment (id=435) [details]
> Final version of patch
> 

ok, but as I said before, I won't apply any patch to this code without an associated reduced testcase/patch in dcf-manager-test.cc. 
Comment 9 Mathieu Lacage 2009-05-19 09:42:27 UTC
any news on trying to get a reduced testcase ?
Comment 10 Kirill V. Andreev 2009-05-20 11:47:20 UTC
(In reply to comment #9)
> any news on trying to get a reduced testcase ?
> 

Well, I was away all last week, and I need to pass some exams. But as soon as I can returtn to work with NS, I will write a testcase (I think, I need about a week to do it). Sorry for delay
Comment 11 Kirill V. Andreev 2009-06-04 07:54:31 UTC
Created attachment 455 [details]
Full patch: bugfix + test-cases

Test cases are described in picture in comments. All previous tests are filled with ack-timeout starting and cancelling
Comment 12 Mathieu Lacage 2009-06-08 04:11:32 UTC
(In reply to comment #11)
> Created an attachment (id=455) [details]
> Full patch: bugfix + test-cases
> 
> Test cases are described in picture in comments. All previous tests are filled
> with ack-timeout starting and cancelling
> 

Thank you very much: this looks totally awesome. A couple more comments (the idea of touching that piece of code makes me crazy which is why I am so super careful):

1) ./waf --regression does not pass: I am looking into it but you can beat me to it.

2) I know that you are going to hate me but if I remove your calls to GetTimeStep, "./waf check" still passes ("./waf --regression hangs"). If you really want me to take these 2 lines in the patch, you would have to put together a testcase which fails without these 2 lines so that we understand really why they are here.
Comment 13 Mathieu Lacage 2009-06-08 04:14:48 UTC
(In reply to comment #12)

> 1) ./waf --regression does not pass: I am looking into it but you can beat me
> to it.

here is the change in output:
--- a   2009-06-08 10:13:20.000000000 +0200
+++ b   2009-06-08 10:13:44.000000000 +0200
@@ -3,10 +3,10 @@
 02:00:00.000291 291us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment RA:00:00:00:00:00:04 (oui Ethernet) 
 02:00:00.000369 369us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0) :: Succesful
 02:00:00.000513 513us tsft 6.0 Mb/s 2437 MHz (0x00c0) -50dB signal -66dB noise Acknowledgment RA:00:00:00:00:00:03 (oui Eth
ernet) 
-02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36
.0 48.0 54.0 Mbit]
-02:00:00.000966 966us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment RA:00:00:00:00:00:05 (oui Ethernet) 
-02:00:00.001044 1044us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0) :: Succesful
-02:00:00.001188 1188us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Acknowledgment RA:00:00:00:00:00:03 (oui E
thernet) 
+02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36
.0 48.0 54.0 Mbit]
+02:00:00.000981 981us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment RA:00:00:00:00:00:05 (oui Ethernet) 
+02:00:00.001059 1059us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0) :: Succesful
+02:00:00.001203 1203us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Acknowledgment RA:00:00:00:00:00:03 (oui E
thernet) 
 02:00:00.508948 508948us tsft 6.0 Mb/s 2437 MHz (0x00c0) -48dB signal -101dB noise 00:00:00:00:00:07 (oui Ethernet) > 00:00
:00:00:00:04 (oui Ethernet) SNAP Unnumbered, ui, Flags [Command], length 524
 02:00:00.508964 508964us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment RA:00:00:00:00:00:04 (oui Ethernet) 
 02:00:00.517140 517140us tsft 6.0 Mb/s 2437 MHz (0x00c0) -48dB signal -101dB noise 00:00:00:00:00:07 (oui Ethernet) > 00:00
:00:00:00:04 (oui Ethernet) SNAP Unnumbered, ui, Flags [Command], length 524

Comment 14 Mathieu Lacage 2009-06-08 04:28:24 UTC
This is the only meaningful change: the assoc req is sent exactly 15us later

-02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
+02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
Comment 15 Mathieu Lacage 2009-06-08 05:42:59 UTC
(In reply to comment #14)
> This is the only meaningful change: the assoc req is sent exactly 15us later
> 
> -02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> +02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> 

In both cases, we observe the same scenario: ":05" sends assoc req twice to ":03", first fails with ack timeout because ":03" is busy receiving another req from ":04", second is the one we see here. the difference between the two scenarios is 15.666us.

Comment 16 Kirill V. Andreev 2009-06-08 06:22:13 UTC
Created attachment 457 [details]
Infinite loop in shedule method example

I have written a short example, which repeats inifinite loop situation in wifi. Shedule simply ignores precision less than 1 nanosecond (GetTimeStep -- see line 183 in rc/simulator/default-simulator-impl.cc). Similar sitation was casued when Ack timeout expires.
Comment 17 Mathieu Lacage 2009-06-08 07:17:48 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > This is the only meaningful change: the assoc req is sent exactly 15us later
> > 
> > -02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> > noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> > +02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> > noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> > 
> 
> In both cases, we observe the same scenario: ":05" sends assoc req twice to
> ":03", first fails with ack timeout because ":03" is busy receiving another req
> from ":04", second is the one we see here. the difference between the two
> scenarios is 15.666us.
> 

Ok, this change is indeed compatible with the specification I believe: 15.666 =
9 (slot) + 2*1000 / 300000000 (round trip of 1000m if speed of light is
300000000m/s for max propagation delay) is the extra time which an ack timeout
lasts compared to the nav setting.
Comment 18 Mathieu Lacage 2009-06-08 07:39:23 UTC
changeset b2654e0f071d

The patch I finally pushed gets rid of the GetTimeStep calls and, instead, modifies wifi-mac.cc: the problem was simply that all these values were expected to be rounded to microseconds

diff -r 3135a5c24358 -r b2654e0f071d src/devices/wifi/wifi-mac.cc
--- a/src/devices/wifi/wifi-mac.cc	Mon Jun 08 09:43:47 2009 +0200
+++ b/src/devices/wifi/wifi-mac.cc	Mon Jun 08 13:37:30 2009 +0200
@@ -63,7 +63,7 @@
   */
   Time ctsTimeout = GetDefaultSifs ();
   ctsTimeout += GetDefaultCtsAckDelay ();
-  ctsTimeout += GetDefaultMaxPropagationDelay () * Scalar (2);
+  ctsTimeout += MicroSeconds (GetDefaultMaxPropagationDelay ().GetMicroSeconds () * 2);
   ctsTimeout += GetDefaultSlot ();
   return ctsTimeout;
 }