Bug 2247 - Incorrect TCP timeouts during Fast Recovery
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
Version: ns-3-dev
Hardware: Mac Intel Mac OS
Importance: P5 major
Assigned To: natale.patriciello
Depends on:
Blocks:
Reported: 2015-12-14 13:17 UTC by l.salameh
Modified: 2016-02-22 11:08 UTC
CC: 4 users



Attachments
Example script for tcp timeouts. (9.52 KB, text/x-csrc)
2015-12-14 13:17 UTC, l.salameh
Time Sequence plot after running the example. (82.96 KB, image/jpeg)
2015-12-14 13:18 UTC, l.salameh
Modified example script, larger link layer queue. (9.50 KB, text/x-csrc)
2015-12-14 14:01 UTC, l.salameh
Time sequence plot of modified script. (105.90 KB, image/jpeg)
2015-12-14 14:02 UTC, l.salameh
test-tcp-parameters.cc (7.52 KB, text/x-csrc)
2016-01-10 16:27 UTC, l.salameh
Proposed patch for dupacks and timeouts. (1.25 KB, text/plain)
2016-01-13 12:33 UTC, l.salameh
Slightly edited patch (2.29 KB, patch)
2016-01-19 10:06 UTC, natale.patriciello
Limited transmit also for the CA_OPEN->CA_REORDER transition (31.93 KB, patch)
2016-01-25 05:13 UTC, natale.patriciello

Description l.salameh 2015-12-14 13:17:58 UTC
Created attachment 2206 [details]
Example script for tcp timeouts.

Run the attached script to see this bug in action (sorry, I should have cleaned the example up a bit).

Consider the following scenario:

[client] -- [switch] -- [server]

The client -> switch link has half the data rate of the switch -> server link. We size the link layer queues so that we expect drops on the switch -> client link.

See attached time sequence plot for the result of running the example.

When we get drops and dup acks, we go into fast retransmit. As NewReno recovers one dropped packet per RTT, the first partial ack resets the retransmit timer, but the subsequent ones do not. I don't think this behaviour is correct; it is causing the timeout shown in the time sequence plot, resulting in the retransmission of a data segment the sender has only just transmitted.

The code in question is in tcp-socket-base.cc, TcpSocketBase::ReceivedAck

if (tcpHeader.GetAckNumber () < m_recover)
  {
    if (m_isFirstPartialAck)
      {
        m_isFirstPartialAck = false;
      }
    else
      {
        resetRTO = false;
      }
  }

...

NewAck (tcpHeader.GetAckNumber (), resetRTO);


NewAck resets the RTO only if the flag passed to it is true. In fast recovery, when receiving partial acks, the RTO is never reset after the first new ack, and we get timeouts.
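The partial-ACK handling in ReceivedAck can be sketched as a standalone predicate (the names are illustrative and this is not the exact ns-3 code path, just a model of the logic described above):

```cpp
#include <cstdint>

// Decide whether an arriving ACK should reset the retransmit timer.
// During fast recovery, only the first partial ACK (an ACK below the
// recovery point) resets the timer; later partial ACKs do not.
bool ShouldResetRto (uint32_t ackNumber, uint32_t recover,
                     bool &isFirstPartialAck)
{
  bool resetRto = true;
  if (ackNumber < recover)          // partial ACK: recovery not complete
    {
      if (isFirstPartialAck)
        {
          isFirstPartialAck = false;
        }
      else
        {
          resetRto = false;         // later partial ACKs leave the timer running
        }
    }
  return resetRto;                  // full ACKs (>= recover) always reset
}
```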
Comment 1 l.salameh 2015-12-14 13:18:55 UTC
Created attachment 2207 [details]
Time Sequence plot after running the example.
Comment 2 Tom Henderson 2015-12-14 13:40:09 UTC
The policy for resetting the retransmit timeout in RFC 6582 is

       For the first partial ACK that arrives during fast recovery, also
       reset the retransmit timer.  Timer management is discussed in
       more detail in Section 4.

In previous versions of the RFC, there were some policy choices (so-called "impatient" vs "slow-but-steady" policies) on this timer reset:

https://tools.ietf.org/html/rfc3782#section-4

but one thing that was done when RFC 6582 went standards track was to pick from among the policy options in earlier versions of the RFC, and the Impatient variant was chosen for this policy.

So I think that the current implementation is conformant.  Could you please review in light of this and let us know whether you think there is still a bug?
Comment 3 l.salameh 2015-12-14 14:01:08 UTC
Created attachment 2208 [details]
Modified example script, larger link layer queue.
Comment 4 l.salameh 2015-12-14 14:02:00 UTC
Created attachment 2209 [details]
Time sequence plot of modified script.
Comment 5 l.salameh 2015-12-14 14:14:12 UTC
I see. Yes, you are right, the implementation should go to slow start after the first partial ack in the "impatient" policy.

But I still seem to have a problem ... I've attached a modified script and its output time sequence plot. The change I've made is to increase the link layer queue size (larger than the bandwidth-delay product) and make the on-off application send a large amount of data immediately.

If you take a look at the time sequence plot, it seems that TCP is behaving very strangely. On receiving a retransmit timeout while in fast recovery, after the first retransmit, it seems like TCP attempts to go into slow start again as per the RFC. But instead, the implementation dumps a really large window (the vertical line of x's in the time sequence plot) onto the network.

This was the original bug I encountered before I tried to narrow it down. I thought the culprit was the RTO during fast retransmit. Now, I'm not sure why this is happening. I'm hoping this bug is reproducible with the script I've provided.
Comment 6 l.salameh 2015-12-14 14:34:27 UTC
Ok, due to the large buffers, there are still duplicate acks in flight, received after the initial RTO. These are causing the TCP state machine to go right back into fast retransmit after the timeout, even though it probably should continue slow start. And somewhere along the line, I'm not sure how yet, the congestion window value is set to something large and incorrect.
Comment 7 natale.patriciello 2015-12-14 16:05:44 UTC
Hi, have you considered cwnd inflation? For each duplicate ACK, in particular conditions, the cwnd is artificially inflated to allow a segment to be transmitted. This is why you see such a spike.

By enabling debug messages, you should be able to see what's happening inside; this could help. Tomorrow I will try the script.
Comment 8 l.salameh 2015-12-14 17:04:21 UTC
Hi, yes but cwnd inflation should not change the number of packets in flight. After you halve the window on fast retransmit, suppose you have a cwnd = 4, even if you inflate the cwnd to be larger than 4 to continue the flow of data as duplicate acks come back, the number of packets in flight should still be 4.

The graph I sent does not exhibit a normal cwnd inflation. In fact, it seems that the window size is set incorrectly after TCP attempts to recover from fast retransmit.

I will try to debug this further and let you know what I find.
Comment 9 l.salameh 2015-12-15 08:47:39 UTC
Ok, so this is the order of events that are happening which are affecting cwnd:

1. After TCP hits its first batch of losses, we get 3 dup acks, and go into fast recovery. cwnd = ssThresh = BytesInFlight/2.

2. As we continue to receive dup acks in fast recovery, cwnd is inflated by a segment size each time. Note that the actual bytes in flight do not change, and should equal ssThresh. We receive partial acks that deflate the cwnd.

3. Because we have many losses, eventually the retransmit timeout will fire in this impatient scheme. When the timeout fires, we reset the cwnd = segmentsize.

4. The setup I have in the script has a large RTT, and a large queue at the client. Therefore, even though we've gone back to slow start, we are still receiving Dup acks from the fast recovery phase.

5. This TCP implementation goes back to the fast recovery phase when it receives these dup ACKs, even though it should probably continue with slow start.

6. On entering the fast recovery phase again, TCP attempts to set
cwnd = ssThresh = BytesInFlight/2. Now, the implementation of TcpSocketBase::BytesInFlight is as follows:

uint32_t
TcpSocketBase::BytesInFlight () const
{
  NS_LOG_FUNCTION (this);
  return m_highTxMark.Get () - m_txBuffer->HeadSequence ();
}

Recall that we had just exited fast recovery where the high tx mark had been artificially inflated to allow TCP to progress.

Therefore, the BytesInFlight() value is in fact INCORRECT. It is reporting an inflated window, when in fact our cwnd = segmentsize, because we had just timed out.

This is the reason we see the incorrect large spike in packets.


To sum up, the problem is twofold:

1. It seems like we need an extra TcpSocketState value, which indicates going into slow start after we get an RTO in the fast retransmit state. While in this state, TCP will ignore dup ACKs and continue slow start. It will only exit the state when the ack number received is greater than the m_recover value recorded at the beginning of the fast retransmit state.

2. The BytesInFlight() function should actually report the REAL number of bytes in flight, rather than use the tx buffer for this calculation. It will report an incorrect number when we have window inflation and therefore will cause many problems other than the one reported here. In fact, in the past I had encountered a problem where consecutive fast recovery instances were setting the cwnd incorrectly just because of this artifact.
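Point 2 can be illustrated with a toy model (an illustrative struct, not the actual ns-3 classes): lost bytes have left the network, but the tx-buffer formula still counts them, so the two ways of counting diverge exactly when losses occur.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of the two ways to count bytes in flight.  Lost segments are
// no longer in the network, so a counter of sent-but-not-yet-acked-or-lost
// bytes diverges from the naive highTxMark - headSequence formula.  In a
// real stack the sender can only infer losses, e.g. via a SACK scoreboard
// and the Pipe() algorithm of RFC 6675.
struct FlightModel
{
  uint32_t highTxMark = 0;   // highest sequence transmitted so far
  uint32_t headSequence = 0; // first unacknowledged byte
  uint32_t inNetwork = 0;    // bytes actually outstanding in the network

  void Send (uint32_t bytes) { highTxMark += bytes; inNetwork += bytes; }
  void Ack (uint32_t bytes)
  {
    headSequence += bytes;
    inNetwork -= std::min (inNetwork, bytes);
  }
  void Lose (uint32_t bytes) { inNetwork -= std::min (inNetwork, bytes); }

  uint32_t NaiveBytesInFlight () const { return highTxMark - headSequence; }
  uint32_t RealBytesInFlight () const { return inNetwork; }
};
```

After sending 10000 bytes and losing 4000 of them, the naive formula still reports 10000 bytes in flight while only 6000 are actually in the network, which is the inflated value described above.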
Comment 10 natale.patriciello 2015-12-15 15:57:47 UTC
Really nice considerations. We need to remove the inflation; it creates many problems and corner cases. I think the steps should involve the addition of a scoreboard. I have two other bugs to take care of, but I am happy about this, because following GSoC many problems have been investigated; in fact, they were present in older versions, but were practically impossible to debug.

Anyway, the scoreboard is present in every implementation, and it will be useful both for this bug and in the future for SACK.
Comment 11 Tom Henderson 2015-12-21 23:04:59 UTC
(In reply to l.salameh from comment #9)
> Ok, so this is the order of events that are happening which are affecting
> cwnd:
> 
> 1. After TCP hits its first batch of losses, we get 3 dup acks, and go into
> fast recovery. cwnd = ssThresh = BytesInFlight/2.
> 
> 2. As we continue to receive dup acks in fast recovery, cwnd is inflated by
> a segment size each time. Note that the actual bytes in flight do not
> change, and should equal to sshThresh. We receive partial acks that deflate
> the cwnd.
> 
> 3. Because we have many losses, eventually the retransmit timeout will fire
> in this impatient scheme. When the timeout fires, we reset the cwnd =
> segmentsize.
> 
> 4. The setup I have in the script has a large RTT, and a large queue at the
> client. Therefore, even though we've gone back to slow start, we are still
> receiving Dup acks from the fast recovery phase.
> 
> 5. This TCP implementation goes back to the fast recovery phase when it
> receives these Dup Acks, event though it should probably continue with slow
> start.
> 
> 6. On entering the fast recovery phase again, TCP attempts to set 
> cwnd = ssThresh = BytesInFlight/2. Now, the implementation of
> TCPSocketBase::BytesInFlight is as follows:
> 
> uint32_t
> TcpSocketBase::BytesInFlight () const
> {
>   NS_LOG_FUNCTION (this);
>   return m_highTxMark.Get () - m_txBuffer->HeadSequence ();
> }
> 
> Recall that we had just exited fast recovery where the high tx mark had been
> artificially inflated to allow TCP to progress.
> 
> Therefore, the BytesInFlight() value is in fact INCORRECT. It is reporting
> an inflated window, when in fact our cwnd = segmentsize, because we had just
> timed out.
> 
> This is the reason we see the incorrect large spike in packets.
> 
> 
> To sum up, the problem is twofold:
> 
> 1. It seems like we need an extra TCPSocketState, which indicates going into
> slow start after we get an RTO in the fast retransmit state. While in this
> state, TCP will ignore Dup acks, and continue slow start. It will only exit
> the state when the ack number received is greater than the m_recover value
> recorded at the beginning of the fast retransmit state. 

This is covered by the specification in RFC6582

       When the third duplicate ACK is received, the TCP sender first
       checks the value of recover to see if the Cumulative
       Acknowledgment field covers more than recover.

So we need to maintain the recover variable through the timeout event and check this before entering fast retransmit again.

I suspect it is not checking because line 1393 of tcp-socket-base.cc is:

          else if (m_dupAckCount == m_retxThresh)


> 
> 2. The BytesInFlight() function should actually report the REAL number of
> bytes in flight, rather than use the tx buffer for this calculation. It will
> report an incorrect number when we have window inflation and therefore will
> cause many problems other than the one reported here. In fact, in the past I
> had encountered a problem where consecutive fast recovery instances were
> setting the cwnd incorrectly just because of this artifact.

For this, the scoreboard/Pipe() of RFC 6675 
https://tools.ietf.org/html/rfc6675
may be something to work towards.
Comment 12 l.salameh 2016-01-06 16:50:19 UTC
Just wanted to point out that this bug is a major blocker for TCP, and should be treated as such, given the upcoming release of ns-3.25. I understand that TCP in ns-3 has undergone a major redesign, but it seems like it needs to go through more thorough testing before it can be ready to use.
Comment 13 natale.patriciello 2016-01-08 12:04:54 UTC
Hi, I'm aware of the importance of this bug.

I'm working to provide a list of transmitted segments, in order to provide a more reliable value of BytesInFlight () and to almost "fix" this bug for the ns-3.25 release.

Then, I'd like to remove inflation and deflation, to make the cWnd graph smoother.

If you want to help, we can coordinate.

nat
Comment 14 natale.patriciello 2016-01-09 12:46:52 UTC
(In reply to l.salameh from comment #9)
> 2. The BytesInFlight() function should actually report the REAL number of
> bytes in flight, rather than use the tx buffer for this calculation. It will
> report an incorrect number when we have window inflation and therefore will
> cause many problems other than the one reported here. In fact, in the past I
> had encountered a problem where consecutive fast recovery instances were
> setting the cwnd incorrectly just because of this artifact.

Ok, for this we have a fix.

I rewrote the TcpTxBuffer in order to be able to count the number of bytes not yet acknowledged nor discarded. This equals m_highTxMark.Get () - m_txBuffer->HeadSequence () when there are no retransmissions, but it reports the correct value (as you pointed out) when retransmissions come into play.

BytesInFlight in TcpTxBuffer returns the number of bytes requested by TcpSocketBase but not yet discarded.

Please, can you test this commit [1]? And remove the assert, of course. For example, with that assert, I discovered that the old BytesInFlight does overflow in some situations (e.g. ns3-tcp-test-state). You can safely use the tcp-sack branch (despite the name, it does not have SACK).

[1] https://github.com/kronat/ns-3-dev-git/commit/0de8dbe533cf003d6b47290d99cd2a03eaee2720
Comment 15 natale.patriciello 2016-01-09 13:21:24 UTC
(In reply to l.salameh from comment #9)

> 5. This TCP implementation goes back to the fast recovery phase when it
> receives these Dup Acks, event though it should probably continue with slow
> start.

I noted this only now. IMHO this is incorrect, since if we are in CA_LOSS, dupacks are ignored (see ReceivedAck, starting from line 1391). If we are in CA_OPEN, it means we have received a cumulative ACK, which moves the SND.UNA pointer forward. Then, if there are dupacks and we enter CA_RECOVERY, the bytes in flight are calculated starting from the new SND.UNA.

What do you think?
Comment 16 l.salameh 2016-01-10 16:26:15 UTC
Hi Natale,

I checked out the tcp-sack branch and ran the patched code with my original script, and I'm still seeing the same behavior when I plot the time sequence plots, with the large vertical spike of packets. I realized that the original script I attached to this bug doesn't actually compile, so I've added a fixed version. I used matlab to plot the time sequence graphs from the output trace file, and I would be happy to supply you with those if you need them.

Unfortunately, I'm a bit busy the next couple of days with phd stuff, but I'll be back towards the end of the week, and I would be glad to collaborate with you on fixing this bug. As it stands, I only ran the script with the new code, and haven't had the chance to investigate why the bug is still happening.

As for your remark below, if you run the new test script I've attached to this bug, and add break points in the ReceivedAck() function, and whenever the state changes to CA_OPEN/CA_RECOVERY etc. you will see the problem in action.

Under a low RTT network, the code will work correctly, and the next cumulative ACK should correctly set the state back to CA_OPEN during a CA_LOSS. But in the script I provided, we have high RTT and a delayed feedback of a sort. During the fast recovery phase, we had inflated the window and retransmitted some packets, so we should get some Dup Acks and a cumulative ACK for the correctly retransmitted packets. Unfortunately, these arrive AFTER we have timed out, which messes up the state machine. 

Under these conditions, even though we've timed out and we're starting to retransmit all the packets, we immediately go into fast recovery again because the state transitions are:

CA_RECOVERY -> CA_LOSS (on timeout) -> CA_OPEN (on receiving a Cumulative ACK from a retransmitted packet from the previous fast recovery phase)  -> CA_RECOVERY (on receiving 3 dup acks from the previous fast recovery phase)

Maybe this is the correct behavior, but somehow it seems wrong to me, since we immediately abort slow start and go into fast recovery again. I should probably check the RFCs.

In the meantime, you can download the fixed script named "test-tcp-parameters.cc", and confirm that even with the fix, you can reproduce the large buffer and incorrect state bugs on your end.

> > 5. This TCP implementation goes back to the fast recovery phase when it
> > receives these Dup Acks, event though it should probably continue with slow
> > start.
> 
> I noted this only now. IMHO is incorrect, since if we are in CA_LOSS,
> dupacks are ignored (see ReceivedAck, starting from line 1391). If we are in
> CA_OPEN, it means we had received a cumulative ACK, which moves forward the
> SND.UNA pointer. Then, if there are dupacks and we enter in CA_RECOVERY, the
> bytesInFlight are calculated starting from the new SND.UNA.
> 
> What do you think?
Comment 17 l.salameh 2016-01-10 16:27:42 UTC
Created attachment 2225 [details]
test-tcp-parameters.cc
Comment 18 natale.patriciello 2016-01-12 05:18:51 UTC
Ah-ha! I got it.

The most immediate, easy and reasonable fix is to ignore dupacks that report a timestamp option echo value less than the timestamp of the first retransmission. However, this does not work without timestamps... hmm.

Fixing bug 2068 could help with this, since we are currently underestimating the RTT in cases of losses. I'm concentrating now on correcting 2068, hoping that this will alleviate the situation.

Stay tuned.
Comment 19 Tom Henderson 2016-01-12 09:57:54 UTC
(In reply to natale.patriciello from comment #18)
> Ah-ha! I got it.
> 
> The most immediate, easy and reasonable fix is to ignore dupacks which
> reports a timestamp option echo value which is less than the timestamp of
> the first retransmission. However this does not work without timestamp.. mmm.
> 


See my comment #11 above.  This is the purpose of the "recover" variable in the RFC, to suppress fast retransmission/recovery immediately after a timeout.

I suggest to try fixing this by maintaining a recover variable rather than try to avoid it with timestamps or other heuristics.
Comment 20 l.salameh 2016-01-13 12:00:04 UTC
After investigating the fixes to the TcpTxBuffer in the tcp-sack branch, the problem still persists.

The issue here is that the TcpTxBuffer's m_sentSize, which is the value returned by BytesInFlight() in this fix, behaves as follows:

1. It's increased by the segment size each time we send a segment.
2. Decreased by the bytes acknowledged by a full ack.

When we are in fast recovery and have inflation, we inflate this number and do not decrease it accordingly when we get duplicate acks.

The problem is therefore appearing when we have a first period of fast recovery, followed immediately by a second one. 

I propose the following fix, which was taken from ns-2. In tcp-socket-base.cc, ReceivedAck(), line 1416, 

replace :

m_tcb->m_ssThresh = m_congestionControl->GetSsThresh (m_tcb, BytesInFlight ());


with:

uint32_t flightSize = std::min(BytesInFlight (), m_tcb->m_cWnd.Get());

m_tcb->m_ssThresh = m_congestionControl->GetSsThresh (m_tcb, flightSize);

This guarantees that we never attempt to set the ssThresh to the inflated value reported by BytesInFlight() the second time we go into fast recovery. I've run my test script with this fix, and the problem seems to have been resolved. Obviously this should be tested more thoroughly.

The fix is just a workaround: the better solution would be to use the scoreboard/pipe mechanism of SACK as Tom has suggested.
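A minimal sketch of the clamp above (illustrative helpers; the actual GetSsThresh in ns-3 depends on the congestion control module, NewReno-style halving is shown here only for demonstration):

```cpp
#include <algorithm>
#include <cstdint>

// ns-2-style workaround: bound the possibly inflated BytesInFlight()
// by the current cwnd before deriving ssthresh from it.
uint32_t ClampedFlightSize (uint32_t bytesInFlight, uint32_t cWnd)
{
  return std::min (bytesInFlight, cWnd);
}

// Models the usual NewReno rule ssthresh = max(flight / 2, 2 * MSS);
// the real computation lives in the congestion control's GetSsThresh.
uint32_t NewRenoSsThresh (uint32_t flightSize, uint32_t segSize)
{
  return std::max (flightSize / 2, 2 * segSize);
}
```

With an inflated flight size of 140000 bytes and a cwnd of 8000 bytes, the clamp yields an ssthresh of 4000 instead of 70000, avoiding the oversized window on the second fast recovery.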
Comment 21 l.salameh 2016-01-13 12:33:57 UTC
Created attachment 2235 [details]
Proposed patch for dupacks and timeouts.
Comment 22 l.salameh 2016-01-13 12:39:04 UTC
I have a quick patch for the timeouts problem. I only tested it with my problematic script, but thought I would put it here for you guys to check. See the most recent file upload.

In short it does what is in RFC 6582 (I hope). It adds

m_recover = m_highTxMark;

to TcpSocketBase's ReTxTimeout () method, and checks m_recover when we get 3 duplicate ACKs:

if ((m_dupAckCount == m_retxThresh) && ((m_highRxAckMark - 1) > m_recover))


It also removes the assert that checked that we moved to a new state when we get 3 duplicate ACKs.
Comment 23 natale.patriciello 2016-01-19 10:06:58 UTC
Created attachment 2242 [details]
Slightly edited patch

I've updated the patch to remain in limited transmit if m_dupAckCount > 3 but we have not entered fast recovery.

The patch seems to follow the RFC, but I think the RFC goes further than it needs to.

This is because, when the RTO expires, m_highTxMark could be as high as we inflated the window. In your script, the RTO is for 87121, and we set m_recover to 142297. We don't enter fast recovery anymore until 142297 is acked.

Anyway, this seems to conform to the RFC. We miss the ACK heuristic to infer losses, but that should be a new feature request.

If you are ok with this patch, I'll push.
Comment 24 l.salameh 2016-01-19 11:09:34 UTC
Hi Natale,

Yes, the edited patch looks good and is working correctly with my code. You can go ahead and push it if it passes the TCP tests and Tom has no objections. 

> The patch seems to follow the RFC, but I think the RFC is going extremely
> over on what it is supposed to do.
> 
> This because, when the RTO expires, the m_highTxMark could be as high as we
> inflated the window. In your script, RTO is for 87121, and we set m_recover
> to 142297. We don't enter fast recovery anymore until 142297 is acked.
> 

This is actually the behaviour we want, because we really want to recover the lost flight of packets using the timeout retransmission we just invoked, so we should probably not try to go into fast retransmit again :).
 
We still need to deal with the problem of the incorrect CWND which is set when we get two consecutive fast retransmit scenarios. See my comment #20.

Cheers,
Lynne

(In reply to natale.patriciello from comment #23)
> Created attachment 2242 [details]
> Slightly edited patch
> 
> I've updated the patch to remain in limited transmit if m_dupAck > 3 but we
> not entered in fast recovery.
> 
> The patch seems to follow the RFC, but I think the RFC is going extremely
> over on what it is supposed to do.
> 
> This because, when the RTO expires, the m_highTxMark could be as high as we
> inflated the window. In your script, RTO is for 87121, and we set m_recover
> to 142297. We don't enter fast recovery anymore until 142297 is acked.
> 
> Anyway, this seems to conform to RFC. We misses the ACK Heuristic to infer
> losses, but this should be a new feature request.
> 
> If you are ok with this patch, I'll push.
Comment 25 Tom Henderson 2016-01-19 13:02:34 UTC
(In reply to natale.patriciello from comment #23)
> Created attachment 2242 [details]
> Slightly edited patch
> 
> I've updated the patch to remain in limited transmit if m_dupAck > 3 but we
> not entered in fast recovery.
> 
> The patch seems to follow the RFC, but I think the RFC is going extremely
> over on what it is supposed to do.

The reason that it suppresses further fast retransmissions is that some retransmissions after a timeout will cause dupacks because they are retransmissions of segments that have already been received.  If you repeatedly re-enter fast retransmit due to many holes in the receiver buffer, your ssthresh will be whittled down to almost nothing.  In high bandwidth-delay product networks, the TCP flow is somewhat crippled as it must rebuild window only through congestion avoidance.  So the cost of not detecting future true fast retransmissions is balanced by the benefit of preserving ssthresh that is not too small.  Plus as you mention, it is possible to add some heuristics to detect new fast retransmissions and the RFC allows such heuristics.
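The whittling effect can be put in numbers with a small sketch (illustrative function; real behaviour depends on the congestion control in use): each spurious re-entry into fast recovery halves ssthresh again, floored at two segments.

```cpp
#include <algorithm>
#include <cstdint>

// After k consecutive (spurious) fast-recovery entries, ssthresh has been
// halved k times: roughly flight / 2^k, floored at 2 segments.
// Suppressing re-entry until "recover" is acked avoids this collapse.
uint32_t SsthreshAfterReentries (uint32_t flight, int reentries,
                                 uint32_t segSize)
{
  uint32_t ssth = flight;
  for (int i = 0; i < reentries; ++i)
    {
      ssth = std::max (ssth / 2, 2 * segSize);
    }
  return ssth;
}
```

Starting from a 64000-byte flight with 1000-byte segments, four re-entries already bring ssthresh down to 4000 bytes, after which the window can only be rebuilt through congestion avoidance.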

> 
> This because, when the RTO expires, the m_highTxMark could be as high as we
> inflated the window. In your script, RTO is for 87121, and we set m_recover
> to 142297. We don't enter fast recovery anymore until 142297 is acked.
> 
> Anyway, this seems to conform to RFC. We misses the ACK Heuristic to infer
> losses, but this should be a new feature request.
> 
> If you are ok with this patch, I'll push.

The patch looks OK to me.

Also, can we add a test case testing proper values of ssthresh and cwnd after such event?
Comment 26 natale.patriciello 2016-01-21 05:49:11 UTC
Ok, I understand all the points now.

BTW, reading and re-reading the limited transmit RFC, I'm not sure about this:
(In reply to natale.patriciello from comment #23)
> I've updated the patch to remain in limited transmit if m_dupAck > 3 but we
> not entered in fast recovery.

Is that correct, or does limited transmit apply only to the first 2 dupacks? The RFC says:

   This document does not specify the sender's response to duplicate
   ACKs when the fast retransmit/fast recovery algorithm is not invoked.
   This is addressed in other documents, such as those describing the
   Limited Transmit procedure [RFC3042].

But in limited transmit, the RFC talks about the first 2 dupacks only.
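Under the narrow RFC 3042 reading (first two dupacks only), the eligibility test would be a sketch like the following (member-style names chosen to mirror ns-3, but this is not the committed code, and the thread has not settled which reading is right):

```cpp
#include <cstdint>

// RFC 3042 limited transmit, narrow reading: send one new segment for
// each duplicate ACK received before fast retransmit triggers, i.e. the
// first retxThresh - 1 dupacks (normally the first two, with the
// standard threshold of three).
bool LimitedTransmitApplies (uint32_t dupAckCount, uint32_t retxThresh)
{
  return dupAckCount > 0 && dupAckCount < retxThresh;
}
```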

> Also, can we add a test case testing proper values of ssthresh and cwnd
> after such event?


You mean the fact that, after an RTO, we don't enter fast retransmission again until m_recover is acked? Because the values of cWnd and ssThresh for fast retransmission and RTO are already checked.
Comment 27 natale.patriciello 2016-01-22 10:38:34 UTC
pushed 11832:02e355cbbbf5

Thank you all
Comment 28 Tom Henderson 2016-01-22 12:17:06 UTC
(In reply to natale.patriciello from comment #26)
> Ok, I understand all the point now.
> 
> BTW, reading and re-reading the limited transit RFC I'm not sure about that:
> (In reply to natale.patriciello from comment #23)
> > I've updated the patch to remain in limited transmit if m_dupAck > 3 but we
> > not entered in fast recovery.
> 
> Is that correct, or limited transmit only applies to the first 2 dupack ?
> RFC says:
> 
>    This document does not specify the sender's response to duplicate
>    ACKs when the fast retransmit/fast recovery algorithm is not invoked.
>    This is addressed in other documents, such as those describing the
>    Limited Transmit procedure [RFC3042].
> 
> But in limited transmit, RFC talks about the first 2 dupAck only.

Sorry for late response.

I think that the way you have it now is OK (in the case that multiple dupacks are received in the period in which fast recovery is suppressed).

However, could you please check the logic for limited transmit?  It seems to me that you do not invoke the limited transmit upon the first dupack since you are in CA_OPEN at that time.

> 
> > Also, can we add a test case testing proper values of ssthresh and cwnd
> > after such event?
> 
> 
> You mean the fact that, after an RTO, we don't enter fast retransmission
> again until m_recover is acked ? Because the value of cWnd and ssTh for Fast
> Retransmission and RTO are already checked.

It is probably OK as is; the NS_LOG statements cover all of these variable settings at the state transitions.
Comment 29 natale.patriciello 2016-01-23 12:57:04 UTC
(In reply to Tom Henderson from comment #28)
> However, could you please check the logic for limited transmit?  It seems to
> me that you do not invoke the limited transmit upon the first dupack since
> you are in CA_OPEN at that time.


You are right; I'm now checking this patch:

diff -r 1862e8f860e2 src/internet/model/tcp-socket-base.cc
--- a/src/internet/model/tcp-socket-base.cc	Fri Jan 22 11:18:58 2016 -0800
+++ b/src/internet/model/tcp-socket-base.cc	Sat Jan 23 18:54:45 2016 +0100
@@ -1415,6 +1415,14 @@
                          m_dupAckCount << " dup ACKs");
           m_tcb->m_congState = TcpSocketState::CA_DISORDER;
 
+          if (m_limitedTx && m_txBuffer->SizeFromSequence (m_nextTxSequence) > 0)
+            {
+              // RFC3042 Limited transmit: Send a new packet for each duplicated ACK before fast retransmit
+              NS_LOG_INFO ("Limited transmit");
+              uint32_t sz = SendDataPacket (m_nextTxSequence, m_tcb->m_segmentSize, true);
+              m_nextTxSequence += sz;
+            }
+
           NS_LOG_DEBUG ("OPEN -> DISORDER");
         }
       else if (m_tcb->m_congState == TcpSocketState::CA_DISORDER)
@@ -1436,7 +1444,7 @@
                            m_tcb->m_ssThresh << " at fast recovery seqnum " << m_recover);
               DoRetransmit ();
             }
-          else if (m_limitedTx)
+          else if (m_limitedTx && m_txBuffer->SizeFromSequence (m_nextTxSequence) > 0)
             {
               // RFC3042 Limited transmit: Send a new packet for each duplicated ACK before fast retransmit
               NS_LOG_INFO ("Limited transmit");

but it breaks fast-retr-test (because now we transmit 3 segments instead of 2, and the test was balancing dupacks and delayed acks manually). Working on it.
Comment 30 natale.patriciello 2016-01-25 05:13:33 UTC
Created attachment 2253 [details]
Limited transmit also for the CA_OPEN->CA_REORDER transition

This patch inserts limited transmit also in that case. Test case updated.
Comment 31 natale.patriciello 2016-02-05 10:51:32 UTC
Fixed in 11850 and 11843. I probably assigned it wrongly: I am deeply sorry for that; it will not happen in the future.
Comment 32 l.salameh 2016-02-10 07:58:20 UTC
Thanks Natale! Are there any plans to fix the BytesInFlight() bug? Or are we assuming that will be fixed when SACK becomes available?
Comment 33 l.salameh 2016-02-21 21:04:38 UTC
Unfortunately I have to reopen this bug. It seems the actual value of m_highTxMark is not what we expected (the maximum transmitted sequence number); rather, it is the maximum transmitted sequence number plus the size of that segment, i.e., in the function SendPendingData we have:

// Update highTxMark
  m_highTxMark = std::max (seq + sz, m_highTxMark.Get ());

RFC 6582 assumes recover points to the highest sequence number transmitted, not the highest expected ack number. 

As a result, I am currently running into a bug where we are ignoring duplicate acks immediately after we recover from a timeout even though we should go into fast recovery.

The fix is in TcpSocketBase::ReceivedAck:


       else if (m_tcb->m_congState == TcpSocketState::CA_DISORDER)
         {
-          if ((m_dupAckCount == m_retxThresh) && ((m_highRxAckMark - 1) > m_recover))
+          if ((m_dupAckCount == m_retxThresh) && (m_highRxAckMark >= m_recover))
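The off-by-one can be seen with concrete numbers (illustrative helpers mirroring the two conditions in the diff above; not the committed code):

```cpp
#include <cstdint>

// m_highTxMark is seq + sz, i.e. one past the last transmitted byte, so
// "recover" taken from it is also one past the end.  An ACK that covers
// the whole recovery window therefore carries ackNumber == recover.
bool OldCondition (uint32_t highRxAckMark, uint32_t recover)
{
  return (highRxAckMark - 1) > recover;   // as originally committed
}

bool FixedCondition (uint32_t highRxAckMark, uint32_t recover)
{
  return highRxAckMark >= recover;        // proposed fix
}
```

With recover = 10000 (one past the last byte sent), an ACK acknowledging everything carries number 10000: the old condition stays false and dupacks keep being ignored, while the fixed condition correctly re-enables fast recovery; a partial ACK such as 9001 is still suppressed by both.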
Comment 34 natale.patriciello 2016-02-22 09:20:21 UTC
(In reply to l.salameh from comment #33)
> The fix is in TcpSocketBase::ReceivedAck:
> 
> 
>        else if (m_tcb->m_congState == TcpSocketState::CA_DISORDER)
>          {
> -          if ((m_dupAckCount == m_retxThresh) && ((m_highRxAckMark - 1) >
> m_recover))
> +          if ((m_dupAckCount == m_retxThresh) && (m_highRxAckMark >=
> m_recover))


We have spent a lot of time discussing this. Without a test case I'm hesitant to push this change, since in the near future we may need to revert it again.

Anyway, thinking in terms of a round segment size (e.g. 1000):

* if m_recover = 10000 (the 10th segment) and the highest Rx ack points to 9001, both conditions are false;
* if m_recover = 10000 and the highest Rx ack points to 10001, one is false and the other is true. In this case, we should go into fast recovery, right? For me yes, and your fix is correct.
Comment 35 natale.patriciello 2016-02-22 11:08:17 UTC
Fixed in 11894