Bug 418

Summary: RttEstimator::RetransmitTimeout can return a negative value
Product: ns-3 Reporter: Mathieu Lacage <mathieu.lacage>
Component: internetAssignee: Rajib Bhattacharjea <raj.b>
Status: RESOLVED FIXED    
Severity: normal CC: craigdo, ns-bugs, tomh
Priority: P1    
Version: pre-release   
Hardware: All   
OS: All   
Attachments: avoid crash
maybe a fix?
don't make negative estimates in rtt

Description Mathieu Lacage 2008-11-24 02:53:12 UTC
Created attachment 314 [details]
avoid crash

When it returns a negative value, the simulator crashes because that value is fed to Simulator::Schedule.
Attached patch makes sure that the returned value is not negative.
Comment 1 Mathieu Lacage 2008-12-10 02:37:42 UTC
again, waiting for approval from raj. reassigning.
Comment 2 Rajib Bhattacharjea 2008-12-11 13:16:50 UTC
The estimate should not be negative ever, and rounding negative values up to zero isn't the right fix.  I'm working on a fix...
Comment 3 Rajib Bhattacharjea 2008-12-11 13:39:39 UTC
Created attachment 332 [details]
maybe a fix?

Will you see if this works for you?

Also, provide a test case?
Comment 4 Mathieu Lacage 2008-12-11 13:52:18 UTC
(In reply to comment #3)
> Created an attachment (id=332) [details]
> maybe a fix?
> 
> Will you see if this works for you?
> 
> Also, provide a test case?

The testcase is coming from the ns-3-simu tree which is broken at the moment so, I can't test this.

Comment 5 Rajib Bhattacharjea 2008-12-12 00:39:35 UTC
I see a path to a simple test case.  It involves lowering the delay of the PointToPointChannel between RTT measurements.  I am not having luck with:

Simulator::Schedule
  (Seconds(7),
   &Config::SetDefault,
   "ns3::PointToPointChannel::Delay",
   TimeValue (MilliSeconds(250)));

Any ideas?
Comment 6 Mathieu Lacage 2008-12-12 02:11:45 UTC
(In reply to comment #5)
> I see a path to a simple test case.  It involves lowering the delay of the
> PointToPointChannel between RTT measurements.  I am not having luck with:
> 
> Simulator::Schedule
>   (Seconds(7),
>    &Config::SetDefault,
>    "ns3::PointToPointChannel::Delay",
>    TimeValue (MilliSeconds(250)));
> 
> Any ideas?

use the loopback interface for the server and client: This is what I have been doing in my ns-3-simu tests

> 

Comment 7 Rajib Bhattacharjea 2008-12-12 15:23:48 UTC
Comment on attachment 332 [details]
maybe a fix?

This patch is no good either.  Its redundant (the absolute value of err is already taken).
Comment 8 Rajib Bhattacharjea 2008-12-13 23:49:08 UTC
In the interest of closing this, I applied a modified version of the patch which makes the minimum RTO value a configurable attribute.  I'd like to investigate this further though; the value being less than zero means there are bigger problems in the logic (i.e. it should never happen).

Mathieu, a concrete test case would be great.  I had no luck with the loopback device.

4023:d320dea20aca
Comment 9 Mathieu Lacage 2008-12-14 11:51:21 UTC
testcase:
http://code.nsnam.org/mathieu/ns-3-simu
cd ns-3-simu
hg backout --merge -r 4379
./waf
./waf --shell
./build/debug/examples/process-bench --Benchmark=tcp --Iterations=1000
assert failed. file=../src/simulator/default-simulator-impl.cc, line=189, cond="tAbsolute >= TimeStep (m_currentTs)"



enjoy
Comment 10 Craig Dowell 2008-12-14 15:07:35 UTC
The problem is that that in RttMeanDeviation::Measurement an estimated RTT is made that is negative.  The variable <est> needs to be clamped to be zero or positive.  You can watch the control loop tick down and go negative as it approaches an estimate of 0.

Code should be more like,

  Time err = m - est;
  est = est + Scalar (gain) * err; // estimated rtt
  if (est.IsNegative ())
    {
      est = Seconds (0.); // don't make negative estimates
    }


Comment 11 Craig Dowell 2008-12-14 15:15:55 UTC
Created attachment 335 [details]
don't make negative estimates in rtt
Comment 12 Rajib Bhattacharjea 2008-12-14 15:21:30 UTC
What's wrong with the patch I applied to ns-3-dev?
Comment 13 Rajib Bhattacharjea 2008-12-14 15:23:43 UTC
@Craig:

The logic implemented is from RFC2988 and a Van Jacobson paper, and the RTO should NEVER return a negative value unless the current estimate, or the measurement, is negative (which should never happen with the correct logic in a causal simulator).  Look carefully at the math.  Regardless of gain [0,1],  the new estimate should be a weighted average of m and the current estimate.

My point is that the fact that RTO<0 at all implies that we are failing to implement the logic correctly.
Comment 14 Craig Dowell 2008-12-14 15:55:22 UTC
> The logic implemented is from RFC2988 and a Van Jacobson paper, and the RTO
> should NEVER return a negative value unless the current estimate, or the
> measurement, is negative (which should never happen with the correct logic 
> in a causal simulator).  Look carefully at the math.  Regardless of gain
> [0,1],  the new estimate should be a weighted average of m and the current
> estimate.

I started by looking at the math, but I didn't immediately see how the estimate could become zero, so I looked in a debugger since I didn't have a great handle on possible numeric errors.  Despite what I seen in the math, in the program the current estimate becomes negative.

In Mathieu's case, every measurement is 0 ns.  What happens when I watch the control loop is that it begins converging toward zero seconds. Close to the edge, you have the estimate finally at a point where it displays 0 ns (perhaps this really means within epsilon of 0).  The error is calculated as m - est which "the math" says should be 0-0 = 0.  This is displayed as -1 ns so there is at least some tiny error present.  This tiny error persists in the calculations, but the variance now begins dropping toward zero.  Eventually 4 times the variance becomes less than est and RetransmitTimeout returns a (timy) negative number.

> My point is that the fact that RTO<0 at all implies that we are failing to
> implement the logic correctly.

Perhaps the logic is fine, but we aren't admitting the possibility of vanishingly small numeric errors?
Comment 15 Craig Dowell 2008-12-14 16:12:33 UTC
That should have read,

  I started by looking at the math, but I didn't immediately see how the
  estimate could become *less than* zero, so I looked in a debugger since I
  didn't have a great handle on possible numeric errors.

Anyway, this is "precisely" the kind of situation where floating point can inflict nasty wounds ...
Comment 16 Craig Dowell 2008-12-14 17:38:30 UTC
I'm holding up ns-3.3-RC4 until a resolution of this issue.  I'd like to get it done by tonight (PST) so our European friends can have at the RC on their Monday morning.

The attached patch does fix the above test case, so are you okay with it Mathieu?

What do you think, Raj?  is it okay to apply this patch or do you want to look deeper?
Comment 17 Rajib Bhattacharjea 2008-12-14 23:14:23 UTC
FYI I am looking deeper.  Will have an answer in a few minutes.
Comment 18 Rajib Bhattacharjea 2008-12-14 23:57:54 UTC
The following changeset, which was applied to ns-3-dev yesterday:
http://code.nsnam.org/ns-3-dev/rev/d320dea20aca

When applied to Mathieu's branch, "fixes" the problem, i.e. there is no segfault.

Instead of pinning the minimum estimate (est) to zero, it pins the minimum RTO to a configurable value, exported through the attributes system (i.e. old BSDs use 1.0s, Linux uses 0.2s, etc)

The fact that 0-0 = -1e-9 is another bug I think?  As far as I can tell, it has something to do with fast and slow values...

As it stands, the changeset checked-in to ns-3-dev fixes the issue.  We need to understand the rounding / precision issue separately and file another bug report.

I think this bug is okay to close.  Craig?
Comment 19 Craig Dowell 2008-12-15 00:13:21 UTC
I'm fine with clamping the RTO at some minimum value > 0 -- makes sense.  

If the RTO is pinned to some positive minimum value I don't see how this can avoid fixing the problem; but I'd like to let Mathieu close the bug if he considers the issue resolved to avoid any further close/repoen games.
Comment 20 Rajib Bhattacharjea 2008-12-15 00:23:07 UTC
Works for me.  The token has been passed to Mathieu.
Comment 21 Mathieu Lacage 2008-12-15 08:37:23 UTC
The only thing I really care about is that the segfault is fixed in my testcase: how it is fixed is not something I can comment on because I did not try to look at the issue very much. From a high-level perspective, this looked like a classical example of a floating point substraction rounding error where you substract 2 values which are almost equal but not exactly equal and the result is basically undefined unless you perform a post-processing explicit rounding/clamping.

To summarize, I have no idea where the problem really came from or if your fix is correct. I am just a bug reporter.

FYI, I have no access to a test machine right now so, I don't even know if your patch fixes the segfault but I assume that you have tested this already.

Comment 22 Rajib Bhattacharjea 2008-12-15 11:24:29 UTC
(In reply to comment #21)
> To summarize, I have no idea where the problem really came from or if your fix
> is correct. I am just a bug reporter.
> 
> FYI, I have no access to a test machine right now so, I don't even know if your
> patch fixes the segfault but I assume that you have tested this already.
> 

Yes, it was tested.  It fixes your segfault.