Bugzilla – Bug 1190
Suppress hello if bcast was sent within the last hello interval
Last modified: 2013-08-16 01:23:46 UTC
RFC 3561 states Every HELLO_INTERVAL milliseconds, the node checks whether it has sent a broadcast (e.g., a RREQ or an appropriate layer 2 message) within the last HELLO_INTERVAL.If it has not, it MAY broadcast a RREP with TTL = 1, called a Hello message. We need to suppress hello if bcast was sent within the last hello interval
agree; not sure whether it will have major impact, but I agree for conformance to fix this
I was investigating why the logic in aodv-routing-protocol.cc made no sense in RecvRequest and SendRequest where it says: if (!m_htimer.IsRunning ()) Since the NOT is there, this will very likely never be true, or if it is, it really would have no purpose. I now discovered that the code existed without the NOT (!) before the patch for bug 1193, and was before that point logical code. I believe the functionality bug 1190 claims does not exist was actually half-way implemented before the patch for bug 1193 was implemented. I see based on the timing of the bugs, that bug 1193 was patched to improve performance more than for conformance and appeared after this bug 1190. So, perhaps the author discovered where the hello was suppressed/rescheduled after this bug 1190, and noticed that functionality was breaking adjacencies and so suggested to disable it. Before patch for bug 1193 we at least reset the hello interval timer. I believe the fix for 1193 was counter to what this bug is hoping to fix. *** But, what is the true root cause of the problem of bug 1193 if we are going to undo what bug 1193 did to restore functionality that bug 1190 claims is not there? *** My guess is that it is more due that the node's who saw the RREQ's didn't refresh their adjacency timers for that node, like stated in RFC: A node MAY determine connectivity by listening for packets from its set of neighbors. If, within the past DELETE_PERIOD, it has received a Hello message from a neighbor, and then for that neighbor does not receive any packets (Hello messages or otherwise) for more than ... ALLOWED_HELLO_LOSS * HELLO_INTERVAL milliseconds, the node SHOULD assume that the link to this neighbor is currently lost. When this happens, the node SHOULD proceed as in Section 6.11. Also, the title of bug 1193 ("Hello timers scheduling skewed when RREP, RREQ are processed"), I think incorrectly documents that RREPs were resetting the hello timer schedule. According to my interpretation of the code and the patch for bug 1193, it was only when sending a nodes own RREQ or forwarding another nodes RREQ that the hello interval timer was reset. Thus, in order for us to fix this bug 1190, we may need an alternate solution to bug 1193. Also, I think the implementation that did exist resets the hello interval instead of choosing not to send a hello packet during that hello interval. The RFC appears to me to state that we simply do not send a hello for the interval, but anyway, I think this may be dangerous because there could be a higher chance for links to go down due to missed hellos. Consider the following: Success Scenario: t0.0 - hello sent by node A per interval of 1 sec t0.0 - hello seen by node B (refresh ALLOWED_HELLO_LOSS * HELLO_INTERVAL timer to 2.0) t0.5 - RREQ sent by A t0.5 - RREQ from A seen by B (refresh ALLOWED_HELLO_LOSS * HELLO_INTERVAL timer to 2.0) t1.0 - hello not sent by A, since RREQ sent (B adj A timer at 1.5) t2.0 - hello sent by A, since no RREQ sent (B adj A timer at 0.5) t2.0 - hello seen by B (refresh ALLOWED_HELLO_LOSS * HELLO_INTERVAL timer to 2.0) Failure Scenario: t0.0 - hello sent by node A per interval of 1 sec t0.0 - hello seen by node B (refresh ALLOWED_HELLO_LOSS * HELLO_INTERVAL timer to 2.0) t0.5 - RREQ sent by A t0.5 - RREQ from A seen by B (refresh ALLOWED_HELLO_LOSS * HELLO_INTERVAL timer to 2.0) t1.0 - hello not sent by A, since RREQ sent (B adj A timer at 1.5) t2.0 - hello sent by A, since no RREQ sent (B adj A timer at 0.5) t2.0 - lets say collision caused loss of hello, B did not see hello from A t2.5 - B marks A as down as timer expired I guess this is more due to the low default allowed hello loss suggested, but the suppression of the hello rather than the resetting of the node's hello interval potentially causes more chance for link failure. If we reset the hello interval, then in the failure scenario presented above there would of been two hellos sent instead of one before the timer expired (it would of been at 1.5 and 2.5 instead of just at 2.0). The second hello cuts it close anyway, since there is no room for any delay (node B fast approaches timer expire by the time node A's second hello makes it to B). So, I say, if we can fix bug 1193 another way, then we can restore the functionality of the RFC mentioned in bug 1190 either by using the same logic already built in (reset hello interval) or we can suppress hello on presence of RREQ (I have the logic in mind for the second method). What say you?
that is not true the Bug1188,90,93 were submitted at the same time due aodv performance issues and have no direct connection with each other when filed. Please describe what you mean by if (!m_htimer.IsRunning ()) Since the NOT is there, this will very likely never be true, or if it is, it really would have no purpose. I now discovered that the code existed without the NOT (!) before the patch for bug 1193, and was before that point logical code. Why can't this true? Without that code, aodv repeatedly cancelled hellos without checking if there was an active hello timer already, and therefore resulting in loss of adjacency. Please also describe the problem you are facing with the current AODV implementation
(In reply to comment #3) > that is not true the Bug1188,90,93 were submitted at the same time due aodv > performance issues and have no direct connection with each other when filed. > 1190 and 1193 seem very related to me. The logic below which appears in two places seems to me to be the functionality in part that was described in 1190 as not existing. As I mentioned in my first note, it did not suppress the hello, but rather reset the hello interval. 909 if (EnableHello) 910 { 911 if (!m_htimer.IsRunning ()) 912 { 913 m_htimer.Cancel (); 914 m_htimer.Schedule (HelloInterval - Time (0.01 * MilliSeconds (m_uniformRandomVariable->GetInteger (0, 10)))); 915 } 916 } 917 } Without the logical not, I read this as the following: If the hello timer is running, then cancel it, and set a new schedule according to the hello interval minus 0-1 ms. > Please describe what you mean by > > if (!m_htimer.IsRunning ()) > > Since the NOT is there, this will very likely never be true, or if it is, it > really would have no purpose. I now discovered that the code existed without > the NOT (!) before the patch for bug 1193, and was before that point logical > code. > > Why can't this true? Because, the logic says the following: If the hello timer is not running, then cancel it, and set a new schedule according to the hello interval minus 0-1 ms. The call to cancel the timer won't do anything, because the timer is not running. See timer.h cancel description: 179 /** 180 * Cancel the currently-running event if there is one. Do nothing 181 * otherwise. 182 */ Also, why are we trying to set a hello timer if it was not running before? No where else in the code is it cancelled without rescheduling except when an interface goes down or address is removed. I didn't see any explanation in bug 1193 in adding that logical not except to stop the rescheduling from happening. So, if I missed something, then I would ask you when it would be true or is supposed to be true? > > Without that code, aodv repeatedly cancelled hellos without checking if there > was an active hello timer already, and therefore resulting in loss of > adjacency. > > Please also describe the problem you are facing with the current AODV > implementation I understand that the patch in 1193 stopped the issue of lost adjacency, but I think that the fix for 1193 stops us from doing what is described in this bug, which is that we are not suppressing hellos when there are recent RREQs: Every HELLO_INTERVAL milliseconds, the node checks whether it has sent a broadcast (e.g., a RREQ or an appropriate layer 2 message) within the last HELLO_INTERVAL.If it has not, it MAY broadcast a RREP with TTL = 1, called a Hello message. So, I was saying, we should go back to 1193 and find a different fix (i.e. make sure presence of RREQs resets the timers) and then either return to resetting the timer or suppressing the hello, whichever is a better and more accurate interpretation of the RFC. The problem I am facing in AODV implementation is different than this issue. I will be opening a bug for it soon, but the stuff I was fixing was just really close to this code in RecvRequest, so I tried to understand what this code was doing at that point. Then I saw these open bugs for it and that it was from 2011, so am seeing what I can do to help.
it is normal practice to Cancel before re-scheduling to absolutely ensure the existence of only one timer. Sure it is redundant many times, but it is safe. The "!" prevents an active hello (hello scheduled to be sent) to go ahead and expire at its scheduled time. Since RREQ can happen at random occasions, the old code, would keep randomly canceling scheduled hellos for long stretches of time. But Bug1193's fix aims to provide the least collateral damage , at the same time address lingering abysmal performance issues prior to it, until other bugs 1188-92 can be addressed the proper way. Now since EnableHello(true) , it is important to keep the default implementation to use Hellos for neighbor adjacency and thus it is important that the active hellos continue to remain untouched by RREQ bcasts. I.e , RREQ should not cancel hellos. RREQ could care less about hellos. Rather, hellos should defer themselves for later, if a recent bcast such as RREQ was sent. The correct implementation in my opinion should have AODV working in two modes, 1. EnableHello == false 2. EnableHello == true We need empirical data to know if we can bravely set EnableHello == false and still have application traffic flowing Therefore I am not convinced that Bug1190 and Bug1193 are directly related. Bug1190 lays the blame squarely at the hellos component for not checking if a bcast was sent. Bug1193, lays the blame on the RREQ component for interfering with the hellos directly. RREQs don't know or care about hellos, but hellos must know about recent RREQs. Just my opinion.
oh wait, I see where you are coming from. They are indeed possibly related. But when I filed all those bugs , they were filed at once, and I hated the fact that SendRequest or ReceiveRequest should have any connection with Hellos. They really should not. Maybe we can automatically address Bug1190, if we allow "SendRequest" to cancel and restart Hellos. But we might still have to leave the "!" part in "ReceiveRequest". I recall pushing the change to both places to be safe as long as "EnableHello(true)" is the default. Further, I would love to see if AODV works with "EnableHello == false" and make EnableHello == false the default. Fix any lingering Hello issues after that. Hoping somebody will volunteer to try it out.
Also note should anybody test potential fixes: RERR and RREQ retry can be broadcasts. All bcasts could set the lastBcastTime. And when hello timers expire they could check lastBcastTime before re-scheduling. In general, no component except the Hello component should cancel/schedule hellos. Places like Send/Recv Req, Send/Recv Error need not be aware of hello timers.
(In reply to comment #5) > it is normal practice to Cancel before re-scheduling to absolutely ensure the > existence of only one timer. Sure it is redundant many times, but it is safe. Understood. I see that done in HelloTimerExpire as well. > The "!" prevents an active hello (hello scheduled to be sent) to go ahead and > expire at its scheduled time. Can you clarify? Why can't we just remove lines 909-917 and 1176-1184? > Now since EnableHello(true) , it is important to keep the default > implementation to use Hellos for neighbor adjacency and thus it is important > that the active hellos continue to remain untouched by RREQ bcasts. > > I.e , RREQ should not cancel hellos. RREQ could care less about hellos. Rather, > hellos should defer themselves for later, if a recent bcast such as RREQ was > sent. > I think I see what you are saying here. But, whether we defer the hello or reset the interval (which we do neither now), the result can be that the neighbor adjacency goes down, since the only way to update that route is with a hello message. We can see this at play in ProcessHello, where the only way to get a neighbor route is to get a hello and the only way to reset the neighbor route (line 1430) is to get a hello message from the neighbor. So, is it that the RFC provides conflicting statements? 6.9. Hello Messages paragraph one says that hellos may not be sent in the presence of RREQ, but in paragraph 3 we are reliant on the presence of a hello for a neighbor route lifetime update, else we lose the route, possibly resulting in some consequences for active routes through the nodes? (In reply to comment #6) > Maybe we can automatically address Bug1190, if we allow "SendRequest" to cancel > and restart Hellos. > > But we might still have to leave the "!" part in "ReceiveRequest". > I recall pushing the change to both places to be safe as long as > "EnableHello(true)" is the default. I say we use your later idea with having a flag or time variable. I was thinking the same thing. > Further, I would love to see if AODV works with "EnableHello == false" and make > EnableHello == false the default. Fix any lingering Hello issues after that. > Hoping somebody will volunteer to try it out. I will check into it's behavior with Hello off. (In reply to comment #7) > Also note should anybody test potential fixes: RERR and RREQ retry can be > broadcasts. All bcasts could set the lastBcastTime. > > And when hello timers expire they could check lastBcastTime before > re-scheduling. In general, no component except the Hello component should > cancel/schedule hellos. Places like Send/Recv Req, Send/Recv Error need not be > aware of hello timers. Are you saying lastBcastTime could be a Time value member variable? Instead, how about a bool variable that RREQ/RERR functions can set, and HelloTimerExpire checks that flag to decide what to do. Are you more in favor of suppression of hello for that interval or rescheduling of the interval?
(In reply to comment #8) > (In reply to comment #5) > > it is normal practice to Cancel before re-scheduling to absolutely ensure the > > existence of only one timer. Sure it is redundant many times, but it is safe. > > Understood. I see that done in HelloTimerExpire as well. > > > The "!" prevents an active hello (hello scheduled to be sent) to go ahead and > > expire at its scheduled time. > > Can you clarify? Why can't we just remove lines 909-917 and 1176-1184? > We can remove all such occurences in places that should not be aware of the existence of HelloTimers at all, such as RecvReq and SendReq, SendErr etc. > > Now since EnableHello(true) , it is important to keep the default > > implementation to use Hellos for neighbor adjacency and thus it is important > > that the active hellos continue to remain untouched by RREQ bcasts. > > > > I.e , RREQ should not cancel hellos. RREQ could care less about hellos. Rather, > > hellos should defer themselves for later, if a recent bcast such as RREQ was > > sent. > > > > I think I see what you are saying here. But, whether we defer the hello or > reset the interval (which we do neither now), the result can be that the > neighbor adjacency goes down, since the only way to update that route is with a > hello message. That is correct, the places that have re-scheduling of hello, should not be aware of hellos. Let them set the m_lastBcastTime and move on. > > We can see this at play in ProcessHello, where the only way to get a neighbor > route is to get a hello and the only way to reset the neighbor route (line > 1430) is to get a hello message from the neighbor. > > So, is it that the RFC provides conflicting statements? 6.9. Hello Messages > paragraph one says that hellos may not be sent in the presence of RREQ, but in > paragraph 3 we are reliant on the presence of a hello for a neighbor route > lifetime update, else we lose the route, possibly resulting in some > consequences for active routes through the nodes? My understanding is that Hellos are optional. Unfortunately nobody can confidently say that "EnableHello == false" as default allows traffic to low. We don't have any solid experimental results to set it to false, .....yet. > (In reply to comment #6) > > Maybe we can automatically address Bug1190, if we allow "SendRequest" to cancel > > and restart Hellos. > > > > But we might still have to leave the "!" part in "ReceiveRequest". > > I recall pushing the change to both places to be safe as long as > > "EnableHello(true)" is the default. > > I say we use your later idea with having a flag or time variable. I was > thinking the same thing. +1 for one 64-bit time variable. > > > Further, I would love to see if AODV works with "EnableHello == false" and make > > EnableHello == false the default. Fix any lingering Hello issues after that. > > Hoping somebody will volunteer to try it out. > > I will check into it's behavior with Hello off. Thank you. We Appreciate your efforts greatly. > > (In reply to comment #7) > > Also note should anybody test potential fixes: RERR and RREQ retry can be > > broadcasts. All bcasts could set the lastBcastTime. > > > > And when hello timers expire they could check lastBcastTime before > > re-scheduling. In general, no component except the Hello component should > > cancel/schedule hellos. Places like Send/Recv Req, Send/Recv Error need not be > > aware of hello timers. > > > Are you saying lastBcastTime could be a Time value member variable? > > Instead, how about a bool variable that RREQ/RERR functions can set, and > HelloTimerExpire checks that flag to decide what to do. Anything works as long as , we can ensure (e.g., a RREQ or an appropriate layer 2 message) within the last HELLO_INTERVAL. But I do anticipate that we will be allowed to change the hello interval/even stop hellos, during simulation, at which point, we don't have traceability to when the last bcast was done. Just something to think about, for now maybe a bool is good enough, if hellos are rescheduled exactly only in one place. > > Are you more in favor of suppression of hello for that interval or rescheduling > of the interval? Wouldn't they be the same thing? Hellos should never stop unless upper layer user event forced it to or no active routes exist
> > Are you more in favor of suppression of hello for that interval or rescheduling > > of the interval? > > Wouldn't they be the same thing? Hellos should never stop unless upper layer > user event forced it to or no active routes exist This is the difference I see. Suppression: 0.0 - Hello sent 1.0 - Hello sent 1.3 - RREQ sent 2.0 - Hello not sent 3.0 - Hello sent 3.5 - RREQ sent 4.0 - Hello not sent 4.1 - RREQ sent 5.0 - Hello not sent 6.0 - Hello sent Rescheduling: 0.0 - Hello sent 1.0 - Hello sent 1.3 - RREQ sent 2.3 - Hello sent 3.3 - Hello sent 3.5 - RREQ sent 4.1 - RREQ sent 5.1 - Hello sent 6.1 - Hello sent The code in RecvRequest and SendRequest uses/used rescheduling of the interval. Whereas, the paragraph in the RFC makes it sound like the hello interval is fixed. Logic: Suppression: HelloTimerExpired if (m_lastBcastTime < now - m_htimer.GetDelay () ) //GetDelay approx = H_I SendHello Cancel timer Schedule timer with HELLO_INTERVAL - tiny time Rescheduling (done in HelloTimerExpired which seems bad): HelloTimerExpired if (m_lastBcastTime < now - m_htimer.GetDelay () ) SendHello Cancel timer Schedule timer with HELLO_INTERVAL - tiny time else Cancel timer Schedule timer with m_lastBcastTime + HELLO_INTERVAL - tiny time
Yes, although they look like constants, A particular mobile node may wish to change certain of the parameters, in particular the NET_DIAMETER, MY_ROUTE_TIMEOUT, ALLOWED_HELLO_LOSS, RREQ_RETRIES, and possibly the HELLO_INTERVAL. it is not clear if AODV has to be restarted if these parameters change. Further, researchers should be allowed to tune these for their papers for adaptive timers based on scenarios. can you explain in the rescheduling part. Schedule timer with m_lastBcastTime + HELLO_INTERVAL - tiny time why do we involve m_lastBcastTime to calculate the next hello timer expiry? if ! (m_lastBcastTime < now - m_htimer.GetDelay () )) it indicates we don't care about m_lastBcastTime anymore Did I miss something? please consider this if ((now - m_lastBcastTime) < Current_Hello_interval) suppress else sendhello, reschedule for new hello after current_hello_interval
(In reply to comment #11) > Yes, although they look like constants, > A particular mobile node > may wish to change certain of the parameters, in particular the > NET_DIAMETER, MY_ROUTE_TIMEOUT, ALLOWED_HELLO_LOSS, RREQ_RETRIES, and > possibly the HELLO_INTERVAL. > it is not clear if AODV has to be restarted if these parameters change. > Further, researchers should be allowed to tune these for their papers for > adaptive timers based on scenarios. Yes, yes, I should-of/meant-to written it using the member variable, HelloInterval. > can you explain in the rescheduling part. > > Schedule timer with m_lastBcastTime + HELLO_INTERVAL - tiny time > > why do we involve m_lastBcastTime to calculate the next hello timer expiry? > > if ! (m_lastBcastTime < now - m_htimer.GetDelay () )) > > it indicates we don't care about m_lastBcastTime anymore > > Did I miss something? If we are going with rescheduling, can't we only reschedule it in HelloTimerExpire, which will only be called if the timer actually expires? If the timer expires and there was a recent bcast, then I involve the last bcast time because we are rescheduling the HelloInterval based on last bcast time like done in RecvRequest and SendRequest, but we are doing it in the expiry function. This is the only way I saw to keep a similar functionality doing it in the expiry function. But, anyway it sounds like you think suppression is the right way. Did my time/event example help illustrate where I am with the difference between the two (suppress vs reschedule the actual interval)? > please consider this > > if ((now - m_lastBcastTime) < Current_Hello_interval) > suppress > else > sendhello, reschedule for new hello after current_hello_interval If we suppress, we need to reschedule in the same manner as if we sent the hello. So, that is why I had that component outside of the conditional statement. Here is a compromise between yours and mine: HelloTimerExpired if ( now - m_lastBcastTime < HelloInterval ) SendHello else nothing reschedule for new hello for HelloInterval - MilliSeconds ( randomVar(0,10) ) I'm testing with Hellos turned off by attribute, and so far things look good testing in the aodv example with some changes. I'm going to run AODV tests, then change the default behavior to disabled, and see if there is a problem. I think I see one code piece that is a problem. I read through the NS-3 testing document just to get more familiar. I am not sure if I can produce a new full test for this change, but I was thinking of adapting the excellent multi-test wifi-adhoc.cc to run a lot of examples, but I guess how do I validate things are working then, right? Any suggestions? I will open a separate bug with all my results, as that seems appropriate. Let me know if not.
Did you mean to say if ( now - m_lastBcastTime < HelloInterval ) nothing else Send hello reschedule for new hello for HelloInterval - MilliSeconds ( randomVar(0,10) ) if yes, then I agree. The manet-routing example is a good validator (scaled to 75 nodes and different rngRun for mobility). The receive rate should be on-par with the current implementation or better. In any case, we need to verify that a randomly selected set of nodes say 5 nodes exhibit the following a. Every hello interval atleast one bcast is done, be it hello, RREQ or RERR
(In reply to comment #13) > Did you mean to say > > if ( now - m_lastBcastTime < HelloInterval ) > nothing > else > Send hello > > reschedule for new hello for HelloInterval - MilliSeconds ( randomVar(0,10) ) > if yes, then I agree. Yes, I did mean the logic you have mentioned. I flipped to a greater than sign and kept SendHello as the action for true in the patch I am actively working on. > The manet-routing example is a good validator (scaled to 75 nodes and different > rngRun for mobility). The receive rate should be on-par with the current > implementation or better. Ok, on the increased nodes. For mobility, do you mean to run different seeds or something to get a few different trials? > In any case, we need to verify that a randomly selected set of nodes say 5 > nodes exhibit the following > a. Every hello interval atleast one bcast is done, be it hello, RREQ or RERR Can do. In example runs of aodv so far, I increased the number of ping apps going on and it looks good.
(In reply to comment #14) > (In reply to comment #13) > > Did you mean to say > > > > if ( now - m_lastBcastTime < HelloInterval ) > > nothing > > else > > Send hello > > > > reschedule for new hello for HelloInterval - MilliSeconds ( randomVar(0,10) ) > > if yes, then I agree. > > Yes, I did mean the logic you have mentioned. I flipped to a greater than sign > and kept SendHello as the action for true in the patch I am actively working > on. > > > The manet-routing example is a good validator (scaled to 75 nodes and different > > rngRun for mobility). The receive rate should be on-par with the current > > implementation or better. > > Ok, on the increased nodes. For mobility, do you mean to run different seeds > or something to get a few different trials? Yes, thats right, random mobility tests for atleast 5 different seeds and compare with the non-patched version. > > In any case, we need to verify that a randomly selected set of nodes say 5 > > nodes exhibit the following > > a. Every hello interval atleast one bcast is done, be it hello, RREQ or RERR > > Can do. In example runs of aodv so far, I increased the number of ping apps > going on and it looks good. thanks
Created attachment 1584 [details] Consolidated patch for 1190, 1188, and 1193
Created attachment 1585 [details] My testcase used to generate performance analysis of patch
Created attachment 1586 [details] AODV Patch Performance Comparison Compared current AODV, proposed patch in this bug, and proposed patch for 1629.
Created attachment 1633 [details] Patch See attached performance comparison charts
Created attachment 1634 [details] Performance comparisons
Created attachment 1638 [details] Diff from bug1190 which might also fix Bug1629 Diff from bug1190 which might also fix Bug1629
author John Abraham <john.abraham.in@gmail.com> Thu, 15 Aug 2013 09:48:40 -0700 (12 hours ago) changeset 10149 03e018548f7a