Bugzilla – Bug 1762
UE stuck in IDLE_CONNECTING because RRC CONN REQ is not transmitted
Last modified: 2014-09-04 16:32:21 UTC
Created attachment 1671 [details] Test suite to reproduce the problem ./waf --run="test-runner --suite=lte-rrc-conn-establishment"
Created attachment 1672 [details] Illustration of network topology of the test suite
Created attachment 1673 [details] Some logs taken from the test suite simulation
Hi,

The attached patch "bug1762-reproduce.diff" contains a test suite, "lte-test-rrc-conn-establishment.cc", that reproduces the issue. It is an LTE-only simulation whose topology is illustrated in the attached figure "bug1762-reproduce.png". The UE is positioned so that it has a ~30% error rate with its serving cell. The simulation uses the real RRC protocol. The issue is reproduced with RngRun = 3, but not with RngRun = 2.

The attached log file "bug1762-reproduce-log.txt" contains a selection of log lines from the simulation that I think are important. In summary, the chronology of the problem is as follows:

1. Random access is successful.
2. The UE RRC tries to send RRC Connection Request to the serving cell.
3. The eNodeB MAC scheduler allocates a UL Tx opportunity.
4. The eNodeB MAC sends a UL DCI to the UE MAC (but this fails because of the error model).

So the UE doesn't know that it is allowed to send the RRC Connection Request message. For the rest of the simulation, the message stays in the UE's buffer, but the eNodeB no longer allocates UL resources for it. Eventually the eNodeB times out and deletes the UE context. From this point on, the UE is stuck in its IDLE_CONNECTING state.

I'm not sure what the expected behaviour would be. My guess is that the UE should time out, notify the upper layer about the failure, and fall back to the IDLE_CAMPED_NORMALLY state. I can try to fix this bug if we can agree on the expected behaviour.

-budi-
The UL grant for Message 3 (RRC Conn Req) is sent in the RAR, not in the DCI, so I think it is received correctly. My guess is that RRC Conn Req is transmitted but not received correctly. Maybe a look at the PHY TX/RX stats can shed some light on this. I agree that in principle the UE RRC should handle this failure, however I expected that HARQ would have prevented the loss in most cases...
According to the last HARQ model, HARQ is able to correctly decode at the 3rd retransmission up to -16 dB; however, for small packets like RRC ones, it can decode only up to -14 dB for TBs of ~ 100 bits and -12 dB for TBs of 40 bits.
I've checked TS 36.331: the expected behavior is that a timer (T300) is started when RRC CONN REQ is sent, and stopped if RRC CONN SETUP is received. If it expires first, the connection establishment procedure terminates with failure; see section 5.3.3.6.

Budiarto, if you could implement this it would be great! To start with, I'd use a hardcoded value of 100 ms for T300 (the minimum allowed). Strictly speaking, T300 should be passed along with SIB2, but that is currently unsupported in our code.
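The T300 behaviour described in this comment can be sketched outside ns-3 as follows. This is a minimal Python illustration, not the LteUeRrc implementation: the class `UeRrcSketch`, its tiny event queue, and all method names are hypothetical. Only the state names and the 100 ms value are taken from this thread.

```python
import heapq

T300_SECONDS = 0.100  # hardcoded value suggested in this thread


class UeRrcSketch:
    """Hypothetical toy UE RRC with just enough state to illustrate T300."""

    def __init__(self):
        self.state = "IDLE_CAMPED_NORMALLY"
        self.failures_reported = 0  # stands in for notifying the upper layer
        self.now = 0.0
        self._events = []           # min-heap of (time, event_id, callback)
        self._cancelled = set()     # ids of timers that were stopped
        self._next_id = 0
        self._t300_id = None

    def schedule(self, delay, callback):
        """Queue a callback to run `delay` seconds from now; return its id."""
        self._next_id += 1
        heapq.heappush(self._events, (self.now + delay, self._next_id, callback))
        return self._next_id

    def send_connection_request(self):
        # T300 starts when RRC CONN REQ is sent.
        self.state = "IDLE_CONNECTING"
        self._t300_id = self.schedule(T300_SECONDS, self._on_t300_expired)

    def receive_connection_setup(self):
        # T300 stops when RRC CONN SETUP is received.
        if self.state == "IDLE_CONNECTING":
            self._cancelled.add(self._t300_id)
            self.state = "CONNECTED_NORMALLY"

    def _on_t300_expired(self):
        # Establishment failed: report it and fall back to idle.
        self.failures_reported += 1
        self.state = "IDLE_CAMPED_NORMALLY"

    def run_until(self, end_time):
        """Run queued events in time order up to `end_time`."""
        while self._events and self._events[0][0] <= end_time:
            self.now, event_id, callback = heapq.heappop(self._events)
            if event_id not in self._cancelled:
                callback()
        self.now = end_time


# Case 1: RRC CONN SETUP never arrives, so T300 expires.
lost = UeRrcSketch()
lost.send_connection_request()
lost.run_until(1.0)

# Case 2: RRC CONN SETUP arrives after 15 ms (an arbitrary illustrative delay).
ok = UeRrcSketch()
ok.send_connection_request()
ok.schedule(0.015, ok.receive_connection_setup)
ok.run_until(1.0)
```

In the first case the UE ends in IDLE_CAMPED_NORMALLY with one reported failure instead of staying in IDLE_CONNECTING forever; in the second case the stopped timer never fires.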
Um... wait. I'm still not sure that RRC Connection Request is actually transmitted here.

Firstly, is it right that the line "UE 1: UL-CQI notified TxOpportunity..." in the log at +0.0282143s indicates that a UL DCI is received by the UE? This happens after the RAR is received, i.e. at +0.0202143s, and only happens in the simulation with RngRun = 2 (PASS).

I'm also uploading a new attachment, "bug1762-reproduce-phystats.txt". It contains the combined PHY stats from both the RngRun = 2 and RngRun = 3 cases. As shown inside, RngRun = 2 produces a complete set of PHY stats files, which correctly indicate transmissions at TTI 32 and 35 (I think these are RRC Connection Request and RRC Connection Setup). On the other hand, RngRun = 3 generates only one file (DlRsrpSinrStats.txt), which I think implies that RRC Connection Request is not transmitted.
Created attachment 1674 [details] Collection of PHY stats generated by the test suite
I think it's a bug: RRC CONNECTION REQUEST should be sent using the UL GRANT contained in the MAC RAR, not the UL DCI (which seems to be what's actually used). I've filed this as Bug 1763.

The reason why I filed it separately is that, even if RRC CONNECTION REQUEST is TX and RX ok, RRC CONNECTION COMPLETED will be sent over SRB1, hence it relies on the UL DCI. So the test would fail anyway later.

So, for the present bug, maybe we just need to redefine the test vector? If the UE moves out of range before the RRC connection is established, it is expected to fail.
After the fix for Bug 1763, the provided test passes. This is because only the state at the UE RRC is checked (not the eNB RRC, which in fact does not reach CONNECTED_NORMALLY). This is consistent with my previous comment.
I'll return to this bug after GSoC ends.
Created attachment 1787 [details] Proposed fix (T300 implementation and test) ./waf --run="test-runner --suite=lte-rrc-conn-establishment"
Hi Nicola and Marco,

I finally found some time to work on this issue. The latest attachment (bug1762-fix-draft.txt) contains the following:

- implementation of T300 (100 ms by default) in LteUeRrc
- the modified test suite

When T300 expires, the UE RRC notifies NAS that connection establishment has failed and switches the UE to IDLE_CAMPED_NORMALLY. As an alternative behaviour, I've also considered implementing a retry mechanism, but it would be good to hear your opinion first.

The test suite now verifies the UE state on both the UE side and the eNodeB side. One interesting observation is that if the RRC Connection Setup Complete message fails to reach the eNodeB, the UE context will stay in the CONNECTION_SETUP state indefinitely. I couldn't find any specification that says what should be done in this scenario. For example, should the context be removed after a certain timeout? As a temporary measure, I wrote the test to accept CONNECTION_SETUP as an expected state.

If you're happy with this fix, I'll proceed with removing the debugging code and updating the documentation.

Cheers,
-budi-
Hi Budi,

thanks for the patch, and sorry for the very late reply!

I reviewed the patch and I am happy with it, so I would appreciate if you could proceed with cleaning up the debugging code, updating the documentation and pushing everything to ns-3-dev. If by the way you could add a meaningful name string to the TestCase constructor, it would be great!

Tom had announced a planned freeze at the end of this week for the ns-3.20 release. I think it would be nice to include this fix in the release. If you are not available to push the change within this week, I could eventually apply & push it for you.

> When T300 expires, it notifies NAS that connection establishment has failed
> and switch the UE to IDLE_CAMPED_NORMALLY. As an alternative behaviour, I've
> also considered implementing a retry mechanism, but it would be good to hear
> your opinion first.

I think this is not covered by the standard but I would expect that real devices do something similar. Probably many ns-3 users would find this useful.

>
> The test suite now verifies the UE state on both UE side and eNodeB side.
> One interesting observation is that if RRC Connection Setup Complete message
> fails to reach the eNodeB, the UE context will stay in CONNECTION_SETUP
> state indefinitely. I couldn't find any specification that says what should
> be done in this scenario. For example, should the context be removed after
> certain timeout? As a temporary measures, I write the test to accept
> CONNECTION_SETUP as an expected state.

As a temporary measure it's ok, however to stick with the current implementation approach we shall add a timeout. The eNB RRC state machine currently implements many such timeouts
http://www.nsnam.org/docs/models/html/lte-design.html#enb-rrc-state-machine
and IIRC at least some (if not all) of them are not specified by the standard, but rather were added just to make the implementation more robust.
Hi Nicola,

(In reply to Nicola Baldo from comment #15)
> I reviewed the patch and I am happy with it, so I would appreciate if you
> could proceed with cleaning up the debugging code, updating the
> documentation and pushing everything to ns-3-dev. If by the way you could
> add a meaningful name string to the TestCase constructor, it would be great!

Do you mean names such as "transmission failure in RRC Connection Setup message" and "successful establishment, followed by successful reconfiguration"? These names are accurate for now, but they depend heavily on the random error model, so there's no guarantee that they will stay accurate in the future.

Anyway, it's a good idea. Once in a while we'll need to remember to check whether the names are still accurate.

> > When T300 expires, it notifies NAS that connection establishment has failed
> > and switch the UE to IDLE_CAMPED_NORMALLY. As an alternative behaviour, I've
> > also considered implementing a retry mechanism, but it would be good to hear
> > your opinion first.
>
> I think this is not covered by the standard but I would expect that real
> devices do something similar. Probably many ns-3 users would find this
> useful.

Alright, I'll make the UE NAS retry connecting immediately and indefinitely. In addition, I'll also update connection rejection to retry in a similar fashion.

> > The test suite now verifies the UE state on both UE side and eNodeB side.
> > One interesting observation is that if RRC Connection Setup Complete message
> > fails to reach the eNodeB, the UE context will stay in CONNECTION_SETUP
> > state indefinitely. I couldn't find any specification that says what should
> > be done in this scenario. For example, should the context be removed after
> > certain timeout? As a temporary measures, I write the test to accept
> > CONNECTION_SETUP as an expected state.
>
> As a temporary measure it's ok, however to stick with the current
> implementation approach we shall add a timeout. The eNB RRC state machine
> currently implements many such timeouts
> http://www.nsnam.org/docs/models/html/lte-design.html#enb-rrc-state-machine
> and IIRC at least some (if not all) of them are not specified by the
> standard, but rather were added just to make the implementation more robust.

Understood. I'll add a timer for this purpose.

But if I understand correctly, the UE then wouldn't get any resources from the eNodeB, because the bearers on the eNodeB side are still offline. The UE RRC state at this point would be CONNECTED_NORMALLY, so it would become a difficult problem to debug.

It seems that the standard specification relies on RLF to take care of this issue on the UE side (not the RLF caused by physical problems, but the RLF triggered when the maximum number of retransmissions has been reached at RLC). Do we have this mechanism in place?

> Tom had announced a planned freeze at the end of this week for the ns-3.20
> release. I think it would be nice to include this fix in the release. If you
> are not available to push the change within this week, I could eventually
> apply & push it for you.

To be honest, I prefer not to rush to include it in ns-3.20. There are still the retry mechanism, the new timeout at the eNodeB, additional testing, and some documentation updates that I have to do. I'll try to get as much of this done as I can today and let you know again tomorrow.

-budi-
I've implemented the retry mechanism and the eNodeB timer as previously discussed. As a result, an existing test suite, lte-rrc, now fails. I'm afraid I can't fix it before the code freeze.

-budi-
Created attachment 1822 [details] Proposed fix

Hi Nicola,

The previous error in the lte-rrc test suite was due to the UE retrying to connect, so its RRC state didn't stay at IDLE_CAMPED_NORMALLY as the test case had previously expected. I updated the test case by removing this test condition in cases where admitRrcConnectionRequest is false.

Here I have uploaded a proposed patch to be pushed. Some things to note:

- this includes the implementation of T300 (UE RRC), the connection setup timeout (eNodeB RRC), and connection retry in case of timeout or rejection (UE NAS)
- I combined the test cases with the lte-rrc test suite
- I added a new method HasUeManager in LteEnbRrc, which returns true if the UE context with the given RNTI is available

Please let me know if you have any comments on this. I'll proceed with the push at the end of this week at the latest.

Regards,
-budi-
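The retry behaviour summarised in this comment can be illustrated with a small state trace. This is a hedged Python sketch rather than the code in the patch: the function `ue_state_trace` and the outcome labels are hypothetical, and only the RRC state names come from this thread.

```python
def ue_state_trace(attempt_outcomes):
    """Return the UE RRC states visited for a sequence of attempt outcomes.

    Each outcome is one of the hypothetical labels "setup" (RRC Connection
    Setup received), "timeout" (T300 expired) or "rejected".
    """
    trace = ["IDLE_CAMPED_NORMALLY"]
    for outcome in attempt_outcomes:
        trace.append("IDLE_CONNECTING")
        if outcome == "setup":
            trace.append("CONNECTED_NORMALLY")
            break
        if outcome not in ("timeout", "rejected"):
            raise ValueError(f"unknown outcome: {outcome!r}")
        # Failure: fall back to idle, from where the NAS retries at once.
        trace.append("IDLE_CAMPED_NORMALLY")
    return trace


# One rejection and one T300 expiry, followed by a successful attempt.
trace = ue_state_trace(["rejected", "timeout", "setup"])
```

Because every failure is followed by a new attempt, the UE keeps re-entering IDLE_CONNECTING instead of resting in IDLE_CAMPED_NORMALLY, which is consistent with the test condition that had to be removed for the admitRrcConnectionRequest = false cases.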
Created attachment 1826 [details] Proposed fix

I found that the test suite lte-rr-ff-mac-scheduler has failed, specifically at the last test case. This test case uses 15 UEs, and it failed because some of them took more than 100 ms for connection setup. Because of this, the timeout at the eNodeB removed the context of those UEs, leading to the failure.

Increasing the default timeout duration from 100 ms to 150 ms has fixed this issue, as I did in the updated patch. I suspect that a longer timeout duration would be needed in cases where more UEs attempt to connect at the same time.

BTW, I created a Rietveld issue for this bug: https://codereview.appspot.com/90600044/

Cheers,
-budi-
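The effect of the eNodeB connection setup timeout can be illustrated with a toy calculation. This is a sketch under loud assumptions: the per-UE setup durations below are invented purely for illustration and are not measurements from lte-rr-ff-mac-scheduler, and the function name `removed_contexts` is hypothetical.

```python
# HYPOTHETICAL setup durations in ms for 15 UEs, keyed by RNTI. These are
# made-up numbers that merely grow with the RNTI; they are not simulation output.
HYPOTHETICAL_SETUP_MS = {rnti: 40 + 7 * rnti for rnti in range(1, 16)}


def removed_contexts(setup_durations_ms, timeout_ms):
    """Return the RNTIs whose context would be removed by the setup timeout."""
    return sorted(
        rnti
        for rnti, duration in setup_durations_ms.items()
        if duration > timeout_ms
    )


dropped_at_100 = removed_contexts(HYPOTHETICAL_SETUP_MS, 100)
dropped_at_150 = removed_contexts(HYPOTHETICAL_SETUP_MS, 150)
```

With these invented durations, the slower UEs lose their context at 100 ms and none do at 150 ms. The same reasoning suggests why a larger number of simultaneous UEs may need an even longer timeout.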
Hi Budi,

thanks for the great work on this! I am mostly ok with the patch; the only thing that puzzles me is how you address the interference issue, specifically this (from lte-testing.rst):

"In cases where interference is high, we accommodate one retry attempt by the UE, so we double the :math:`d^{ce}` value and then add :math:`d^{si}` on top of it (because the timeout has reset the previously received SIB2)."

At first I did not understand why this was needed. In your scenario, you have interference only on the PDCCH, but not on the PDSCH nor on the PUSCH, as all UEs are attached to a single eNB. Hence, interference can lead to a CQI of 0 (out of range) being generated, but it does not cause actual transmission errors. This puzzled me, so I did some tests with your patch for this test case:

nUes=1, nBearers=1, tConnBase=0, tConnIncrPerUe=0, delayDiscStart=1, ueDistance=0.55, real RRC, admitRrcConnectionRequest = true, failure at RRC Connection Setup, then successful

- it fails with the original delay value
- it passes with the original delay value and with no interfering eNB (thus confirming it's about interference)
- it passes with your new delay value

I enabled PHY tracing and saw a SINR of 0.257078 (-5.8994 dB). This means the UE is considered out-of-range because of interference. This is confirmed by the PF scheduler logging "any UE found" and by the LteAmc logging CQI 0 as soon as CQI feedback is available.

Now, I am not sure why increasing the timer makes the test pass. I thought it would be because the CqiTimerThreshold expires and causes a transmission to be scheduled, which then succeeds because there is no interference on the PDSCH and PUSCH in this test scenario. But the test still passes after increasing CqiTimerThreshold, so it must be some other issue.

Anyway, to conclude, I think the test vector is not correct, in that for such a high level of interference the UE is not expected to get connected when the real RRC is used.
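For reference, the SINR figure quoted in this comment is a linear ratio converted to decibels. The short Python check below reproduces that conversion; the helper name `linear_to_db` is only illustrative.

```python
import math


def linear_to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# Linear SINR value reported from the PHY trace in this comment.
sinr_db = linear_to_db(0.257078)
print(f"SINR = {sinr_db:.2f} dB")
```

The result is about -5.9 dB, in line with the value given in the comment.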
Hi Nicola, As per our discussion in the annual developers meeting, I made the following changes to the tests. The motivation behind the new test cases in lte-rrc suite is to verify the robustness of the RRC connection establishment procedure in handling signalling error. To enforce transmission error of a specific signalling message, the UE teleports to a far and undesirable spot just before that particular signalling message is transmitted. The new test case class inherits from the existing class and then adds this teleporting feature. The signalling messages tested are RRC connection request, RRC connection setup, and RRC connection setup complete. The only problem is in the first case, where the RRC connection request is still transmitted successfully. I tried tuning the timing of the teleport up and down a bit, but it doesn't seem to change anything. Any idea why the message still goes through? The test still passes though. The second and third cases work as expected (the message does not reach the destination, timeout occurs, and the UE retries). Finally, I removed the commented test cases that were labeled as "time consuming". The previous Rietveld issue is updated with the above changes. Please see the file src/lte/test/test-lte-rrc.cc for the changes. Cheers, -budi-
Hi Budi,

thanks a lot for the revised patch! I finally managed to find some time to dig into this issue, see my comments below.

(In reply to Budiarto Herman from comment #21)
> The motivation behind the new test cases in lte-rrc suite is to verify the
> robustness of the RRC connection establishment procedure in handling
> signalling error. To enforce transmission error of a specific signalling
> message, the UE teleports to a far and undesirable spot just before that
> particular signalling message is transmitted. The new test case class
> inherits from the existing class and then adds this teleporting feature.

ok!

>
> The signalling messages tested are RRC connection request, RRC connection
> setup, and RRC connection setup complete. The only problem is in the first
> case, where the RRC connection request is still transmitted successfully. I
> tried tuning the timing of the teleport up and down a bit, but it doesn't
> seem to change anything. Any idea why the message still goes through?

I've tried different timings as well and I confirm that, with your setup, if the RRC connection request message is transmitted, it is always received!

Finally I've found the reason: the UL grant that is used to send the RRC connection request is fixed and defaults to the most robust MCS; CQI is not used. As discussed in #C20, in this scenario interference is present only in the control part and affects only the CQI calculations, but not the actual transmission. Hence, the transmission succeeds.

I propose to fix it by changing the teleport parameter so that, instead of moving to a high interference position, the UE is moved to a really far away location (10000.0, 0.0, 0.0). With this, and a jump away time of 0.0202144, I could achieve the desired behavior: the RRC connection setup is not received by the eNB, and the UE reattempts connection later (and succeeds).

>
> Finally, I removed the commented test cases that were labeled as "time
> consuming".

ok

>
> The previous Rietveld issue is updated with the above changes. Please see
> the file src/lte/test/test-lte-rrc.cc for the changes.

I think with the revised values that I mentioned it's ok to merge! I'd appreciate if you could go ahead and push the change by yourself, if you have the opportunity; if not, just let me know and I'll do it.
Finally I applied the last proposed changes, and also added some further test conditions to make sure that the connection is not established at the time when it is expected not to be. Thanks again Budi for the great work!

changeset: 10850:310b021e4598
user: Budiarto Herman <buherman@gmx.com>
date: Thu Jul 31 14:00:06 2014 +0200
summary: fix Bug 1762 - UE stuck in IDLE_CONNECTING because RRC CONN REQ is not transmitted

changeset: 10851:674e0a46b808
tag: tip
user: Nicola Baldo <nbaldo@cttc.es>
date: Fri Aug 01 14:30:42 2014 +0200
summary: refined fix for Bug 1762
Created attachment 1862 [details] Additional fix (jump away location)

Hi Nicola,

Sorry for being absent for a while, as I was on a long vacation. But thank you for your feedback and changes.

I rechecked the test suite and still found two issues in LteRrcConnectionEstablishmentErrorTestCase. In short, I fixed both issues by pushing the jump away location from (10000, 0, 0) to (100000, 100000, 0). The patch for this simple fix is attached, and I can push it to ns-3-dev if you're okay with it.

A more detailed explanation follows:

> I propose to fix it by changing the teleport parameter so that, instead of moving to a high interference position, the UE is moved to a really far away location (10000.0, 0.0, 0.0). With this, and a jump away time of 0.0202144, I could achieve the desired behavior: the RRC connection setup is not received by the eNB, and the UE reattempts connection later (and succedes).

I assume that we are within the context of case #1 (failure in RRC Connection Request) here, so I assume you meant "RRC Connection Request" instead of "RRC Connection Setup".

I included your changes and re-ran the test case, and found that the above-mentioned behaviour has not been achieved yet, i.e., the RRC connection request is still transmitted successfully even though the UE has jumped far away. I confirmed this by the presence of the "INITIAL_RANDOM_ACCESS --> CONNECTION_SETUP" line in the log. The test case passed anyway because the following RRC Connection Setup failed to transmit (so the UE timed out and attempted reconnection).

The second issue that I found concerns case #3 (failure in RRC Connection Setup Complete). Same as before, I found that the RRC Connection Setup Complete message is transmitted successfully ("CONNECTION_SETUP --> CONNECTED_NORMALLY" line in the log). The CheckNotConnected test unexpectedly passed in this case because the UE context state was CONNECTION_RECONFIGURATION, i.e., regarded as not connected.
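To get a rough feel for how much the new jump away location changes the link budget, the additional path loss can be estimated. This is only a back-of-the-envelope Python sketch under two assumptions that this thread does not confirm: the eNodeB sits at the origin, and propagation follows a free-space (distance-squared) model. The function name is illustrative.

```python
import math


def extra_free_space_loss_db(d_near_m, d_far_m):
    """Extra free-space path loss in dB when moving from d_near_m to d_far_m.

    Assumes received power scales with 1/d^2, so the loss difference is
    20 * log10(d_far / d_near), independent of carrier frequency.
    """
    return 20.0 * math.log10(d_far_m / d_near_m)


# ASSUMPTION: distances are measured from an eNodeB located at (0, 0, 0).
old_distance_m = math.hypot(10000.0, 0.0)       # (10000, 0, 0)
new_distance_m = math.hypot(100000.0, 100000.0)  # (100000, 100000, 0)

extra_loss_db = extra_free_space_loss_db(old_distance_m, new_distance_m)
print(f"extra loss = {extra_loss_db:.1f} dB")
```

Under these assumptions the new location is about 14 times farther away, which adds roughly 23 dB of path loss on top of the earlier jump away position.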
I also want to comment on the following part of the documentation that was removed:

> Ideally, the UE context at the serving eNodeB would have an RRC state of
> CONNECTED_NORMALLY at the end of the procedure. But in the rare case of error
> while transmitting RRC CONNECTION SETUP COMPLETE message, the eNodeB would have
> removed the context because of *connection setup timeout*. A better way to
> handle this error is to make the UE fall back to Idle mode and retry the
> connection, but this behaviour is not yet implemented at the moment.

The text describes the second issue from the previous comment (see also Comment #16). We don't have any mechanism to handle this issue. When it happens, the test simply reports it with NS_LOG_WARN and treats the case as PASS.

So I have a change in mind and propose the following. Since we're not yet able to properly handle the failure of RRC Connection Setup Complete, let's file it as a known issue (bug). The fix for this new bug would be part of the RLF implementation in the future. In addition, let's drop case #3 from the test suite for now.

Is this approach okay for you?

-budi-
Hi Budi, thanks for coming back again on this issue! I am reopening the bug after your comments. I am fine with the new patch, feel free to push it at your earliest convenience. As for case #3 (failure in RRC Connection Setup Complete), I agree with dropping it. In addition to the issues that you commented, I think we forgot that RRC Connection Setup Completed is sent using SRB1, hence it is using RLC AM. This means that 1) it's much less likely to fail, 2) if it fails, the UE shall trigger RLF (not implemented at the moment) and hence yes it's expected that the eNB will have a timeout to remove the UE context. But we don't have all these features; adding the timeout alone at the eNB would leave the UE side inconsistent. So I agree to treat this case as unsupported for now.
Done.

changeset: 10880:838eb3c66675
tag: tip
user: Budiarto Herman <budiarto.herman@magister.fi>
date: Thu Sep 04 23:13:41 2014 +0300
summary: Updated fix for Bug 1762 - UE stuck in IDLE_CONNECTING because RRC CONN REQ is not transmitted

Thank you Nicola and Marco for the discussions.

-budi-