Bug 1753 - Halting issue with DistributedSimulatorImpl
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: mpi
Version: ns-3-dev
Hardware: All All
Importance: P5 normal
Assigned To: George Riley
Reported: 2013-08-12 12:25 UTC by Steven Smith
Modified: 2013-08-15 14:57 UTC

Attachments
Testcase that hangs prior to patch (2.39 KB, text/x-c++src)
2013-08-15 14:40 UTC, Brian Swenson

Description Steven Smith 2013-08-12 12:25:44 UTC
While doing some large parallel ns-3 runs, we found conditions that cause the DistributedSimulatorImpl event processing loop to hang. Two related failure modes were observed, and we have implemented a relatively minor code change to address them.

First, in the edge case where an MPI task does not have any Nodes, the algorithm will hang. While unlikely for manually constructed Node distributions, this can occur when automated partitioning tools are used.

Second, if the latencies on the inter-task P2P links differ, the existing algorithm may hang. Each task's lookAhead value is based on the minimum latency of its inter-task P2P links. The granted time window may thus differ across tasks, and in a given iteration of the event loop a Stop event may be scheduled for execution on one MPI task (the task with the larger lookAhead) but not on the others. This results in a hang in the AllGather.
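
To make the second failure mode concrete, here is a small sketch of the arithmetic. It is not the ns-3 code: it assumes each task computes its granted time as the globally smallest next-event time plus its own lookAhead, and that an event is runnable when its timestamp does not exceed the granted time. The function names and the integer time values are hypothetical.

```python
def granted_windows(smallest_time, lookaheads):
    """Granted time per task, assuming granted = smallest_time + lookAhead."""
    return [smallest_time + la for la in lookaheads]


def tasks_executing_stop(stop_time, smallest_time, lookaheads):
    """For each task, whether the Stop event falls inside its granted window."""
    windows = granted_windows(smallest_time, lookaheads)
    return [stop_time <= granted for granted in windows]


# Two tasks whose inter-task links have minimum latencies 1 and 5 (made-up units).
lookaheads = [1, 5]
print(granted_windows(10, lookaheads))           # [11, 15]
print(tasks_executing_stop(12, 10, lookaheads))  # [False, True]
```

Under these assumptions, the task with lookAhead 5 runs its Stop event in this iteration and leaves the loop, while the task with lookAhead 1 goes on to another AllGather that the first task never joins.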

The unifying issue is coordinating all the MPI tasks so that the global AllGathers continue to execute until all tasks have completed. Tasks that are “locally completed” need to continue to participate in the AllGathers.
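
The coordination rule can be illustrated with a toy model that runs without MPI. This is only a sketch of the idea, not the submitted patch: `finish_rounds` and `exit_on_local_finish` are hypothetical names, and each "round" stands in for one AllGather in which every participating task reports whether it is locally completed. A real AllGather blocks until every task calls it, so unequal call counts in the model correspond to a hang.

```python
def allgather_counts(finish_rounds, exit_on_local_finish):
    """Count how many collective calls each task makes before leaving its loop.

    finish_rounds[i] is the round in which task i runs out of local work
    (0 models a task that never had any work).
    """
    n = len(finish_rounds)
    counts = [0] * n
    active = [True] * n  # task i is still inside its event loop
    round_no = 0
    while any(active):
        round_no += 1
        # Every task still in its loop joins this round's collective.
        for i in range(n):
            if active[i]:
                counts[i] += 1
        flags = [round_no >= finish_rounds[i] for i in range(n)]
        all_done = all(flags)
        for i in range(n):
            if not active[i]:
                continue
            if exit_on_local_finish:
                active[i] = not flags[i]   # leave as soon as locally completed
            else:
                active[i] = not all_done   # leave only when every task is done
    return counts


print(allgather_counts([1, 3, 2], exit_on_local_finish=True))   # [1, 3, 2]
print(allgather_counts([1, 3, 2], exit_on_local_finish=False))  # [3, 3, 3]
```

When tasks leave on local completion, the counts differ, which in a real run would leave the slower tasks blocked in a collective. When tasks leave only once all report completion, every task makes the same number of calls. In this model a task with no work at all (finish round 0) also keeps participating until the others are done.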

I will submit a patch for review with Rietveld.

Steve Smith
Comment 1 Brian Swenson 2013-08-15 14:40:22 UTC
Created attachment 1661
Testcase that hangs prior to patch
Comment 2 Brian Swenson 2013-08-15 14:57:04 UTC
pushed patch and updated python bindings
changeset:  10153:4952c9e4573a
Thanks for reporting the bug and submitting a patch!