Bugzilla – Bug 1753
Halting issue with DistributedSimulatorImpl
Last modified: 2013-08-15 14:57:56 UTC
While doing some large parallel ns-3 runs we found conditions that caused the DistributedSimulatorImpl event processing loop to hang. Two related failure modes were observed, and we have implemented a relatively minor code change to address them.

First, in the edge case where an MPI task has no Nodes, the algorithm hangs. While unlikely with manually constructed Node distributions, this can occur when automated partitioning tools are used.

Second, if the latencies on the inter-task P2P links differ, the existing algorithm may hang. Each task's lookAhead value is based on the minimum latency of that task's inter-task P2P links. The granted time window may thus differ across tasks, and in a given iteration of the event loop a Stop event may be scheduled for execution on one MPI task (a task with a larger lookAhead) but not on the others. This results in a hang in the AllGather.

The unifying issue is coordinating all the MPI tasks so that the global AllGathers continue to execute until all tasks are completed. Tasks that are "locally completed" need to continue to participate in the AllGathers.

I will submit a patch for review with Rietveld.

Steve Smith
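As a toy illustration (this is not ns-3 code, and the numbers are made up), the second failure mode comes down to tasks performing different numbers of collective calls. Each event-loop iteration ends in an AllGather; a task whose granted window advances faster (larger lookAhead) reaches the stop time in fewer iterations, stops calling the collective, and leaves its peers blocked. The sketch below counts collective rounds per task, and shows that the fix amounts to keeping every task participating until the slowest one is done:

```python
# Hypothetical sketch of the coordination problem (not ns-3 code).
# Each "task" runs an event loop; every iteration ends with one AllGather.
# A task whose own work is exhausted ("locally completed") must keep
# joining the collective, or the tasks still running block forever.

def allgather_rounds(lookaheads, stop_time, keep_participating):
    """Count AllGather rounds per task; unequal counts model a hang."""
    rounds = []
    for la in lookaheads:
        t, n = 0.0, 0
        while t < stop_time:
            n += 1      # one AllGather per event-loop iteration
            t += la     # granted window advances by this task's lookAhead
        rounds.append(n)
    if keep_participating:
        # Fix: locally-completed tasks keep joining the AllGather
        # until the globally slowest task has also finished.
        rounds = [max(rounds)] * len(rounds)
    return rounds

# Unequal lookAheads -> unequal round counts -> deadlock in AllGather.
buggy = allgather_rounds([1.0, 2.0], stop_time=10.0, keep_participating=False)
fixed = allgather_rounds([1.0, 2.0], stop_time=10.0, keep_participating=True)
print(buggy)  # [10, 5]  -- task 1 stops calling the collective early
print(fixed)  # [10, 10] -- all tasks make the same number of calls
```

In a real MPI program the "keep participating" condition would be established with a global reduction over each task's completion status, so no task exits the loop while any peer still needs the collective.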
Created attachment 1661 [details] Testcase that hangs prior to patch
Pushed patch and updated python bindings.

changeset: 10153:4952c9e4573a

Thanks for reporting the bug and submitting a patch!