Bug 2119

Summary: valgrind leaks intermittently reported for fd-net-device dummy-network
Product: ns-3 Reporter: Tom Henderson <tomh>
Component: fd-net-deviceAssignee: alina <aquereilhac>
Status: RESOLVED FIXED    
Severity: normal CC: ns-bugs, pdbarnes
Priority: P5    
Version: pre-release   
Hardware: PC   
OS: Linux   
Attachments: [FdNetDevice] Patch to free buffers with incoming traffic that were not processed at the end of the simulation
[DefaultSimulatorImpl] Patch to free events remaining in the m_eventsWithContext list at the end of the simulation

Description Tom Henderson 2015-05-13 09:32:42 UTC
Some of our valgrind tests on Ubuntu 14.04 64-bit server are lately experiencing intermittent valgrind errors that look like:

==57926== Command: /home/buildslave/jenkins/workspace/Daily-with-valgrind/label/Ubuntu-64-14.04/ns-3-allinone/ns-3-dev/build/src/fd-net-device/examples/ns3-dev-dummy-network-debug
==57926== 
==57926== 
==57926== HEAP SUMMARY:
==57926==     in use at exit: 1,578 bytes in 2 blocks
==57926==   total heap usage: 5,374 allocs, 5,372 frees, 380,585 bytes allocated
==57926== 
==57926== 1,522 bytes in 1 blocks are indirectly lost in loss record 1 of 2
==57926==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==57926==    by 0x4E604CA: ns3::FdNetDeviceFdReader::DoRead() (fd-net-device.cc:63)
==57926==    by 0x6B4C03F: ns3::FdReader::Run() (unix-fd-reader.cc:217)
==57926==    by 0x6B4D046: ns3::MemPtrCallbackImpl<ns3::FdReader*, void (ns3::FdReader::*)(), void, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()() (callback.h:407)
==57926==    by 0x4E6AAFC: ns3::Callback<void, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()() const (callback.h:1092)
==57926==    by 0x6B4A078: ns3::SystemThread::DoRun(void*) (system-thread.cc:84)
==57926==    by 0x7B40181: start_thread (pthread_create.c:312)
==57926==    by 0x7E5047C: clone (clone.S:111)
==57926== 
==57926== 1,578 (56 direct, 1,522 indirect) bytes in 1 blocks are definitely lost in loss record 2 of 2
==57926==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==57926==    by 0x4E6944B: ns3::EventImpl* ns3::MakeEvent<void (ns3::FdNetDevice::*)(unsigned char*, long), ns3::FdNetDevice*, unsigned char*, long>(void (ns3::FdNetDevice::*)(unsigned char*, long), ns3::FdNetDevice*, unsigned char*, long) (make-event.h:390)
==57926==    by 0x4E64E0A: ns3::FdNetDevice::ReceiveCallback(unsigned char*, long) (fd-net-device.cc:302)
==57926==    by 0x4E6EDD9: ns3::MemPtrCallbackImpl<ns3::FdNetDevice*, void (ns3::FdNetDevice::*)(unsigned char*, long), void, unsigned char*, long, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()(unsigned char*, long) (callback.h:422)
==57926==    by 0x6B4C7C0: ns3::Callback<void, unsigned char*, long, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()(unsigned char*, long) const (callback.h:1107)
==57926==    by 0x6B4C089: ns3::FdReader::Run() (unix-fd-reader.cc:227)
==57926==    by 0x6B4D046: ns3::MemPtrCallbackImpl<ns3::FdReader*, void (ns3::FdReader::*)(), void, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()() (callback.h:407)
==57926==    by 0x4E6AAFC: ns3::Callback<void, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()() const (callback.h:1092)
==57926==    by 0x6B4A078: ns3::SystemThread::DoRun(void*) (system-thread.cc:84)
==57926==    by 0x7B40181: start_thread (pthread_create.c:312)
==57926==    by 0x7E5047C: clone (clone.S:111)
==57926== 
==57926== LEAK SUMMARY:
==57926==    definitely lost: 56 bytes in 1 blocks
==57926==    indirectly lost: 1,522 bytes in 1 blocks
==57926==      possibly lost: 0 bytes in 0 blocks
==57926==    still reachable: 0 bytes in 0 blocks
==57926==         suppressed: 0 bytes in 0 blocks


For instance:

https://ns-buildmaster.ee.washington.edu:8010/job/Daily-with-valgrind/822/label=Ubuntu-64-14.04/consoleText


These are intermittent, and there haven't been any code changes recently in this part of the code, so I'm not sure whether this is a local system issue that may resolve over time.  I also repeated this example, on this same host, ten times in a row and could not repeat it.
Comment 1 alina 2015-06-03 02:02:51 UTC
I confirm this intermittent error. I could reproduce it in another Ubuntu 14.04 machine. I did not managed to reproduce it in a Debian machine. I will investigate further.
Comment 2 alina 2015-07-15 11:14:06 UTC
The example dummy-network of the fd-net-device required the SimulatorImplementationType to be set to ns3::RealtimeSimulatorImpl. After adding this change I could no longer reproduce the memory leak. This fix was committed with changeset 4be177372dc4. 

I'll leave the bug report open until the automated valgrind tests on the Ubuntu 14.04 64-bit server confirm the problem is no longer present.
Comment 3 Peter Barnes 2015-07-15 14:21:22 UTC
By changing the SimulatorImplementationType are we just papering over a real issue, presumably in DefaultSimulatorImpl?  Or the threading required for FdNetDevice?
Comment 4 alina 2015-07-15 16:10:43 UTC
In this case, the FdNetDevice requires the RealtimeSimulatorImpl because there are external elements to the simulator that work in real time (such as the socketpair pipes used in the example to communicate the devices). Using the DefaultSimulatorImpl caused the example not to run correctly (e.g. ARP replies and request not synchronized, blocking the ICMP exchange). The non-freed memory might have been a consequence of the simulation not executing properly.
Comment 5 Peter Barnes 2015-07-15 20:20:20 UTC
Thanks for the explanation.
Comment 6 Tom Henderson 2015-07-16 14:04:34 UTC
Alina, it seems now that 'dummy-network' and 'realtime-dummy-network' are essentially identical programs; should 'dummy-network' be removed?  

If not removed, we need to provide dummy-network only when bld.env["ENABLE_REAL_TIME"] is set, because since you committed the patch, it is building but crashing at runtime on OS X with:
assert failed. cond="uid != 0", msg="Assert in TypeId::LookupByName: ns3::RealtimeSimulatorImpl not found", file=../src/core/model/type-id.cc, line=567
libc++abi.dylib: terminating
Comment 7 alina 2015-07-17 10:33:13 UTC
You are right Tom. I will remove the dummy-network example in this case.
Comment 8 alina 2015-07-17 13:28:06 UTC
I checked the other test for the fd-net-device (the fd2fd-onoff example), which also uses the DefaultSimulatorImpl and a socketpair, and it is also producing intermittent memory leaks in my Ubuntu 14.04. It makes sense not to use the DefaultSimulatorImpl when using external elements which might have real-time constraints (e.g. socketpairs), but since despite the memory leaks the fd2fd-onoff example seems to complete correctly, there might be something else going on. I would like to take more time to look into this issue before deciding to remove the dummy-network test. For the time being I will revert my last changeset, so the tests in OS X don't fail, until I have more information.
Comment 9 alina 2015-07-29 14:58:08 UTC
Created attachment 2104 [details]
[FdNetDevice] Patch to free buffers with incoming traffic that were not processed at the end of the simulation
Comment 10 alina 2015-07-29 15:01:25 UTC
Created attachment 2105 [details]
[DefaultSimulatorImpl] Patch to free events remaining in the m_eventsWithContext list at the end of the simulation
Comment 11 alina 2015-07-29 15:09:25 UTC
After a closer inspection, the memory leak problem seems to come from the simulation finishing before all packets are processed (This does not happen when using the RealtimeSimulatorImpl). 

Problem description: The FdNetDeviceFdReader, in the FdNetDevice, reads incoming traffic into buffers (using a dedicated thread), and these buffers are passed as arguments to the ForwardUp callback for processing. For each buffer, a new ForwardUp Event is created and scheduled on the main simulation thread (using the scheduleWithContext method). When using the DefaultSimulatorImpl, it might happen that the simulation ends before all Events are executed, and this causes the memory allocated for the unprocessed Events, and for the corresponding buffers, to remain un-freed at the end of the simulation. There are two separated causes for the memory leak when using the DefaultSimulatorImpl:

1. Buffers allocated with incoming traffic not being freed when Events remain non-executed.

2. Non-executed Events not being freed at the end of the simulation when they are scheduled with ScheduleWithContext.

The two patches attached solve these issues. The first patch (attachment 2104 [details]) replaces the m_pendingReadCount counter, used in the FdNedDevice, by a queue that holds pointers the unprocessed buffers, so that any remaining buffers can be freed in the FdNetDevice destructor. The second patch (attachment 2105 [details]) fixes the DefaultSimulatorImpl so that the DoDispose method destroys any remaining events in the m_eventsWithContext list.

Though these patches should fix the intermittent memory leaks for the dummy-network test, it still remains the issue of whether it makes sense to use the DefaultSimulatorImpl with the FdNetDevice when a socket pair is used. This combination can produce wrong simulation results (e.g. 100% packet loss in the dummy-network example when run with "./test.py --example=dummy-network --grind --retain --verbose" although this does not happen when run with "./waf --run=dummy-network"). If you are ok with these patches, I will proceed and apply them. 

Regarding the dummy-network test (and also the fd2fd-onoff test), I am not sure whether to remove them or to leave them.
Comment 12 Peter Barnes 2015-07-29 16:07:23 UTC
I'm ok with the patch to default-simulator-impl.cc.

An alternative would be to just call ProcessEventsWithContext() at the beginning of DoDispose, which will move all the pending events-with-context to the regular event list.

DefaultSimulatorImpl::DoDispose (void)
{
  NS_LOG_FUNCTION (this);
  ProcessEventsWithContext ();
  while (!m_events->IsEmpty ())
    {
      Scheduler::Event next = m_events->RemoveNext ();
  ...
Comment 13 alina 2015-08-04 16:56:29 UTC
Patches committed (with Peter's suggested modification) in changeset 941beab1b849