Bug 793

Summary: Many still-reachable memory leaks detected in the system
Product: ns-3 Reporter: Craig Dowell <craigdo>
Component: helpersAssignee: ns-bugs <ns-bugs>
Status: RESOLVED FIXED    
Severity: blocker CC: faker.moatamri, mathieu.lacage
Priority: P1    
Version: pre-release   
Hardware: All   
OS: All   
Attachments: no creation of a new simulator when calling simulator::cancel or remove or IsExpired
remove remaining memory leaks due to not cleaned Simulator

Description Craig Dowell 2010-01-15 18:12:50 UTC
If the test.py and wutils.py patch from bug 792 (Neither test.py nor waf --valgrind catch all kinds of memory leaks. ) are applied to ns-3-dev, there are seventeen memory leaks found in the system.

A definitely-lost error in routing-aodv-regression has a separate bug filed, and a patch under that bug clears the problem.

The remainder are new, previously not noticed errors.

these will be flagged by test.py (and waf --run --valgrind) once the 792 patch is committed and should be P1 blockers.
Comment 1 Craig Dowell 2010-01-15 18:52:42 UTC
I picked on simple-point-to-point-olsr and took a look at what is happening with these errors.

It seems that Simulator::Destroy is called at the end of the example main () and the SimulatorImpl is destroyed and pimpl is zeroed.

After Simulator::Destroy returns, the locals in main() are torn down.

One of these locals is a node container.  The container starts tearing its stuff down.  Eventually it works its way to something with a routing protocol and drops the reference count.  When the routing protocol reference count drops to zero, it is deleted.  This means that the Timer objects in the olsr routing protocol are deleted.

In ~Timer, a check is made for the CANCEL_ON_DESTROY bit, which is true.  It then calls m_event.Cancel () which means a call to Simulator::GetImpl (), where pimpl is found to be zero and a new simulator impl is created.

Oops.

Trying to figure out what happened between ns-3.6 and now to trigger this ...
Comment 2 Craig Dowell 2010-01-15 20:15:44 UTC
The leak in simple-point-to-point-olsr, at least, seems to have been triggered by a checkin in early to mid November.

This version does not show the leak,

changeset:   5801:86b47c42316c
parent:      5795:d06e2b6e581e
user:        Fabian Mauchle <fabian.mauchle@hsr.ch>
date:        Mon Nov 02 09:48:21 2009 +0100
summary:     Ipv6Extension-Option link

but this does

changeset:   5810:6afb0f719e53
user:        Sebastien Vincent <vincent@clarinet.u-strasbg.fr>
date:        Wed Nov 18 14:27:38 2009 +0100
summary:     Coding style;

That's as far as I can get for now -- I'm completely out of time.
Comment 3 Craig Dowell 2010-01-15 23:06:20 UTC
It looks like the following commit triggered the leak in simple-point-to-point-olsr:

changeset:   5805:db141a163ef3
parent:      5804:8a41c3e4e8c7
parent:      5534:3bc5cbc9431d
user:        Sebastien Vincent <vincent@clarinet.u-strasbg.fr>
date:        Sun Nov 15 11:30:20 2009 +0100
summary:     Merge with ns-3-dev and fix doxygen.

If you look toward the end of the changeset, you'll find,

-static Ptr<SimulatorImpl> *PeekImpl (void)
+static SimulatorImpl **PeekImpl (void)

-  static Ptr<SimulatorImpl> impl = 0;
+  static SimulatorImpl *impl = 0;

So it looks like the Ptr<SimulatorImpl> was being destroyed correctly, but the SimulatorImpl * introduced in this changeset was just being left alone.

I don't know why this change was done, so I don't want to just start changing things.  I'll leave this one with our friends in France to sort out.

To summarize, I took a look at the leak in simle-point-to-point-olsr and found that Simulator::Destroy is called at the end of the example main() function
and the SimulatorImpl is destroyed and pimpl is zeroed there.

After Simulator::Destroy returns, the locals in main() are torn down.  One of these locals is a node container.  The container starts tearing its stuff down.  Eventually it works its way to something with a routing protocol and drops the reference count.  When the routing protocol reference count drops to zero, it is deleted.  This means that the Timer objects in the olsr routing 
protocol are deleted.

In ~Timer, a check is made for the CANCEL_ON_DESTROY bit, which is true.  It
then calls m_event.Cancel() which implies a call to Simulator::GetImpl().  In this call pimpl is found to be zero and a new simulator impl is created, making pimpl point to allocated memory again.

It used to be the case that pimpl was a Ptr, so eventually a destructor was called on that Ptr by the static constructor and destructor code which released the memory.  This no longer happens and the memory is not freed.

This may explain a number of the leaks found in the system now.
Comment 4 Craig Dowell 2010-01-16 00:05:45 UTC
Just to illustrate, the following static destructor hack placed in simulator.cc fixes the issue:

----------
class MagicDestructor {
public:
  MagicDestructor () {};
  ~MagicDestructor () {Simulator::Destroy ();};
};

MagicDestructor magic;
----------

With this and the m_sockets.clear() in ipv4-l3-protocol.cc it seems to clean up most of our problems.
Comment 5 Mathieu Lacage 2010-01-18 05:02:05 UTC
(In reply to comment #3)
> It looks like the following commit triggered the leak in
> simple-point-to-point-olsr:
> 
> changeset:   5805:db141a163ef3
> parent:      5804:8a41c3e4e8c7
> parent:      5534:3bc5cbc9431d
> user:        Sebastien Vincent <vincent@clarinet.u-strasbg.fr>
> date:        Sun Nov 15 11:30:20 2009 +0100
> summary:     Merge with ns-3-dev and fix doxygen.

This is just a merge changeset. The original changeset was:
changeset:   5521:37c6c83d4252
user:        Guillaume Seguin <guillaume@segu.in>
date:        Sat Nov 14 17:47:05 2009 +0100
summary:     Introduce Simulator::ScheduleWithContext* and Simulator::GetContext

And I am the one who put the patch together based on the original patch by guillaume.

> If you look toward the end of the changeset, you'll find,
> 
> -static Ptr<SimulatorImpl> *PeekImpl (void)
> +static SimulatorImpl **PeekImpl (void)
> 
> -  static Ptr<SimulatorImpl> impl = 0;
> +  static SimulatorImpl *impl = 0;
> 
> So it looks like the Ptr<SimulatorImpl> was being destroyed correctly, but the
> SimulatorImpl * introduced in this changeset was just being left alone.
> 
> I don't know why this change was done, so I don't want to just start changing
> things.  I'll leave this one with our friends in France to sort out.

The rationale for this change was twofold:
  - make sure we controlled precisely calls to Ref/Unref to avoid having to make their implementation thread-safe for the threadsafe implementation of ScheduleWithContext
  - make sure we catch the kind of error you describe below which incorrectly creates a dead simulator after Simulator::Destroy.

Sadly, we did not catch the latter because when we made our checkin, there were no users of Timer (actually, no user of CANCEL_ON_DESTROY) and, then, new users of Timer did not catch it when they started using it because test.py -g did not report it.

[snip]

> It used to be the case that pimpl was a Ptr, so eventually a destructor was
> called on that Ptr by the static constructor and destructor code which released
> the memory.  This no longer happens and the memory is not freed.
> 
> This may explain a number of the leaks found in the system now.

Yes. The fix you propose below indeed works but the original intention of the initial change was to avoid this. Instead, I would support making Simulator::Cancel and Simulator::Remove return unconditionally if the simulator impl does not exist yet instead of creating one.

I believe that faker is testing a patch along those lines.
Comment 6 Craig Dowell 2010-01-18 13:19:08 UTC
> The fix you propose below indeed works but the original intention
> of the initial change was to avoid this. 

You know, I'm not really proposing to actually add something to ns-3 when I prefix it with "just to illustrate," call it a "static destructor hack," and give it a name like MagicDestructor.  Sigh.

It's purpose was to *illustrate* that this was indeed the root of all of the new memory leak problems and when it was fixed properly, most or all of the memory leaks would go away.

[ ... ]

> new users of Timer did not catch it when they started using it because
> test.py -g did not report it.

None of our tools reported it.  This is because valgrind didn't consider it an error and I guess nobody thought it important to go beyond what valgrind reported was an important error.

We should probably get more specific about acceptance tests -- what exactly must exist (tests) and what tools should be used to bless the code and how.
Comment 7 Mathieu Lacage 2010-01-19 03:09:12 UTC
(In reply to comment #6)
> > The fix you propose below indeed works but the original intention
> > of the initial change was to avoid this. 
> 
> You know, I'm not really proposing to actually add something to ns-3 when I
> prefix it with "just to illustrate," call it a "static destructor hack," and
> give it a name like MagicDestructor.  Sigh.

yes, yes, I understood it that way and I was not complaining: I just wanted to explain a bit the rationale behind the current behavior so that it is documented somewhere.

> It's purpose was to *illustrate* that this was indeed the root of all of the
> new memory leak problems and when it was fixed properly, most or all of the
> memory leaks would go away.
> 
> [ ... ]
> 
> > new users of Timer did not catch it when they started using it because
> > test.py -g did not report it.
> 
> None of our tools reported it.  This is because valgrind didn't consider it an

I thought that ./waf --regression --valgrind would have reported it.

> error and I guess nobody thought it important to go beyond what valgrind
> reported was an important error.

I was working under the assumption that when I did test.py -g it was going to catch these errors with the same trick we had in ./waf --regression --valgrind. This was not so but it's fixed now, so, hopefully, we don't have to be worried about it anymore.

> We should probably get more specific about acceptance tests -- what exactly
> must exist (tests) and what tools should be used to bless the code and how.

Well, to answer the tools question, I thought that our criterion was that "test.py -g" needs to pass and we enforced that with our nightly builds. 

For the 'tests' question, I don't really know what the answer is. Clearly we don't have enough and we need more and we need to convince people to contribute more tests. I don't know how to do that though.
Comment 8 Faker Moatamri 2010-01-19 09:18:06 UTC
Created attachment 728 [details]
no creation of a new simulator when calling simulator::cancel or remove or IsExpired

no creation of a new simulator when calling simulator::cancel or remove or IsExpired
Comment 9 Faker Moatamri 2010-01-19 09:19:37 UTC
Created attachment 729 [details]
remove remaining memory leaks due to not cleaned Simulator

Simulator is created but not cleaned
Comment 10 Faker Moatamri 2010-01-19 09:20:45 UTC
Those patches should fix the memory leaks, please review, test and comment.
Comment 11 Mathieu Lacage 2010-01-19 10:13:51 UTC
(In reply to comment #10)
> Those patches should fix the memory leaks, please review, test and comment.

both look good for me. +1.
Comment 12 Faker Moatamri 2010-01-20 04:13:12 UTC
Changesets: b89ce2e9eed5 and a0e24e8844da