Bug 1002

Summary: Segmentation fault in Ipv4L3Protocol::Send() if no route to host
Product: ns-3
Reporter: Bill Roome <wdr>
Component: internet
Assignee: ns-bugs <ns-bugs>
Status: RESOLVED WORKSFORME
Severity: normal
CC: tomh
Priority: P5
Version: ns-3.9
Hardware: All
OS: All
Attachments: Annotated trace output from a run; trace output from a run.

Description Bill Roome 2010-10-05 14:33:45 UTC
Ipv4L3Protocol::Send() can dereference a null pointer. Consider the m_dropTrace() call in the following statement (case 5, at the end of the function):

  if (newRoute)
    {
      ....
    }
  else
    {
      NS_LOG_WARN ("No route to host.  Drop.");
      m_dropTrace (ipHeader, packet, DROP_NO_ROUTE, m_node->GetObject<Ipv4> (), 0);
    }

If there's no route, m_node can be 0, and m_node->GetObject() core dumps. I think the solution is to use a conditional, as in:

      m_dropTrace (ipHeader, packet, DROP_NO_ROUTE,
                   (m_node != 0 ? m_node->GetObject<Ipv4> () : 0), 0);

That works in my simulations.
Comment 1 Tom Henderson 2010-10-05 16:02:24 UTC
(In reply to comment #0)

> If there's no route, m_node can be 0, 

m_node ought not to be 0 regardless of whether there is a route or not.  m_node is set during simulation construction phase when Ipv4L3Protocol is aggregated to the node, so this type of bug would be symptomatic of the simulation not being plumbed together correctly.  Can you post a test case?

In general, we decided a while ago to not aggressively check for pointer validity in our models when there is reason to believe that they should always be valid, so I'd like to understand what is causing this before checking for nonzero m_node in this model everywhere.
Comment 2 Bill Roome 2010-10-06 16:28:24 UTC
I don't have a simple test case, but here's some additional info. The problem happened when I closed the TCP socket. Thanks to a few print statements, I discovered that something called the Ipv4L3Protocol object's DoDispose() method before I called the socket's Close() method. And DoDispose() sets m_node to 0, so Send() croaked.

Then I realized I'd been calling Close() from my app's DoDispose() method, so I moved the Close() to my app's StopApplication(). When I did that, the simulation terminated cleanly, without crashing.

So I guess this is "pilot error." But I think it's a pretty subtle one, and it wasn't a problem in 3.7 and earlier. I'll let y'all decide if it's worth checking for this kind of error. Or maybe you could just spell out the rules about what to do in StopApplication() vs DoDispose().
Comment 3 Tom Henderson 2010-10-06 17:27:03 UTC
(In reply to comment #2)
> I don't have a simple test case, but here's some additional info. The problem
> happened when I closed the TCP socket. Thanks to a few print statements, I
> discovered that something called the Ipv4L3Protocol object's DoDispose() method
> before I called the socket's Close() method. And DoDispose() sets m_node to 0,
> so Send() croaked.
> 
> Then I realized I'd been calling Close() from my app's DoDispose() method, so I
> moved the Close() to my app's StopApplication(). When I did that, the
> simulation terminated cleanly, without crashing.
> 
> So I guess this is "pilot error." But I think it's a pretty subtle one, and it
> wasn't a problem in 3.7 and earlier. I'll let y'all decide if it's worth
> checking for this kind of error. Or maybe you could just spell out the rules
> about what to do in StopApplication() vs DoDispose().

This seems like a bug in the way things are being disposed.  If you could turn on all NS_LOG logging and then attach the snippet of log output (plus your printfs) starting a little bit before the DoDispose() in question (leading up to the crash), it may reveal the problem.
Comment 4 Bill Roome 2010-10-11 15:03:04 UTC
Created attachment 997 [details]
Annotated trace output from a run.

Okay, I ran it again with logging enabled. I've attached the annotated output, complete with a stack trace.
Comment 5 Tom Henderson 2010-10-12 00:09:06 UTC
(In reply to comment #4)
> Created an attachment (id=997) [details]
> Annotated trace output from a run.
> 
> Okay, I ran it again with logging enabled. I've attached the annotated output,
> complete with a stack trace.

I think the problem is here:
	#5  0x01b2fba5 in ns3::TcpSocketImpl::Close ()
	#6  0x01ca6756 in ns3::HttpFetch::Close ()
	#7  0x0000c6ac in ns3::HttpClient2::DoDispose ()

Is your DoDispose() method trying to close sockets?  This triggers the TCP implementation to try to send a segment, but the node has already been partially torn down, since you are at Simulator::Destroy() time.

If you remove the socket close() calls in your application dispose methods and just zero the socket pointer, does it work properly?
Comment 6 Bill Roome 2010-10-12 11:32:03 UTC
Created attachment 998 [details]
trace output from a run.
Comment 7 Bill Roome 2010-10-12 11:32:53 UTC
Yes, I'm calling Socket::Close() from the app's DoDispose method, and when I remove the explicit call to Socket::Close(), and just set the Ptr<Socket> to 0, it terminates cleanly, without crashing.

I've attached the output from that case.

Here's a bit more information on the structure of my app. HttpClient2 is an Application. It contains an instance of HttpFetch. HttpFetch is a plain C++ object, not an ns-3 Object. HttpFetch contains a Ptr<Socket> and manages it. HttpFetch::Close deletes callbacks, closes the socket, and sets the Ptr<Socket> to 0. The HttpFetch d'tor calls Close().

In 3.7 and earlier, my HttpClient2 app called HttpFetch::Close from DoDispose rather than StopApplication. I did that when I was first learning ns3. In retrospect, now that I know more about ns3, I think it does make more sense to call Close from StopApplication rather than DoDispose.
Comment 8 Bill Roome 2010-10-12 13:41:27 UTC
It turns out to be difficult to call Socket::Close() from the app's StopApplication() method. Normally our simulations run until a monitor process decides that it has seen enough. The monitor then calls Simulator::Stop().

In that case, the app's StopApplication() methods never get called.

But applications' d'tors do get called, and those call the d'tors for our HttpFetch objects, and those d'tors in turn call Socket::Close(), and the program core dumps.

My current work-around is to skip the Socket::Close() call if Simulator::IsFinished() returns true. That's very easy, but I worry whether any other gotchas are lurking around.

I'd welcome any suggestions. I suppose our monitor could schedule StopApplication calls for every app, instead of just calling Simulator::Stop(). But I'd rather not go through that much hassle unless it's really necessary.
Comment 9 Tom Henderson 2010-10-13 00:35:46 UTC
(In reply to comment #8)
> It turns out to be difficult to call Socket::Close() from the app's
> StopApplication() method. Normally our simulations run until a monitor process
> decides that it has seen enough. The monitor then calls Simulator::Stop().
> 
> In that case, the app's StopApplication() methods never get called.
> 
> But applications' d'tors do get called, and those call the d'tors for our
> HttpFetch objects, and those d'tors in turn call Socket::Close(), and the
> program core dumps.
> 
> My current work-around is to skip the Socket::Close() call if
> Simulator::IsFinished() returns true. That's very easy, but I worry whether any
> other gotchas are lurking around.
> 
> I'd welcome any suggestions. I suppose our monitor could schedule
> StopApplication calls for every app, instead of just calling Simulator::Stop().
> But I'd rather not go through that much hassle unless it's really necessary.

Do you need the sockets to close, or to call StopApplication() before Simulator::Stop() is called?  Or is it sufficient to just clean up properly and release the memory?  Note that calling Socket::Close() is an action that triggers further (often packet-generating) events, but if you are in the Simulator::Stop() phase, then no further events are allowed.

I would recommend that you do not try to close sockets from your application destructors; just zero the socket pointers if you like.  See how some of the ns-3 applications that use sockets (UdpEcho, OnOffApplication) do it.
Comment 10 Bill Roome 2010-10-15 12:00:53 UTC
I'm calling Close() to clean up, free memory, etc. I was afraid that merely zeroing the pointer wouldn't clean up everything.

However, I allocate and deallocate HttpFetch objects during the course of the simulation, and they connect to and disconnect from HTTP servers. So the HttpFetch d'tor needs to do whatever is necessary to ensure the TCP stack tells the other end that the peer went away, even when invoked in the middle of the simulation.

If zeroing the pointer really does all that, then I'd be happy to drop the Close(). Although if that's the case, I have to wonder why Close() exists at all.
Comment 11 Tom Henderson 2010-10-15 13:50:23 UTC
(In reply to comment #10)
> I'm calling Close() to cleanup, free memory, etc. I was afraid that merely
> zeroing the pointer wouldn't clean up everything.
> 
> However, I allocate and deallocate HttpFetch objects during the course of the
> simulation, and they connect and disconnect to http servers. So the HttpFetch
> d'tor needs to do whatever is necessary to ensure the tcp stack tells the other
> end that the peer went away, even when invoked in the middle of the simulation.
> 
> If zeroing the pointer really does all that, then I'd be happy to drop the
> Close(). Although if that's the case, I have to wonder why Close() exists at
> all.

I think that Close() should be decoupled from destroy and cleanup.  Close() is a modeled event that causes a TCP FIN to be sent; it can normally be called in the course of a simulation.

If you find that you can solve this by disassociating your socket closes from object destructors and DoDispose, can you close this bug report?
Comment 12 Bill Roome 2010-10-18 09:34:23 UTC
Yes, no problem. Sorry, I didn't realize you were waiting for me to close it.