Bug 933

Summary: Flushing ostream and files on abnormal program exit (ASSERT, ABORT and FATAL_ERROR)
Product: ns-3 Reporter: Quincy Tse <quincy.tse>
Component: coreAssignee: Quincy Tse <quincy.tse>
Status: RESOLVED FIXED    
Severity: enhancement CC: antti.makela, craigdo, gjcarneiro, mathieu.lacage, ns-bugs, tomh
Priority: P5    
Version: ns-3-dev   
Hardware: All   
OS: All   
Bug Depends on: 931    
Bug Blocks:    
Attachments: Stream-flushing handler, not as prone to bad ostream pointers
Patch to flush pcap streams
Changes (mostly) as requested.

Description Quincy Tse 2010-06-03 01:19:24 UTC
+++ This bug was initially created as a clone of Bug #931 +++

Just taking the discussion and implementation of the stream-flushing part of #931 off to a new bug.
Comment 1 Quincy Tse 2010-06-03 01:26:50 UTC
Created attachment 907 [details]
Stream-flushing handler, not as prone to bad ostream pointers

Similar to attachment #906 [details] to Bug #931, except I've also added code to trap SIGSEGV when the list of ostream* are being dereferenced. The handler loops back to the original handler to flush the next stream in the list. There is a risk of stack problem if the list of ostream* is long, and many of them are bad. If we can assume that most people are good and unregister "expired" ostreams, then it should be fine.
Comment 2 Tom Henderson 2010-06-03 02:18:15 UTC
Carrying over from bug 931.
> 
> > but wouldn't it be simpler to just
> > have the user toggle std::ios_base::unitbuf (or explicit flush()) on all trace
> > streams when the user runs into a problematic program?
> 
> That's one option (I'm not sure about the extra IO cost - I haven't measured
> it). The problem here would be that many people would just rely on streams that
> are opened behind the scene (eg. ascii/pcap traces, etc). I wouldn't expect an
> NS3 user would go around and tinker with someone elses' code just to get the
> traces out.

Yes, I would suggest that the user API would be similar to how we enable checksums (see GlobalValue g_checksumEnabled in node.cc) so that users don't have to edit the streams opened behind the scenes.

> 
> > Also, wouldn't that
> > option also take care of the case in which memory leaks cause traces not to be
> > written upon exit?  (unless missing traces are useful as a crude leak
> > detector...)
> 
> By memory leak, I assume you're referring to the case when destructor was not
> called? (eg. printing summary when some statistic class was destroyed) Turning
> off write buffer can't help with things that have never been put into the
> stream yet (ie the case I'm referring to). Also, if the program crashes,
> destructors won't normally be called anyway.

when there is a leak on (non-crashing) exit, references aren't all cleared and streams aren't flushed, with the result that pcap or ascii files are truncated or empty
Comment 3 Quincy Tse 2010-06-03 02:26:58 UTC
Should have done this earlier... Just carrying over some relevant comments from Bug #931;

  -------  Comment #10 From  Mathieu Lacage   2010-06-01 05:01:37 EDT   (-) [reply] -------

(In reply to comment #9)
> Another workaround might be to add functions to register/unregister the
> ostreams under assert.h/.cc. I think catering for FILE* and ostream* should be
> generic enough to be included there (we probably don't need FILE* version).
> 
> The onus will then be on whoever using the assert and want certain streams be
> flushed to register and unregister them. Unregistering will be more of a
> problem as you don't want seg faults in your abort routine, and normal testing
> won't pick it up (won't see the seg fault unless assert failed).

Ok. Make sure you call these register/unregister functions from
PcapFile::PcapFile and PcapFile::~PcapFile

------- Comment #11 From Antti Mäkelä 2010-06-01 05:12:50 EDT (-) [reply] -------

(In reply to comment #10)
> Ok. Make sure you call these register/unregister functions from
> PcapFile::PcapFile and PcapFile::~PcapFile

  What about ASCII traces?

------- Comment #12 From Mathieu Lacage 2010-06-01 05:17:24 EDT (-) [reply] -------

add an Open and Close method to OutputStreamWrapper, call them from
TraceHelper, add register/unregister calls to its open/close methods.

------- Comment #13 From Mathieu Lacage 2010-06-01 05:17:47 EDT (-) [reply] -------

(In reply to comment #12)
> add an Open and Close method to OutputStreamWrapper, call them from
> TraceHelper, add register/unregister calls to its open/close methods.

and, please, kill SetStream.
Comment 4 Gustavo J. A. M. Carneiro 2010-06-03 07:05:26 UTC
I just saw this bug report and I don't think it makes any sense.  Signals and failed assertions denote program bugs.  As soon as the bugs are fixed, the signals and assertions will no longer happen and you can run your simulation and extract the results.  No need to flush trace files IMHO.  That is just encouraging programmers to not fix bugs...
Comment 5 Quincy Tse 2010-06-03 07:30:51 UTC
(In reply to comment #4)
> No need to flush trace files IMHO.  That is just encouraging programmers
> to not fix bugs...

The idea behind flushing the trace was to help locate the bug. It was a suggestion from bug #931. For example, an assertion fail and the backtrace might tell you that one of the pointers is null when it's not meant to be. If it happens in a section of code that is common to all 100 nodes in the simulation, but one one have this fault, it'll be hard to debug. The trace might be able to narrow it down to something like the fault occurred after Node X (which might be a special node) received a packet, leading you to check the init routine for that node.
Comment 6 Mathieu Lacage 2010-06-03 07:41:23 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > No need to flush trace files IMHO.  That is just encouraging programmers
> > to not fix bugs...
> 
> The idea behind flushing the trace was to help locate the bug. It was a

agreed.
Comment 7 Mathieu Lacage 2010-06-03 07:43:14 UTC
(In reply to comment #2)
> > > but wouldn't it be simpler to just
> > > have the user toggle std::ios_base::unitbuf (or explicit flush()) on all trace
> > > streams when the user runs into a problematic program?
> > 
> > That's one option (I'm not sure about the extra IO cost - I haven't measured
> > it). The problem here would be that many people would just rely on streams that
> > are opened behind the scene (eg. ascii/pcap traces, etc). I wouldn't expect an
> > NS3 user would go around and tinker with someone elses' code just to get the
> > traces out.
> 
> Yes, I would suggest that the user API would be similar to how we enable
> checksums (see GlobalValue g_checksumEnabled in node.cc) so that users don't
> have to edit the streams opened behind the scenes.

The two features are orthogonal: it makes sense to use line buffering but it also makes sense to flush upon process exit/assert/abort.
Comment 8 Gustavo J. A. M. Carneiro 2010-06-03 07:44:34 UTC
I usually prefer to add/fix debug log messages, and redirect log output to a file.  Redirecting stderr seems to work fine without explicit flushing...
Comment 9 Mathieu Lacage 2010-06-03 08:01:02 UTC
(In reply to comment #1)
> Created an attachment (id=907) [details]
> Stream-flushing handler, not as prone to bad ostream pointers


quick comments:

1) replace this:

+namespace ns3 {
+
+void _registerStream (std::ostream& stream);
+
+void _unregisterStream (std::ostream& stream);
+
+void _handleTerminate (void);
+
+}

with:

namespace ns3 {

namespace FatalError {

void RegisterStream (std::ostream *os);

void UnregisterStream (std::ostream *os);

void FlushStreams (void);

}

}

sample implementation:

static void EnsureSegvHandlerRegistered (void)
{
  static bool initialized = false;
  if (initialized)
    {
      return;
    }
  initialized = true;
  // register segv handler here
}

static std::list<std::ostream *> GetStreamList (void)
{
  static std::list<std::ostream *> l;
  return &l;
}
void
RegisterStream (std::ostream *os)
{
   GetStreamList ()->push_back (os);
}
....

2) in NS_FATAL_ERROR, call FlushStreams () before calling abort or terminate.
Comment 10 Mathieu Lacage 2010-06-03 08:01:48 UTC
(In reply to comment #8)
> I usually prefer to add/fix debug log messages, and redirect log output to a
> file.  Redirecting stderr seems to work fine without explicit flushing...

This is orthogonal to the need of being able to look at the output of a pcap or an ascii file when something crashes.
Comment 11 Quincy Tse 2010-06-03 08:10:10 UTC
I'll look into point 1 a bit more tomorrow (it's 10pm at Sydney at the moment...)

(In reply to comment #9)
> 2) in NS_FATAL_ERROR, call FlushStreams () before calling abort or terminate.

std::set_terminate() sets a terminate handler (different to a signal handler), which can be/call the FlushStreams(). That way, NS_FATAL_ERROR is a macro that will not depend on the any functions, and can allow NS3 users to implement their own terminate handler (in case they need to do some other clean up). They can  call the default terminate handler within their handler if they want stream flushing.
Comment 12 Antti Mäkelä 2010-06-03 08:14:11 UTC
(In reply to comment #8)
> I usually prefer to add/fix debug log messages, and redirect log output to a
> file.  Redirecting stderr seems to work fine without explicit flushing...

  Yes, but for example I had really hard time recently debugging a MIP implementation with my custom headers. I tried doing this with GDB and looking into the Ptr objects, but this was much harder than just seeing the same packet causing failure at Deserialize show up in Wireshark.
Comment 13 Quincy Tse 2010-06-03 08:16:24 UTC
(In reply to comment #11)
> std::set_terminate() sets a terminate handler (different to a signal handler),
> ...

Forgot the more important bit - when std::terminate is called, it will trigger a call to the terminate handler. Therefore, I don't need to explicitly call FlushStreams() before std::terminate (But I do if I use std::abort).
Comment 14 Craig Dowell 2010-06-03 08:46:41 UTC
Created attachment 909 [details]
Patch to flush pcap streams

I think this is crazy.

All you really have to do to get the desired behavior is to add four lines of code to PcapFile.cc

When an assert pops the last thing I want to see is the system wandering around doing other things and possibly crashing again.

Can we just do a four line fix and move on instead of creating this monstrosity?

I have attached a patch that provides clean traces when you stick a "*(char *)0 = 0" in the code.
Comment 15 Quincy Tse 2010-06-03 23:16:23 UTC
(In reply to comment #14)
> When an assert pops the last thing I want to see is the system wandering around
> doing other things and possibly crashing again.

Agreed. I wouldn't want memory to be modified from the time the error was generated to the time I'm actually notified. I would love to have a fflush(0) that works for all open ostream.

> Can we just do a four line fix and move on instead of creating this
> monstrosity?

I think that won't cover Ascii traces though. I understand that Ascii traces use OutputStreamWrapper, which you can't easily modify to flush other than changing it to non-buffered output.
Comment 16 Quincy Tse 2010-06-04 01:16:35 UTC
(In reply to comment #9)
> static void EnsureSegvHandlerRegistered (void)
> {
>   ...
> }

I'm not sure what this function is used for. SIGSEGV handler is only required for the duration of FlushStreams in case there are bad ostream*

> 2) in NS_FATAL_ERROR, call FlushStreams () before calling abort or terminate.

Not required as long as a handler had been registered to be triggered whenever std::terminate() is called.
Comment 17 Quincy Tse 2010-06-04 02:46:36 UTC
Created attachment 912 [details]
Changes (mostly) as requested. 

This patch implements most of the items in comment #19. I also added code to cause std::set_terminate() to run on startup.

Anyhow, the more I think about it, the more I think having pointers into main simulator memory can become problematic due to the potential of corrupting memory.
Comment 18 Mathieu Lacage 2010-06-04 04:42:21 UTC
(In reply to comment #14)

> I have attached a patch that provides clean traces when you stick a "*(char *)0
> = 0" in the code.

This does not work from a performance perspective.
Comment 19 Mathieu Lacage 2010-06-04 04:52:06 UTC
(In reply to comment #17)
> Created an attachment (id=912) [details]
> Changes (mostly) as requested. 
> 
> This patch implements most of the items in comment #19. I also added code to
> cause std::set_terminate() to run on startup.

This kind of code is not robust to other global constructors calling NS_FATAL_ERROR before this constructor runs. Please, use the pattern below:

static void EnsureHandlerRegistered (void)
{
  static bool initialized = false;
  if (initialized)
    {
      return;
    }
}

> Anyhow, the more I think about it, the more I think having pointers into main
> simulator memory can become problematic due to the potential of corrupting
> memory.

We can revert this change later if we get problematic reports.
Comment 20 Craig Dowell 2010-06-04 10:29:11 UTC
> This does not work from a performance perspective.

What is the real criterion for "does not work from a performance perspective"?  You just ripped out an implementation of pcap file writing that was significantly faster than the current one, but that worked just fine.  What are the numbers?  How are you justifying this comment? How are you justifying all of this complexity?  If you have real numbers showing that packet-by-packet flushes cause a horrible problem, then use a global variable that one can set while debugging.  How about if I say, "handling stream flushing in an assert handler does not work from a sofware engineering perspective"?  Is that enough for you?

We are fixing non-bugs with a very complicated scaffolding.  If we want to change the return code on asserts, fine.  If we want to flush streams, we should flush streams just like in the rest of the known universe.

Doing all of this stuff in an assertion handler is, IMO insane.  You are now special-casing a number of "exceptions" like fatal errors and aborts.  You haven't gotten to control-C or other singals or termination with memory leaks which will have the same effect of not flushing the streams, right?  Typically, users see the flush problem when they have memory leaks.

IMO you are solving a part of the problem in about the most complicated way possible.
Comment 21 Craig Dowell 2010-06-04 13:18:16 UTC
I'll make a few final comments and then fade back into the background.

1) There is a generic problem with trace file flushing that appears in several contexts.  The most troubling case to me is when a memory leak in a completely unrelated area prevents pcap logging from working correctly.  It would be good to deal with all of these contexts in one go.

2) Simply always flushing all of the streams all of the time is too simplistic and raises performance issues; but when we want the streams flushed, we are debugging a problem of one form or another.  So it seems reasonable to have a simple switch to flip to force the streams to be flushed to give us more information when we decide the performance impact is less important than the data.

3) Adding an explicit flush to our trace code deals with the problem in all cases including signals like control-C and memory leak.

4) Adding exception handlers to flush the streams addresses the problems in a case-by-case basis, but doesn't cover all of the cases (memory leak case especially).

5) When I am debugging a crash, I don't want the system running around trying to salvage things before it stops.  I want it to simply stop without changing anything and let me see exactly what state things were in.  I don't have any problem with throwing a segmentation violation, BTW.

I think that explicit flushes with an on/off switch as a global variable is the most straightforward way to address the issue; and trying to do the work in an exception handler is more complicated than necessary and doesn't address all of the cases.  The same treatment as the flush on write could be easily applied to ascii traces if we decide they are equally important to have flushed.

My $.02

---------- Benchmark Data ----------

I did some benchmarks involving examples/csma-star which is our example that uses tracing the most.  There is a roughly 20% increase in execution time when a pcap flush bit is turned on there (about a half-second difference on my machine from 3 to 3.5 seconds to run the example).  This isn't an intolerable time if it lets one see previously lost data, IMO.  However, this is a humanly noticeable difference, so we shouldn't just leave the flush bit on all the time (although I could appreciate a case for leaving it on in debug code).

I also did micro-benchmarks looking at the effect of the flush on packet writing.  I wrote a thousand packets to a pcap file looking for how long it took; repeated this 50 times to get a good minimum; and looked at 100-byte and 1500-byte packets.

The answer, of course, varies with the average packet size.  not buffering small packets causes much more of a problem than large.

Flushing on every 1500-byte packet takes about 48% more time (in the pcap write method) than letting the buffer cache work.

Flushing on every 100-byte packet takes about 7 times as long as just letting the buffer cache work.

This indicates that a flush control bit is needed -- the performance impacts are, as expected, not small enough to ignore completely.
Comment 22 Gustavo J. A. M. Carneiro 2010-06-04 13:42:00 UTC
I agree with Craig that when a serious error happens I don't want the system to keep running just a bit more to try to flush things.  If memory is corrupted, running just a bit longer could very well end up masking the original error by accident.

Adding signal handlers makes me very uneasy.  Incidentally, are you guys aware you can safely do almost nothing in a signal handler?  Any system call that blocks will deadlock the program when called inside a signal handler...  See also [1].  Signal handling is full of pitfalls and portability concerns...

And I have never used ascii or pcap traces to debug any of my problems; it's always been via NS_LOG and lots of grepping, and lots of gdb.

I would prefer to flush streams only for failed assertions, and not for generic signals.


[1] https://www.securecoding.cert.org/confluence/display/seccode/SIG31-C.+Do+not+access+or+modify+shared+objects+in+signal+handlers
Comment 23 Antti Mäkelä 2010-06-05 02:56:34 UTC
(In reply to comment #21)
> This indicates that a flush control bit is needed -- the performance impacts
> are, as expected, not small enough to ignore completely.

  I have to say that this sounds like most feasible idea - when debugging, just have a possibility to flush all trace files after every write. Maybe it should even be made a default setting in debug builds?

  Preferably have it a toggle that can be set run-time (e.g. saying Simulator::Schedule(Seconds(2000), &SetFlushingOn)). If I know that a crash occurs at t=2005s, I'd rather not have to spend time with flushes in the first 2000 secs that are working ok...
Comment 24 Mathieu Lacage 2010-06-05 03:18:26 UTC
Ok, lots of comments. I will summarize them from my perspective:

1) the issue of what we do in case an assert is raised is orthogonal from whether we allow users to have a global control bit to enable always-flush-on-write. They provide different solutions with different pros and cons to similar problems and each will be useful in a different case.

2) the global control bit to enable always-flush-on-write has one big weakness: it requires that you use it correctly everywhere you use std::ostream. It's not impossible and we can do this in both our ascii trace writing code and the pcap writing code but it will be a bit tricky to enforce that people actually do this. What might work is making PcapWriter use the OutputStream thingy (used by the ascii stuff) and try to re-implement a bunch of forwarding functions on this class to force users to go through the class (and thus have the global switch in the class' write code) rather than access the raw std::ostream (for which we cannot control usage).

3) I don't buy the arguement that it's un-acceptable to try to do some best-effort work on NS_ASSERT triggers: I agree with gustavo that the segfault override is overkill and should be removed but the overall approach has nothing utterly stupid. It's a nice touch and it could help in some cases. I would have helped me quite a bit when I was debugging some things. So, it's not perfect (no one claimed so), it's not going to solve the world's hunger problems but it's going to help in some cases with no obvious downsides (there are theoretical downsides but I don't believe they will kick in in practice). If some real downsides come in later, we can deal with them _when they happen_, maybe by reverting part of quincy's patch.

To summarize:
 - I support quincy's last patch and I intend to merge it without the signal segfault handling.
 - I support the idea of adding a global 'flush-on-write' flag if it's implemented solely in a single class (i.e., we don't start adding explicit checks against this flag in a bunch of unrelated places). I believe that it's possible with some minor refactoring of the OutputStream thingy (mostly, expanding its scope) and patches to do this are welcome.
Comment 25 Mathieu Lacage 2010-06-05 03:21:26 UTC
(In reply to comment #22)

> I would prefer to flush streams only for failed assertions, and not for generic
> signals.

Agreed: I should point out that none of the proposed patches did try to flush streams for generic signals. They merely tried to do something in case a segfault happens while flushing which I think is not a good idea anyway.
Comment 26 Mathieu Lacage 2010-06-22 09:44:32 UTC
changeset 0a0b6bf5fdfd
Comment 27 Quincy Tse 2010-06-22 21:29:55 UTC
Changeset is missing fatal-impl.*