Bug 1563

Summary: reduce valgrind test scope
Product: ns-3
Reporter: Tom Henderson <tomh>
Component: test framework
Assignee: Mitch Watrous <watrous>
Status: RESOLVED FIXED
Severity: enhancement
CC: nicola, ns-bugs, tomh, vedran, watrous
Priority: P5
Version: pre-release
Hardware: All
OS: Linux
Attachments:
  - list of valgrind test durations, for test.py
  - get rid of default argument value in TestSuite::AddTestCase
  - another attempt to get rid of default argument value in AddTestCase

Description Tom Henderson 2012-12-31 18:28:14 UTC
test.py with valgrind enabled takes in excess of 9 hours to run on our buildslaves.  It has noticeably slowed after the GSOC LTE schedulers were merged.

I recently ran this on one of our (non-virtualized) workstations and found that the following tests and examples are taking greater than 1000 seconds each to run:

PASS (2270.145): TestSuite lte-rr-ff-mac-scheduler
PASS (3174.109): TestSuite lte-tdmt-ff-mac-scheduler
PASS (3408.120): TestSuite lte-fdmt-ff-mac-scheduler
PASS (3464.267): TestSuite lte-pf-ff-mac-scheduler
PASS (3330.301): TestSuite lte-tta-ff-mac-scheduler
PASS (2849.358): TestSuite lte-fdbet-ff-mac-scheduler
PASS (3191.014): TestSuite lte-tdbet-ff-mac-scheduler
PASS (1207.467): TestSuite lte-rlc-am-e2e
PASS (2805.263): TestSuite lte-epc-e2e-data
PASS (11082.303): TestSuite lte-fdtbfq-ff-mac-scheduler
PASS (10483.255): TestSuite lte-tdtbfq-ff-mac-scheduler
PASS (1846.024): Example /a/users/tomh/ns-3-allinone/ns-3-dev/build/src/lte/examples/ns3-dev-lena-intercell-interference-debug
PASS (10945.489): TestSuite lte-pss-ff-mac-scheduler
PASS (1556.757): Example /a/users/tomh/ns-3-allinone/ns-3-dev/build/src/uan/examples/ns3-dev-uan-cw-example-debug

I am proposing to disable the above tests from valgrind testing (but retain them for the non-valgrind case).  There are many other LTE tests that will run in valgrind even with this change.

It would also be preferable to reduce the running time of these LTE scheduler tests for the non-valgrind case, particularly the pss and the tbfq types.
Comment 1 Tom Henderson 2012-12-31 18:30:21 UTC
Created attachment 1492 [details]
list of valgrind test durations, for test.py

stdout from running "test.py -d -g" on ns-3-dev
Comment 2 Nicola Baldo 2013-01-04 13:21:04 UTC
For a quick solution I think we could accept to remove these scheduler tests from valgrind (no idea on how, though), but only if we add a minimum of valgrind coverage for each scheduler. For example, we could do it by adding to examples-to-run.py something like this:

("lena-simple-epc  --simTime=1.1 --ns3::LteHelper::Scheduler=ns3::PssFfMacScheduler", "True", "True"),

and repeating the above for each scheduler. Otherwise most schedulers will be left completely untested from a valgrind perspective. The reason is that, once we excluded the scheduler test suites, most other LTE test suites use the PF scheduler, a few use RR, and no other scheduler is ever used.
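Repeating that entry for each scheduler could even be generated rather than hand-written; a hypothetical Python sketch (the scheduler TypeId list below is illustrative, not exhaustive):

```python
# Build one minimal valgrind example entry per LTE scheduler, in the
# (example_name, do_run, do_valgrind_run) tuple format used by
# examples-to-run.py.  The scheduler TypeId list here is illustrative.
schedulers = [
    "ns3::RrFfMacScheduler",
    "ns3::PfFfMacScheduler",
    "ns3::PssFfMacScheduler",
    "ns3::FdTbfqFfMacScheduler",
]

cpp_examples = [
    ("lena-simple-epc --simTime=1.1 --ns3::LteHelper::Scheduler=%s" % s,
     "True", "True")
    for s in schedulers
]
```

Each generated tuple keeps do_run and do_valgrind_run set to "True", so every scheduler would get at least one short valgrind run.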

For a more long term perspective, I think that we need to rework the test categorization to support a more useful testing granularity. Currently we classify TestSuite like UNIT, SYSTEM, PERFORMANCE etc, maybe we could switch to a time-based categorization like QUICK, EXTENSIVE, TAKES_FOREVER etc., or maybe based on test importance. Additionally, it would be good to have some granularity for TestCase as well, so we can reduce testing time by running fewer test cases, without excluding an entire TestSuite.
Comment 3 Tom Henderson 2013-01-10 09:54:51 UTC
(In reply to comment #2)
> For a quick solution I think we could accept to remove these scheduler tests
> from valgrind (no idea on how, though), but only if we add a minimum of
> valgrind coverage for each scheduler. For example, we could do it by adding
> to examples-to-run.py something like this:
> 
> ("lena-simple-epc  --simTime=1.1
> --ns3::LteHelper::Scheduler=ns3::PssFfMacScheduler", "True", "True"),
> 
> and repeating the above for each scheduler. Otherwise most schedulers will
> be left completely untested from a valgrind perspective. The reason is that,
> once we excluded the scheduler test suites, most other LTE test suites use
> the PF scheduler, a few use RR, and no other scheduler is ever used.

I will make this change.

> 
> For a more long term perspective, I think that we need to rework the test
> categorization to support a more useful testing granularity. Currently we
> classify TestSuite like UNIT, SYSTEM, PERFORMANCE etc, maybe we could switch
> to a time-based categorization like QUICK, EXTENSIVE, TAKES_FOREVER etc., or
> maybe based on test importance. Additionally, it would be good to have some
> granularity for TestCase as well, so we can reduce testing time by running
> fewer test cases, without excluding an entire TestSuite.

I will leave this open (retitled to something more meaningful) and have a look later.
Comment 4 Tom Henderson 2013-01-14 00:15:31 UTC
I added some "valgrind only" examples as Nicola suggested, in changeset f09a0822157e.  I disabled the scheduler test suites for valgrind runs only.  I did not make any changes to the other long running tests and examples that are not related to the LTE scheduler.

Some light testing that I did suggests that this change alone will reduce run time by 2/3.  I will keep this open to look at further valgrind test and example reductions, and then retitle it to remember that we may want to add another level of valgrind test granularity.
Comment 5 Vedran Miletić 2013-01-16 07:11:22 UTC
+1 on something like QUICK, EXTENSIVE, TAKES_FOREVER.
Comment 6 Mitch Watrous 2013-01-18 18:16:37 UTC
Users can already skip slow tests in valgrind using the do_valgrind_run variable in examples_to_run.py like this:

    # A list of C++ examples to run in order to ensure that they remain
    # buildable and runnable over time.  Each tuple in the list contains
    #
    #     (example_name, do_run, do_valgrind_run).
    #
    # See test.py for more information.
    cpp_examples = [
        ("lena-cqi-threshold", "True", "False"),
        ("lena-dual-stripe --simTime=0.01", "True", "False"),
        ("lena-dual-stripe --epc=1 --simTime=0.01", "True", "False"),
        ...
        ]

Note that I changed all of the third elements in the tuples to be False here.

Setting do_valgrind_run = False is logically equivalent to skipping a TAKES_FOREVER test.

A labeling system won't make slow tests faster; the only real solution is to make those slow valgrind tests themselves faster.

If you allow users to skip the slow tests, you allow them to skip the most important ones, such as these very slow LTE valgrind tests, which need to run in this case to make sure all of the different schedulers are exercised.

Maybe we could have a buildslave job that runs all of the really slow valgrind jobs but only every week?
Comment 7 Tom Henderson 2013-01-18 19:08:30 UTC
(In reply to comment #6)
> Users can already skip slow tests in valgrind using the do_valgrind_run
> variable in examples_to_run.py like this:
> 
>     # A list of C++ examples to run in order to ensure that they remain
>     # buildable and runnable over time.  Each tuple in the list contains
>     #
>     #     (example_name, do_run, do_valgrind_run).
>     #
>     # See test.py for more information.
>     cpp_examples = [
>         ("lena-cqi-threshold", "True", "False"),
>         ("lena-dual-stripe --simTime=0.01", "True", "False"),
>         ("lena-dual-stripe --epc=1 --simTime=0.01", "True", "False"),
>         ...
>         ]
> 
> Note that I changed all of the third elements in the tuples to be False here.
> 
> Setting do_valgrind_run = False is logically equivalent to skipping a
> TAKES_FOREVER test.

The idea is not to disable valgrind testing altogether but to allow the buildslaves to select the scope/depth of coverage, so we don't have 9 hours of valgrind testing or none at all.

> Maybe we could have a buildslave job that runs all of the really slow
> valgrind jobs but only every week?

Exactly.

What I have in mind is that we introduce the "-gg" option for deeper testing:

./test.py -gg ...

We already have a list in test.py for tests to exclude from valgrind (e.g. NSC).  We can add a second list in test.py for tests to exclude from valgrind unless -gg option is specified.

The main need is for the test suites, but the examples_to_run.py parsing could be extended also to allow for another list element; e.g.

run lena-dual-stripe under "test.py -g":

         ("lena-dual-stripe --simTime=0.01", "True", "True"),

run lena-dual-stripe under "test.py -gg":

         ("lena-dual-stripe --simTime=0.01", "True", "True", "True"),

this optional fourth list element ("do_deep_valgrind_run") could be robustly handled by test.py if it is not present, allowing us to avoid touching all examples_to_run.py.  However, I care more about the test suites than the examples for this bug.
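The backward-compatible handling of the optional element could look roughly like this in test.py (a sketch; the helper name parse_example_tuple is hypothetical):

```python
def parse_example_tuple(entry):
    """Unpack an examples-to-run.py tuple, tolerating the optional
    do_deep_valgrind_run element.  Legacy three-element tuples
    default it to "False"."""
    if len(entry) == 4:
        return entry
    example_name, do_run, do_valgrind_run = entry
    return (example_name, do_run, do_valgrind_run, "False")

# A legacy entry and a new-style entry can then coexist in one file:
legacy = ("lena-dual-stripe --simTime=0.01", "True", "True")
deep = ("lena-dual-stripe --simTime=0.01", "True", "True", "True")
```

With this, only the files that actually want deep valgrind runs would need to adopt the longer tuple.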

Any other opinions on this?
Comment 8 Mitch Watrous 2013-03-14 18:42:11 UTC
This ns-3-dev changeset:

    Allow very slow test cases to be skipped

allows very slow test cases to be skipped, which is what was requested for this bug.  

Before, only whole test suites could be skipped.

Now, single test cases can be skipped as well.

A second argument, takesForever, was added to this function to allow test cases to be skipped:

  /**
   * \brief Add an individual test case to this test suite.
   *
   * \param testCase Pointer to the test case object to be added.
   * \param takesForever Set equal to true if this test case takes a
   * long time to run and should be skipped unless the --full option
   * is specified for the test-runner.
   */
  void AddTestCase (TestCase *testCase, bool takesForever = false);

For example, this would mark the 2nd test case in the timer test suite as very slow, so that it would usually be skipped unless the new --full option was used with test.py:

  TimerTestSuite ()
    : TestSuite ("timer", UNIT)
  {
    AddTestCase (new TimerStateTestCase ());
    AddTestCase (new TimerTemplateTestCase (), true);
  }

There is also a new test.py option to support this:

  -f, --full   run the full set of tests including slow ones

Now, the 2nd test case above would be skipped during the following test.py run using valgrind because --full was not specified; this run covers most of the tests under valgrind but skips all of the ones marked as taking forever:

./test.py --grind

The following command would run all of the tests including the ones marked as taking forever, which would include the 2nd test case that had been skipped during the valgrind run above:

./test.py --full
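The takesForever scheme boils down to a single skip predicate; a minimal sketch of that logic in Python (function and variable names are hypothetical, not the actual test-runner code):

```python
def should_run(takes_forever, full_requested):
    """A case marked takesForever runs only when --full is given;
    every other case always runs."""
    return full_requested or not takes_forever

# Mirrors the TimerTestSuite example above: the first case always
# runs, the second one only under --full.
cases = [("TimerStateTestCase", False), ("TimerTemplateTestCase", True)]
default_run = [n for n, slow in cases if should_run(slow, full_requested=False)]
full_run = [n for n, slow in cases if should_run(slow, full_requested=True)]
```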
Comment 9 Mitch Watrous 2013-03-14 18:49:38 UTC
The test cases that are currently slowing down the valgrind runs in the lte module should be flagged as taking forever, to see whether this makes the normal valgrind runs (i.e. without --full set) fast enough that this bug can be closed.

That is why I am reassigning this bug to Nicola so that he can do this.
Comment 10 Tom Henderson 2013-03-15 18:58:16 UTC
The existing option could be extended a bit to make it more extensible and future-proof if we made this argument an integer instead of a boolean:

instead of:

  -f, --full   run the full set of tests including slow ones

we did:

 -f LEVEL, --fullness=LEVEL   control the level of test case coverage

We could set the default perhaps to "1", put the currently slow tests at "2", and allow growth to define a future "quick" category or multiple levels of "slowness" in the future.
Comment 11 Mitch Watrous 2013-03-15 19:51:56 UTC
You are proposing to have the users do this:

  TimerTestSuite ()
    : TestSuite ("timer", UNIT)
  {
    AddTestCase (new TimerStateTestCase (),    1);
    AddTestCase (new TimerTemplateTestCase (), 2);
  }

How do you know which is slow and which is fast (2 or 1)?

How does the test.py user know what levels appear in the unit tests?

Maybe an enum is better:

  enum TestDuration {
    QUICK,
    EXTENSIVE,
    TAKES_FOREVER
  };

There are 2 problems with using enums:

(1.) How do you specify a certain level of tests to run if, for example, you want to run QUICK and EXTENSIVE tests but not TAKES_FOREVER?  In C++ I can determine if a test's level is greater than the level passed in from test.py,

    if (test.m_level > fullNess)

but that doesn't necessarily work for an enum.

(2.) The other problem with enums is that test.py and Python don't know about enums.

I think that as long as the current implementation (using booleans) solves the valgrind timing issue, it should be sufficient.
Comment 12 Vedran Miletić 2013-03-16 05:52:06 UTC
The enum idea is kind of what we had agreed on the DevelMeeting.
Comment 13 Nicola Baldo 2013-03-18 09:31:09 UTC
I am with Tom and Vedran here. Three levels of scope is what we agreed at the developers meeting. 

To give you an idea, this is what I would do with the LTE model for most test suites once the three levels are available:

- activate only 1 or 2 test cases for the QUICK level, so it takes just a few seconds to run. This test would be used during development to quickly check that things still work ok and there is no major break. 

- leave most of the other test cases to the EXTENSIVE level, taking out just those which are most time consuming. The target test execution time could be 5-15 minutes. This is the typical test that would be run before commit, and also for daily automated testing.

- put the existing time consuming tests into TAKES_FOREVER, and then also add some cases that I did not add previously because such flag was not available. The target test execution time could be as long as a few hours, to be executed seldom, say before a release and possibly once per week or month.


For the implementation:

a) I agree we need to pass AddTestCase something self-explanatory, i.e., an enum or a constant, not just a number.

b) to avoid problem (1.), we could explicitly assign a value to each enum entry; as for problem (2.), sorry I don't have a clue... I don't use python that much ;-)

c) I would not assign a default value to the parameter passed to AddTestCase; instead I am in favor of changing all the existing calls to AddTestCase so they explicitly use a value (OK to do search/replace and put EXTENSIVE everywhere). The rationale: I think most people write tests just by copy-paste of an existing test, so they will never learn to use the new variable test scope if we leave legacy usages of AddTestCase around. 
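Point (b) can be sketched concretely: with explicit integer values on the levels, "run everything up to this fullness" becomes a plain ordered comparison, which also gives the Python side (then Python 2, with no built-in enum) something it can mirror. A hypothetical sketch:

```python
# Mirror the proposed C++ enum with explicit integer values so that
# level selection is an ordered numeric comparison on both sides.
QUICK, EXTENSIVE, TAKES_FOREVER = 1, 2, 3

def runs_at(test_level, requested_fullness):
    """A test runs when its level does not exceed the requested fullness."""
    return test_level <= requested_fullness
```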


Re-assigning to Mitch for revising the patch. I'll be glad to take care of the LTE module afterwards.
Comment 14 Mitch Watrous 2013-03-19 17:17:18 UTC
This ns-3-dev changeset:

    Allow three speeds of test cases to be skipped

allows three levels of test cases to be skipped based on their speed.

The three levels are specified using this enum in C++:

    /**
     * \enum TestDuration
     * \brief How long the test takes to execute.
     */
    enum TestDuration {
      QUICK         = 1,  /// Fast test.
      EXTENSIVE     = 2,  /// Medium length test.
      TAKES_FOREVER = 3   /// Very long running test.
    };

A new second argument, duration, was added to this function to specify the test duration:

    /**
     * \brief Add an individual test case to this test suite.
     *
     * \param testCase Pointer to the test case object to be added.
     * \param duration Amount of time this test takes to execute.
     */
    void AddTestCase (TestCase *testCase, enum TestDuration duration = QUICK);

For example, this would mark the 1st test case in the timer test suite as being moderately slow and mark the 2nd test case as being very, very slow.  These test cases would usually be skipped unless the proper --fullness option was chosen with test.py:

    TimerTestSuite ()
      : TestSuite ("timer", UNIT)
    {
      AddTestCase (new TimerStateTestCase (), TestCase::EXTENSIVE);
      AddTestCase (new TimerTemplateTestCase (), TestCase::TAKES_FOREVER);
    }

There is also a new test.py option to support this with this help message:

    -f FULLNESS, --fullness=FULLNESS
                          choose the duration of tests to run: QUICK,
                          EXTENSIVE, or TAKES_FOREVER, where EXTENSIVE
                          includes QUICK and TAKES_FOREVER includes QUICK
                          and EXTENSIVE (only QUICK tests are run by
                          default)
  
These test.py options are chosen as follows:

    ./test.py --fullness=QUICK

or

    ./test.py --fullness=EXTENSIVE

or

    ./test.py --fullness=TAKES_FOREVER

Now, the 1st and 2nd test cases above would be skipped during the following test.py run using valgrind, because --fullness was not specified and only QUICK tests run by default:

    ./test.py --grind

The following command would run all of the tests including the ones marked as
extensive and those taking forever, which would include the 1st and 2nd test cases that had been skipped during the valgrind run above:

    ./test.py --fullness=TAKES_FOREVER
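On the test.py side, the --fullness string only needs to be mapped onto the enum's numeric values before filtering; a hypothetical sketch of that selection (the helper name select_cases is illustrative):

```python
# Map the --fullness option strings onto the numeric values of the
# C++ TestDuration enum (QUICK=1, EXTENSIVE=2, TAKES_FOREVER=3).
FULLNESS = {"QUICK": 1, "EXTENSIVE": 2, "TAKES_FOREVER": 3}

def select_cases(cases, fullness="QUICK"):
    """Keep the (name, duration) pairs covered by the requested
    fullness; QUICK is the default, as in test.py."""
    limit = FULLNESS[fullness]
    return [name for name, duration in cases if duration <= limit]

# The TimerTestSuite example: an EXTENSIVE and a TAKES_FOREVER case.
timer_suite = [("TimerStateTestCase", 2), ("TimerTemplateTestCase", 3)]
```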
Comment 15 Mitch Watrous 2013-03-19 19:06:37 UTC
> What I have in mind is that we introduce the "-gg" option for deeper testing:
> 
> ./test.py -gg ...
> 
> We already have a list in test.py for tests to exclude from valgrind (e.g.
> NSC).  We can add a second list in test.py for tests to exclude from valgrind
> unless -gg option is specified.
> 
> The main need is for the test suites, but the examples_to_run.py parsing could
> be extended also to allow for another list element; e.g.
> 
> run lena-dual-stripe under "test.py -g":
> 
>          ("lena-dual-stripe --simTime=0.01", "True", "True"),
> 
> run lena-dual-stripe under "test.py -gg":
> 
>          ("lena-dual-stripe --simTime=0.01", "True", "True", "True"),
> 
> this optional fourth list element ("do_deep_valgrind_run") could be robustly
> handled by test.py if it is not present, allowing us to avoid touching all
> examples_to_run.py.  However, I care more about the test suites than the
> examples for this bug.
> 
> Any other opinions on this?

If you want to add another element to this tuple in examples-to-run.py:

     (example_name, do_run, do_valgrind_run)

i.e. to add a fourth element as follows:

     (example_name, do_run, do_valgrind_run, example_duration)

then you will have to modify all of the existing examples-to-run.py files because they now all have this comment before the list of C++ examples:

    # A list of C++ examples to run in order to ensure that they remain
    # buildable and runnable over time.  Each tuple in the list contains
    #
    #     (example_name, do_run, do_valgrind_run).
    #
    # See test.py for more information.
    cpp_examples = [
        ("adhoc-aloha-ideal-phy", "True", "True"),
        ("adhoc-aloha-ideal-phy-with-microwave-oven", "True", "True"),
        ("adhoc-aloha-ideal-phy-matrix-propagation-loss-model", "True", "True"),
    ]

So, even if you modified test.py to work with the old files, the comments in these files would all be wrong and need to be updated.

In other words, I can make test.py backwards compatible to handle the changed tuples, but I can't make the examples-to-run.py files be both correct and backwards compatible because of these comments.

You can already prevent valgrind run for an example using this variable in the tuple:

    do_valgrind_run

The problem with the test suites was that you didn't have control over individual test cases within the test suite and couldn't turn off just one.  

But that doesn't apply to examples, because examples have no sub-parts analogous to test cases that you would need to turn off.

I think we should leave the examples-to-run.py files as they are, because I don't see the value added by having three different duration flavors of examples compared to the amount of time it would take to edit them all.
Comment 16 Nicola Baldo 2013-03-20 06:39:00 UTC
Created attachment 1535 [details]
get rid of default argument value in TestSuite::AddTestCase

Thanks Mitch for the new changeset. I've generated an additional patch that removes the default argument value from TestSuite::AddTestCase and modifies all occurrences of this method so they use TestCase::QUICK.
I already explained the rationale for this change in my previous post. If there is no major complaint, I would go ahead and push it.

As for adding a duration to examples-to-run.py, maybe we don't need it. Anyway, I would defer this discussion until after we have finalized the TestCase duration issue, including a revision of the durations of the existing test cases, possibly in all modules.
Comment 17 Mitch Watrous 2013-03-20 13:02:23 UTC
(In reply to comment #16)
> Created attachment 1535 [details]
> get rid of default argument value in TestSuite::AddTestCase
> 
> Thanks Mitch for the new changeset. I've generated an additional patch that
> removes the default argument value from TestSuite::AddTestCase, and that
> modified all the occurrences of this method so they use TestCase::QUICK. 
> I already explained the rationale for this change in my previous post. If there
> is no major complaint, I would go ahead and push it.
> 
> As for adding a duration examples-to-run.py, maybe we don't need it. Anyway, I
> would defer this discussion to after we finalized the TestCase duration issue,
> including a revision of the duration of the existing test cases, possibly in
> all modules.

I wouldn't do this part of the patch, which gets rid of the default value for duration, because it makes the AddTestCase() change not backwards compatible with existing code, i.e. all current calls to this function will have to be changed because they will no longer compile:

diff -r 0455e96a3cca src/core/model/test.h
--- a/src/core/model/test.h	Tue Mar 19 13:14:12 2013 -0700
+++ b/src/core/model/test.h	Wed Mar 20 11:21:21 2013 +0100
@@ -858,7 +858,7 @@
    * \param testCase Pointer to the test case object to be added.
    * \param duration Amount of time this test takes to execute.
    */
-  void AddTestCase (TestCase *testCase, enum TestDuration duration = QUICK);
+  void AddTestCase (TestCase *testCase, enum TestDuration duration);
 
   /**
    * \param directory the directory where the test data is located

I don't have a problem with the rest of the patch, which specifies the duration for all of the existing test cases.  I think those changes would keep users from copying and pasting a test case that leaves the duration unspecified, since all of the existing test cases would now specify their duration; that is exactly what Nicola wanted to prevent.

I don't think you need to get rid of the default parameter to achieve that goal, however.
Comment 18 Nicola Baldo 2013-03-20 15:03:50 UTC
Created attachment 1536 [details]
another attempt to get rid of default argument value in AddTestCase

I've overloaded AddTestCase: the old method signature is still supported but marked as NS_DEPRECATED, and the new method signature is there without the default argument value.

Mitch, does this revised patch address your concerns?
Comment 19 Mitch Watrous 2013-03-20 16:59:02 UTC
(In reply to comment #18)
> Created attachment 1536 [details]
> another attempt to get rid of default argument value in AddTestCase
> 
> I've overloaded AddTestCase, now the old method signature is still supported
> but marked as NS_DEPRECATED, and the new method signature is there, without the
> default argument value.
> 
> Mitch, does this revised patch address your concerns?

Yes, that looks good.  I don't have any other objections.
Comment 20 Nicola Baldo 2013-03-22 12:19:06 UTC
changeset:   9266:d26408b17360
user:        Nicola Baldo <nbaldo@cttc.es>
date:        Fri Mar 22 13:14:38 2013 +0100
summary:     bug 1563: get rid of default argument value in AddTestCase

changeset:   9267:1309ad7bb448
tag:         tip
user:        Nicola Baldo <nbaldo@cttc.es>
date:        Fri Mar 22 16:55:43 2013 +0100
summary:     applied variable test scope to LTE tests


I've done a couple of tests, and I get the following execution times on my machine:

QUICK     :    73.76s
EXTENSIVE :  2286.66s

(only lte-* test suites included in the above)
Comment 21 Tom Henderson 2013-04-01 11:16:02 UTC
There is a lingering memory leak caused by this change.  It appears to result from the suppression of EXTENSIVE tests (a new'ed TestCase is never freed) when the default QUICK level is enabled.

https://ns-buildmaster.ee.washington.edu:8010/job/Daily-with-valgrind/232/label=Fedora-64-15/console

To reproduce,

./waf --command-template="valgrind --leak-check=full --track-origins=yes %s --suite=lte-phy-error-model" --run test-runner
Comment 22 Mitch Watrous 2013-04-01 14:37:36 UTC
Bug closed.

ns-3 changeset: 3aea6c14a1f9