Bug 2725 - EmpiricalRandomVariable must not interpolate CDF
EmpiricalRandomVariable must not interpolate CDF
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: core
ns-3-dev
All All
: P3 normal
Assigned To: Peter Barnes
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-04-07 14:28 UTC by Alexander Krotov
Modified: 2020-05-12 20:06 UTC (History)
3 users (show)

See Also:


Attachments
Testcase (3.81 KB, patch)
2017-04-07 14:28 UTC, Alexander Krotov
Details | Diff
Bugfix (3.41 KB, patch)
2017-04-07 14:36 UTC, Alexander Krotov
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Krotov 2017-04-07 14:28:56 UTC
Created attachment 2827 [details]
Testcase

To sample a value from empirical CDF, EmpiricalRandomVariable draws uniform random variable between 0.0 and 1.0 and uses inverse transform sampling to convert it to actual value.

The problem is that when drawn value falls between two CDF points (nearly always), it linearly interpolates CDF.

For example, if you sample discrete random variable, which is 0.0 with probability 0.5 and 100.0 with probablity 0.5, its CDF has two points: (0.0, 0.5) and (100.0, 1.0). Mean value is 50.0. But if you linearly interpolate CDF, mean value is 25.0.

For continuous random variables this bug may not affect you if you sample a lot of points from original distribution, but with discrete random variables it will affect results regardless of how many statistics you have.

The solution is not to interpolate CDF at all. It is the correct way to resample random variables.

Attached is a test case, bugfix follows.
Comment 1 Alexander Krotov 2017-04-07 14:36:17 UTC
Created attachment 2828 [details]
Bugfix

Here is a bugfix.

Note that it makes some existing testcases incorrect. These artificially created testcases, such as "uniform distributions" with 3 samples, are wrong. I think they should be replaced.
Comment 2 Tommaso Pecorella 2017-04-18 15:48:54 UTC
I understand the problem, but I'd reject the patch.

The problem is in the duality between continuous and discrete random variables, and in the fact that we're using the very same method to draw both numbers.

You're totally right about discrete random numbers CDFs: we have a bug.
For continuous random numbers CDFs we can discuss a lot about what's the best strategy, and I doubt we'd ever get an agreement (if not on the fact that the more empirical points we have, the better).

As an example, shall we use a 0-th interpolation (your proposal), a 1-th interpolation (old behavior), 2-th interpolation (also a possibility) ?

Questions... do we have a statistician around ?
 

(In reply to Alexander Krotov from comment #0)
> Created attachment 2827 [details]
> Testcase
> 
> To sample a value from empirical CDF, EmpiricalRandomVariable draws uniform
> random variable between 0.0 and 1.0 and uses inverse transform sampling to
> convert it to actual value.
> 
> The problem is that when drawn value falls between two CDF points (nearly
> always), it linearly interpolates CDF.
> 
> For example, if you sample discrete random variable, which is 0.0 with
> probability 0.5 and 100.0 with probablity 0.5, its CDF has two points: (0.0,
> 0.5) and (100.0, 1.0). Mean value is 50.0. But if you linearly interpolate
> CDF, mean value is 25.0.
> 
> For continuous random variables this bug may not affect you if you sample a
> lot of points from original distribution, but with discrete random variables
> it will affect results regardless of how many statistics you have.
> 
> The solution is not to interpolate CDF at all. It is the correct way to
> resample random variables.
> 
> Attached is a test case, bugfix follows.
Comment 3 Alexander Krotov 2017-04-18 16:28:39 UTC
(In reply to Tommaso Pecorella from comment #2)
> I understand the problem, but I'd reject the patch.
> 
> The problem is in the duality between continuous and discrete random
> variables, and in the fact that we're using the very same method to draw
> both numbers.
> 
> You're totally right about discrete random numbers CDFs: we have a bug.
> For continuous random numbers CDFs we can discuss a lot about what's the
> best strategy, and I doubt we'd ever get an agreement (if not on the fact
> that the more empirical points we have, the better).

Discrete random numbers are just a subset of continuous random numbers. Without additional assumptions any approach that is wrong for discrete random variables is also wrong for some continuous random variables.

> As an example, shall we use a 0-th interpolation (your proposal), a 1-th
> interpolation (old behavior), 2-th interpolation (also a possibility) ?

If you make no assumptions about the true CDF, ECDF is the only unbiased estimator.

With linear, quadratic, cubic interpolation etc no matter how many empirical points you have, you will have a biased random variable for some distributions (discrete is just an example).

With ECDF and infinite number of empirical points you are essentially just shuffling them around inside infinite stream. If random numbers emitted by original distribution are independent, it is impossible to see the difference between original random variable and ECDF-based.
Comment 4 Alexander Krotov 2017-04-18 16:41:16 UTC
> As an example, shall we use a 0-th interpolation (your proposal), a 1-th
> interpolation (old behavior), 2-th interpolation (also a possibility) ?

By the way, the next thing present in any book on non-parametric statistics after EDF is the kernel density estimation. It is more complex and it is also unclear what kernel to use, but at least you get consistent unbiased estimator.

Where did the idea of interpolating EDF even comes from? The only thing I can find is some stack overflow question:
https://stats.stackexchange.com/questions/4909/interpolating-the-empirical-cumulative-function
Comment 5 Peter Barnes 2017-04-18 16:57:59 UTC
I agree that interpolating an EDF is odd.  When we base a DF on a histogram, we've already assumed responsibility for sampling appropriately for the underlying distribution.  If it's lumpy that's due to the underlying distribution.  The only correct thing is to take it at face value, without interpolation.  If I want interpolation, to smooth out the edges, I need to add points to the EDF.
Comment 6 Tommaso Pecorella 2017-04-18 17:19:16 UTC
You're right, interpolation isn't a good idea.

Perhaps *fitting* could be a better idea, but I'd rather avoid doing stuff that is more appropriate for a statistical program (like R).

Suggestion: add documentation to the patch, in particular about the changed behaviour and that the use of empirical data have to be validated by other means (i.e., how many points it's not our business).




(In reply to Alexander Krotov from comment #4)
> > As an example, shall we use a 0-th interpolation (your proposal), a 1-th
> > interpolation (old behavior), 2-th interpolation (also a possibility) ?
> 
> By the way, the next thing present in any book on non-parametric statistics
> after EDF is the kernel density estimation. It is more complex and it is
> also unclear what kernel to use, but at least you get consistent unbiased
> estimator.
> 
> Where did the idea of interpolating EDF even comes from? The only thing I
> can find is some stack overflow question:
> https://stats.stackexchange.com/questions/4909/interpolating-the-empirical-
> cumulative-function
Comment 7 Peter Barnes 2019-11-25 19:51:57 UTC
One more suggestion:  

Leave the Interpolate function in the code, in case someone wants to use it, but remove it from the GetValue implementation, as in the patch.
Comment 8 Tom Henderson 2019-11-26 00:52:22 UTC
I suspect that it may be inspired by MATLAB:
https://www.mathworks.com/matlabcentral/fileexchange/7976-random-number-from-empirical-distribution

See also:
http://www-eio.upc.edu/~lmontero/lmm_tm/MESIO-SIM%20-%20%28US%29%20Generation%20of%20random%20variables.pdf

It seems to me that perhaps there are two classes needed.  One would be suitable for empirical observations from an underlying discrete distribution (no interpolation), and one for continuous.  

I also suggest to align what we have with MATLAB, so that one could do this in MATLAB as suggested on the above webpage:

% Generate 1000 random normal numbers as data for EMPRAND
dist = randn(1000,1);
% Now generate 2000 random numbers from this data
xr = emprand(dist,2000,1);

and one could do similarly with our version of this empirical random variable.
Comment 9 Peter Barnes 2019-11-26 16:33:38 UTC
I've proposed this as a GCI task, leaving in the Interpolate, (and not implementing Tom's latest suggestion).
Comment 10 Peter Barnes 2020-05-12 20:06:49 UTC
Merged in fa7d6a894ed808bec9c15f997a3d997f5e12023f