I was excited to see this merged SciPy PR, adding Sobol’ indices. They are a measure used in global sensitivity analysis (Wikipedia has good background). Thanks to the contributors.
I have just started to try out the feature; here are my initial reactions. I have a few suggestions, but feel free to take them or leave them.
`func` signature
The API I would have found most intuitive would have been that `func` takes `d` arguments. IMO it's more explicit to break things down like this. It feels like what function arguments are for in the language. Each argument can be a vector of length `n` so that those who want to take advantage of NumPy efficiencies within `func` can do so.
(Personally, my ideal API would be to allow these to be positional or keyword arguments, where `dists` would be a dictionary if they are to be treated as keyword arguments. But handling everything as positional seems common within SciPy, and it seems fine to follow that pattern here.)
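To make the idea concrete, here is a minimal sketch of the signature I have in mind, using the Ishigami function as my own example and a hypothetical `as_array_func` adapter back to an array-based callable. The shape convention assumed below (one row per input variable) is my assumption, not necessarily what the merged PR uses.

```python
import numpy as np

# Proposed style: one positional argument per input variable,
# each a length-n vector so NumPy vectorisation is still possible.
def ishigami(x1, x2, x3, a=7.0, b=0.1):
    return np.sin(x1) + a * np.sin(x2) ** 2 + b * x3**4 * np.sin(x1)

# Hypothetical adapter to an array-based callable that receives all
# d inputs stacked in a single array, one row per variable.
def as_array_func(f):
    def wrapped(x):
        return f(*x)  # unpack the d rows as separate arguments
    return wrapped

x = np.random.default_rng(0).uniform(-np.pi, np.pi, size=(3, 1000))
assert np.allclose(as_array_func(ishigami)(x), ishigami(*x))
```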
Using pre-existing samples
Would there be any point in allowing the user to provide pre-existing Monte Carlo samples and compute the indices from these, instead of sampling and simulating again?
I have only looked briefly at the paper Saltelli 2010 and my understanding is limited, so please correct me if I'm blundering here. Looking at section 3 of the paper (but using the notation from the code), it seems like previous samples could be used:
- Matrices `A` and `B` and their evaluations `f_A` and `f_B` could come from pre-existing simulations. This saves `2*d*n` sampling operations and `2*n` simulations.
- However, computing `f_AB` needs to be done for the sole purpose of the sensitivity analysis. This requires `d*n` simulations.
The idea would be to allow `dists` to be a matrix of dimension `(2n, d+s)` of existing simulation runs. Then these would be split appropriately to generate `A` and `B`, and `f_A` and `f_B`.
I can see two reasons to allow this:
- Sometimes samples are all you have: you don't know the mathematical specification of the distribution
- Computational efficiency (discussed more below)
Computational efficiency
Insofar as `d` is large and simulation is more costly than sampling, the computation is dominated by the `d*n` simulations, and using pre-existing samples is not worth the hassle.
If `d` is small, the whole thing is probably fast, so it is not worth worrying about optimizing.
That leaves the case where sampling is much more expensive than simulation. This can happen if we are sampling from a fully general distribution where only inverse transform sampling is available. Each random sample requires numerically inverting the CDF (i.e. a call to `ppf`, which will call a numerical equation solver on the CDF). A very bad case occurs if only the PDF is known: a call to `ppf` calls a numerical equation solver on `cdf`, but `cdf` itself is computed by integrating the PDF!
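As a small illustration of that chain (my own toy example, not something from the PR): a `scipy.stats.rv_continuous` subclass that defines only `_pdf` falls back to the generic `cdf` (numerical integration of the PDF) and the generic `ppf` (a root find on that `cdf`), so every inverse-transform sample is expensive.

```python
import numpy as np
from scipy import stats

class OnlyPdf(stats.rv_continuous):
    """Toy distribution specified only by its PDF: 3*x**2 on [0, 1]."""
    def _pdf(self, x):
        return 3.0 * x**2

dist = OnlyPdf(a=0.0, b=1.0, name="only_pdf")

# Each ppf call triggers a numerical root find on cdf, and each cdf
# call numerically integrates the PDF -- the bad case described above.
q = np.linspace(0.1, 0.9, 5)
print(dist.ppf(q))  # compare with the exact inverse, q ** (1/3)
```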
Option to avoid use of `ppf`
Following Saltelli 2010, the code uses a quasi-Monte Carlo method for sampling. I know very little about quasi-Monte Carlo, but I saw that the code calls `qmc.Sobol` and then samples using `ppf`.
As explained above, there may be cases where sampling via `ppf` is very inefficient, but the user has implemented a more efficient `rvs` method on their distribution.
How large is the benefit of using quasi-MC methods? Unless it is extremely large, it may be good to add an option to disable quasi-MC sampling but otherwise leave things unchanged. If quasi-MC is disabled, `rvs` would be used in `sample_A_B`.