I have just started to try out the feature, so here are my initial reactions. I have a few suggestions, but feel free to take them or leave them.
The API I would have found most intuitive would have been for `func` to take `d` arguments. IMO it's more explicit to break things down like this; it feels like what function arguments are for in the language. Each argument can be a vector of length `n`, so that those who want to take advantage of NumPy efficiencies within `func` can do so.
(Personally, my ideal API would be to allow these to be positional or keyword arguments, where `dists` would be a dictionary if they are to be treated as keyword arguments. But handling everything as positional seems common within SciPy, and it seems fine to follow that pattern here.)
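To make the suggestion concrete, here is a sketch of the call signature I have in mind. This is hypothetical, not the current API: `sobol_indices`, its `n` and `dists` parameters, and the example model are all assumptions for illustration.

```python
import numpy as np

# Hypothetical: func receives d separate arguments, each a length-n vector,
# so a NumPy-vectorised model works unchanged.
def func(x1, x2, x3):
    # example model, fully vectorised over the n samples
    return np.sin(x1) + 7 * np.sin(x2) ** 2 + 0.1 * x3 ** 4 * np.sin(x1)

# positional form: dists is a sequence, matched to func's arguments by order
# sobol_indices(func=func, n=1024, dists=[dist1, dist2, dist3])

# keyword form: dists is a dict, matched to func's parameters by name
# sobol_indices(func=func, n=1024,
#               dists={"x1": dist1, "x2": dist2, "x3": dist3})
```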
## Using pre-existing samples
Would there be any point in allowing the user to provide pre-existing Monte Carlo samples and compute the indices from these, instead of sampling and simulating again?
I have only looked at the Saltelli 2010 paper briefly and my understanding is limited, so please correct me if I'm blundering here. Looking at section 3 of the paper (but using the notation from the code), it seems like previous samples could be used:

- The sample matrices `A` and `B` and their evaluations `f_A` and `f_B` could come from pre-existing simulations. This saves `2*d*n` sampling operations and `2*n` simulations.
- However, computing `f_AB` needs to be done for the sole purpose of the sensitivity analysis. This requires `d*n` simulations.
The idea would be to allow `dists` to be a matrix of dimension `(2n, d+s)` of existing simulation runs. Then these would be split appropriately to generate `A`, `B`, `f_A`, and `f_B`.
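The split I have in mind would look something like this (hypothetical layout: the first `d` columns hold the sampled inputs and the remaining `s` columns the model outputs):

```python
import numpy as np

# Toy stand-in for 2n past simulation runs with d inputs and s outputs.
n, d, s = 4, 2, 1
runs = np.arange(2 * n * (d + s), dtype=float).reshape(2 * n, d + s)

A, f_A = runs[:n, :d], runs[:n, d:]  # first n rows -> A and f_A
B, f_B = runs[n:, :d], runs[n:, d:]  # last n rows  -> B and f_B

# Only the f_AB evaluations (d * n extra model runs) would still be needed.
```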
I can see two reasons to allow this:
- Sometimes samples are all you have: you don't know the mathematical specification of the distribution
- Computational efficiency (discussed more below)
- If `d` is large and simulation is more costly than sampling, the computation is dominated by the `d*n` simulations, and using pre-existing samples is not worth the hassle.
- If `d` is small, the whole thing is probably fast, so not worth worrying about optimizing.
That leaves the case where sampling is much more expensive than simulation. This can happen if we are sampling from a fully general distribution where only inverse transform sampling is available: each random sample requires numerically inverting the CDF (i.e. a call to `ppf`, which invokes a numerical equation solver on the CDF). A very bad case occurs if only the PDF is known: then a call to `ppf` invokes a numerical equation solver on `cdf`, and `cdf` itself is computed by integrating the PDF!
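This worst case is easy to reproduce with SciPy's generic distribution machinery: subclassing `rv_continuous` and supplying only `_pdf` means the generic `cdf` integrates the PDF numerically, and the generic `ppf` root-finds over that integrated CDF. The class below is my own toy example.

```python
import numpy as np
from scipy import stats

# A distribution specified only by its PDF (here: the standard normal
# density, but pretend no closed-form cdf/ppf is known).
class OnlyPDF(stats.rv_continuous):
    def _pdf(self, x):
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

dist = OnlyPDF(name="only_pdf")

# Each call triggers root-finding wrapped around numerical quadrature,
# so drawing many samples this way is very slow.
x = dist.ppf(0.5)
```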
## Option to avoid use of `ppf`
Following Saltelli 2010, the code uses a quasi-Monte Carlo method for sampling. I know very little about quasi-Monte Carlo, but I saw that the code calls `qmc.Sobol` and then samples using `ppf`.
As explained above, there may be cases where sampling via `ppf` is very inefficient, but the user has implemented a more efficient `rvs` method on their distribution.
How large is the benefit from using quasi-MC methods? Unless it's extremely large, it may be good to add an option to disable quasi-MC sampling but otherwise leave things unchanged. If quasi-MC is disabled, `rvs` would be used in place of `ppf`.
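For illustration, here are the two sampling paths side by side. The first mirrors how I understand the current code to work (scrambled Sobol' points pushed through each marginal's `ppf`); the second is the hypothetical opt-out using `rvs`. The specific marginals are my own choices.

```python
import numpy as np
from scipy import stats
from scipy.stats import qmc

rng = np.random.default_rng(1234)
n, d = 1024, 2  # n a power of 2, as Sobol' sampling prefers

# Quasi-MC path: Sobol' points in [0, 1)^d, transformed via each ppf.
u = qmc.Sobol(d=d, seed=rng).random(n)
sample_qmc = np.column_stack([stats.norm.ppf(u[:, 0]),
                              stats.expon.ppf(u[:, 1])])

# Hypothetical opt-out: plain Monte Carlo via each distribution's rvs,
# which a user may have implemented far more efficiently than ppf.
sample_mc = np.column_stack([stats.norm.rvs(size=n, random_state=rng),
                             stats.expon.rvs(size=n, random_state=rng)])
```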