I was excited to see this merged SciPy PR, adding Sobol’ indices. They are a measure used in global sensitivity analysis (Wikipedia has good background). Thanks to the contributors.

I have just started to try out the feature; here are my initial reactions. I have a few suggestions, but feel free to take them or leave them.

**`func` signature**

The API I would have found most intuitive would have been that `func` takes `d` arguments. IMO it’s more explicit to break things down like this. It feels like what function arguments are for in the language.

Each argument can be a vector of length `n` so that those who want to take advantage of NumPy efficiencies within `func` can do so.

(Personally, my ideal API would be to allow these to be positional or keyword arguments, where `dists` would be a dictionary if they are to be treated as keyword arguments. But handling everything as positional seems common within SciPy, and it seems fine to follow that pattern here.)
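To make the suggestion concrete, here is a rough sketch of how a per-argument function could be adapted to the merged API. This assumes I’m reading the new API correctly, i.e. that `func` receives a single array of shape `(d, n)` and should return an array of shape `(s, n)`; the Ishigami function is just an illustrative example.

```python
import numpy as np
from scipy import stats

# A function written in the per-argument style I'd find most intuitive:
# d separate arguments, each a vector of length n, so NumPy vectorization
# still applies within the function.
def ishigami(x1, x2, x3):
    return np.sin(x1) + 7 * np.sin(x2) ** 2 + 0.1 * x3**4 * np.sin(x1)

# Thin adapter to the merged API, which (as I understand it) passes func a
# single array of shape (d, n) and expects an output of shape (s, n).
def func(x):
    return ishigami(*x)[np.newaxis, :]  # unpack the d rows; output shape (1, n)

res = stats.sobol_indices(
    func=func,
    n=1024,  # number of samples; I believe this must be a power of 2
    dists=[stats.uniform(loc=-np.pi, scale=2 * np.pi)] * 3,
)
```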

**Using pre-existing samples**

Would there be any point in allowing the user to provide pre-existing Monte Carlo samples and compute the indices from these, instead of sampling and simulating again?

I have only looked briefly at the Saltelli 2010 paper, and my understanding is limited, so please correct me if I’m blundering here. Looking at section 3 of the paper (but using the notation from the code), it seems like previous samples could be used:

- Matrices `A` and `B` and their evaluations `f_A` and `f_B` could come from pre-existing simulations. This saves `2*d*n` sampling operations and `2*n` simulations.
- However, computing `f_AB` needs to be done for the sole purpose of the sensitivity analysis. This requires `d*n` simulations.

The idea would be to allow `dists` to be a matrix of dimension `(2n, d+s)` of existing simulation runs. Then these would be split appropriately to generate `A` and `B`, and `f_A` and `f_B`.
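To illustrate what I mean by “split appropriately”, here is a rough sketch. `split_existing_runs` is a hypothetical helper (nothing like it exists in the PR), and I’m assuming the convention that the first `d` columns hold the sampled inputs and the remaining `s` columns hold the corresponding model outputs.

```python
import numpy as np

def split_existing_runs(runs, d):
    """Hypothetical helper: split a (2n, d + s) array of pre-existing
    simulation runs into the matrices A, B and their evaluations f_A, f_B."""
    runs = np.asarray(runs)
    n = runs.shape[0] // 2
    A, f_A = runs[:n, :d], runs[:n, d:]   # first n runs give A and f_A
    B, f_B = runs[n:, :d], runs[n:, d:]   # last n runs give B and f_B
    # f_AB would still have to be computed by running func on the d "AB_i"
    # matrices -- the d*n simulations that cannot be avoided.
    return A, B, f_A, f_B
```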

I can see two reasons to allow this:

- Sometimes samples are all you have: you don’t know the mathematical specification of the distribution
- Computational efficiency (discussed more below)

*Computational efficiency*

Insofar as `d` is large and simulation is more costly than sampling, the computation is dominated by the `d*n` simulations, and using pre-existing samples is not worth the hassle.

If `d` is small, the whole thing is probably fast, so it’s not worth worrying about optimizing.

That leaves the case where sampling is much more expensive than simulation. This can happen if we are sampling a fully general distribution where only inverse transform sampling is available: each random sample requires numerically inverting the CDF (i.e. a call to `ppf`, which runs a numerical equation solver on the CDF). A very bad case occurs if only the PDF is known: a call to `ppf` runs a numerical equation solver on `cdf`, but `cdf` itself is computed by integrating the PDF!
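To illustrate that worst case, here is a small example with a custom `scipy.stats` distribution that defines only its PDF; as far as I know, `rv_continuous` then falls back to computing `cdf` by quadrature over the PDF and `ppf` by root-finding on that `cdf`, so every `ppf` call nests a solver around an integration. The distribution itself is just a toy.

```python
import numpy as np
from scipy import stats

class OnlyPdf(stats.rv_continuous):
    """Toy distribution where only the PDF is known: pdf(x) = 2x on [0, 1]."""
    def _pdf(self, x):
        return 2 * x

dist = OnlyPdf(a=0.0, b=1.0, name="only_pdf")

# Each ppf evaluation here runs a root-finder whose objective integrates the
# PDF numerically -- exactly the expensive nesting described above.
print(dist.ppf([0.25, 0.5, 0.75]))  # analytically sqrt(q): [0.5, 0.7071..., 0.8660...]
```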

**Option to avoid use of `ppf`**

Following Saltelli 2010, the code uses a quasi-Monte Carlo method for sampling. I know very little about quasi-Monte Carlo, but I saw that the code calls `qmc.Sobol` and then samples using `ppf`.

As explained above, there may be cases where sampling via `ppf` is very inefficient, but the user has implemented a more efficient `rvs` method on their distribution.

How large is the benefit of using quasi-MC methods? Unless it’s extremely large, it may be good to add an option to disable quasi-MC sampling but otherwise leave things unchanged. If quasi-MC is disabled, `rvs` would be used in `sample_A_B`.
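For concreteness, here is a rough sketch of the kind of plain-Monte-Carlo alternative I have in mind. `sample_A_B_plain_mc` is hypothetical and not the PR’s internal `sample_A_B`; it just shows how `rvs` could replace the `qmc.Sobol` + `ppf` path when the option is disabled.

```python
import numpy as np
from scipy import stats

def sample_A_B_plain_mc(n, dists, rng=None):
    """Hypothetical plain-MC sampler: draw the A and B matrices directly from
    each distribution's rvs instead of pushing a Sobol' sequence through ppf."""
    rng = np.random.default_rng(rng)
    d = len(dists)
    A_B = np.empty((2 * n, d))
    for i, dist in enumerate(dists):
        A_B[:, i] = dist.rvs(size=2 * n, random_state=rng)
    return A_B[:n], A_B[n:]  # A, B

# Example with d = 2 inputs
A, B = sample_A_B_plain_mc(8, [stats.norm(), stats.uniform()], rng=0)
```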