Discussion of Sobol' indices (new in SciPy dev)

I was excited to see this merged SciPy PR, adding Sobol’ indices. They are a measure used in global sensitivity analysis (Wikipedia has good background). Thanks to the contributors.

I have just started trying out the feature; here are my initial reactions. I have a few suggestions, but feel free to take them or leave them.

func signature
The API I would have found most intuitive would have been that func takes d arguments. IMO it’s more explicit to break things down like this. It feels like what function arguments are for in the language.

Each argument can be a vector of length n so that those who want to take advantage of NumPy efficiencies within func can do so.

(Personally, my ideal API would be to allow these to be positional or keyword arguments, where dists would be a dictionary if they are to be treated as keyword arguments. But handling everything as positional seems common within SciPy, and it seems fine to follow that pattern here.)
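To make the comparison concrete, here is a sketch of the two styles (both function names are hypothetical); they compute the same thing, and both can be vectorised over the n samples:

```python
import numpy as np

# Merged API style (as I understand it): func receives one array of shape (d, n).
def f_single(x):
    return x[0] + x[1] ** 2

# Suggested style: one argument per parameter, each a vector of length n.
def f_per_param(a, b):
    return a + b ** 2

x = np.arange(6.0).reshape(2, 3)  # d=2 parameters, n=3 samples
assert np.allclose(f_single(x), f_per_param(x[0], x[1]))
```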

Using pre-existing samples
Would there be any point in allowing the user to provide pre-existing Monte Carlo samples and compute the indices from these, instead of sampling and simulating again?

I have looked at the paper Saltelli 2010 only briefly, and my understanding is limited, so please correct me if I’m blundering here. Looking at section 3 of the paper (but using the notation from the code), it seems like previous samples could be used:

  • Matrices A and B and their evaluations f_A and f_B could come from pre-existing simulations. This saves 2*d*n sampling operations and 2*n simulations.
  • However, computing f_AB needs to be done for the sole purpose of the sensitivity analysis. This requires d*n simulations.

The idea would be to allow dists to be a matrix of dimension (2n, d+s) of existing simulation runs. Then these would be split appropriately to generate A and B, and f_A and f_B.
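To sketch what that split might look like (this assumes the convention that AB^(i) is B with column i taken from A; Saltelli 2010 and the SciPy code may use the opposite or transposed convention):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
runs = rng.random((2 * n, d))  # stand-in for pre-existing simulation inputs

A, B = runs[:n], runs[n:]  # split the pooled matrix into the two base matrices
# AB^(i): B with column i replaced by the i-th column of A
AB = np.stack([np.concatenate([B[:, :i], A[:, i:i + 1], B[:, i + 1:]], axis=1)
               for i in range(d)])
assert AB.shape == (d, n, d)
```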

I can see two reasons to allow this:

  • Sometimes samples are all you have: you don’t know the mathematical specification of the distribution
  • Computational efficiency (discussed more below)

Computational efficiency

Insofar as d is large and simulation is more costly than sampling, the computation is dominated by the d*n simulations, and using pre-existing samples is not worth the hassle.

If d is small the whole thing is probably fast so not worth worrying about optimizing.

That leaves the case where sampling is much more expensive than simulation. This can happen when we are sampling a fully general distribution for which only inverse transform sampling is available. Each random sample requires numerically inverting the CDF (i.e. a call to ppf, which calls a numerical equation solver on the CDF). A very bad case occurs when only the PDF is known: a call to ppf calls a numerical equation solver on cdf, but cdf itself is computed by integrating the PDF!
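A minimal sketch of that worst case, where the distribution is known only through its PDF (the toy normal PDF and the bracket [-10, 14] are arbitrary choices for illustration):

```python
import numpy as np
from scipy import integrate, optimize, stats

pdf = stats.norm(loc=2, scale=1).pdf  # pretend only this PDF is known

def cdf(x):
    # Each cdf call numerically integrates the PDF.
    return integrate.quad(pdf, -np.inf, x)[0]

def ppf(q):
    # Each ppf call runs a root finder, and every iteration calls cdf above.
    return optimize.brentq(lambda x: cdf(x) - q, -10, 14)

assert abs(ppf(0.5) - 2.0) < 1e-5  # median of N(2, 1)
```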

Option to avoid use of ppf

Following Saltelli 2010, the code uses a quasi-Monte Carlo method for sampling. I know very little about quasi-Monte Carlo, but I saw that the code calls qmc.Sobol and then samples using ppf.

As explained above, there may be cases where sampling via ppf is very inefficient, but the user has implemented a more efficient rvs method on their distribution.

How large is the benefit from using quasi-MC methods? Unless it’s extremely large, it may be good to add an option to disable quasi-MC sampling but otherwise leave things unchanged. If quasi-MC is disabled, rvs would be used in sample_A_B.
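For concreteness, the two sampling paths might look roughly like this (I’m guessing at the internals; the dists and the power-of-two n are just for illustration):

```python
import numpy as np
from scipy import stats
from scipy.stats import qmc

dists = [stats.uniform(), stats.norm()]
n, d = 8, len(dists)

# Quasi-MC path (what the code does, as I read it): Sobol' points through ppf.
u = qmc.Sobol(d=d, seed=0).random(n)
sample_qmc = np.stack([dist.ppf(u[:, i]) for i, dist in enumerate(dists)], axis=1)

# Plain MC path (the suggested option): call each distribution's rvs directly.
rng = np.random.default_rng(0)
sample_mc = np.stack([dist.rvs(size=n, random_state=rng) for dist in dists], axis=1)

assert sample_qmc.shape == sample_mc.shape == (n, d)
```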


Regarding the pre-existing samples, if the concern is computational efficiency, alternatively one could go the other way: allow sobol_indices to return the samples for further use elsewhere in the program.

A few quick observations on the docs.


This method signature is not formatted in the generated HTML; is that on purpose or a mistake?


IMO these two comments are very generic and don’t add much, I would remove them:

In fact, the distribution is very important and should match the real distribution of the parameters.

The distribution of parameters depends on the application and should be carefully chosen


I’m being pedantic, but I would slightly reword this:

Parameter should be independently distributed meaning there should be no constraint nor relationship between input parameters values.

In the context of the documentation for the parameter dists, the word “should” could be confusing to beginners. It’s not that they “should” be independent; it’s that there is no way to specify them as anything other than independent, since it’s a list of marginals.

So maybe something like:

dists : list(distributions), optional

List of each parameter’s distribution. Parameters are assumed to be independently distributed.

Thank you @tadamcz for the feedback, it is really appreciated!

It’s a deliberate choice that func only takes a single argument, x. Here x is actually an array with all samples from a given matrix like A, so the user-defined function can take advantage of vectorisation if wanted.

As to why we don’t allow passing extra parameters: after seeing what happened in scipy.optimize, where we tried different APIs to let users pass args and kwargs, we saw that there is no point in doing so and it just adds to the complexity of the API. Hence our decision here. It’s very easy for a user to do whatever they want in the function itself: you can code a wrapper, use some caching, call a remote machine, etc. If you have a specific use case you need help with, I am happy to have a look and provide some guidance on how to do things.
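For instance, a wrapper closing over extra parameters might look like this (the names model, a and b are just for illustration):

```python
import numpy as np

def model(x, a, b):
    # Underlying model that wants extra parameters beyond the sample array x.
    return a * x[0] + b * x[1]

a, b = 2.0, 3.0
func = lambda x: model(x, a, b)  # wrapper fixes the extra arguments

x = np.ones((2, 5))  # d=2 parameters, n=5 samples
print(func(x))  # array of 5.0s
```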

It is possible: if you have a look at the doc of func, you can pass existing samples instead, using a dictionary. You could also make a function that uses some lookup mechanism to reuse pre-existing computations.

To be clear, I deliberately made this less visible and strongly resisted making it easier, as our experience on SALib and OpenTURNS showed that most of our users were doing very wrong things with this capability. It leads to very wrong results and misunderstanding.

The method really only works if you are careful about the way you produce the matrices. E.g. we have people who want to remove some samples from the matrices for some reason (like NaNs), and this is plain wrong and must not be done (the algorithm is not robust to that).

We had a bit of discussion around the efficiency of sampling. My rationale here is that this is not important. Even in high dimension, generating millions of samples takes no time (and our ppf are quite fast). The exception would be if you wanted to use SA in a hot loop, and I am not seeing that use case, though I am happy to discuss this further. But again, I really consider that the whole SA could take 10 s and that would still be acceptable, as this is typically done just once and the rest of the analysis takes much longer.

A lot! This is really important and drastically improves the convergence speed. Andrea and Sergei showed this a few times.

If the whole sampling part is really an issue, I am happy to have a closer look. We initially had a slightly different approach that allowed using our new fast samplers (using UNU.RAN and fast inverses).

If we get more feedback there, we could consider it. It’s actually present in the result object, just marked as private to tell users it’s for us, and there is no guarantee it will always be present.

So we are using a common theme which currently does not really allow nesting. I will raise that again on the theme side. This is something we have in a few places and we don’t have a good answer for that yet.

Any doc improvement is welcome; feel free to make a PR.

In that case maybe the two statements could be combined, but I would still keep a version of that information somewhere. A lot of users use a uniform distribution because they don’t care, and this leads to wrong interpretations (again, seen a lot on SALib).

Agreed, we should change this.

Thanks for the quick response! :slight_smile:

Also, because of a few of your comments, I thought it might be worth clarifying that most of the issues here are not things that I personally can’t do with the library and need it to change; they are more things that generally came to mind about the design of the program. But I really appreciate the offer to further discuss specific applications.

It is possible, if you have a look at the doc of func, you can pass existing samples instead using a dictionary.

Yes, I saw that, but you have to pass f_A, f_B, and f_AB – at that point (especially for generating f_AB) you might as well rewrite a lot of the internal code yourself. That’s why I suggested allowing dists to be a matrix of dimension (2n, d+s) of existing simulation runs as the API, if one were to add this feature.

The method is really only working if you are careful about the way you produce the matrices

I 100% agree we shouldn’t expect many users to correctly produce something like the f_AB matrix, which is quite complicated; you need to read the paper to understand what it is. The suggestion I described was only to allow passing a single matrix of simulation runs into dists. Can you help me see how this would be problematic? If you exclude certain rows from the samples you supply, you’re simply supplying a different distribution; it’s very transparent and clear what’s going on.

Perhaps worth saying that my general take is that we should avoid creating situations where it’s easy to misunderstand and shoot yourself in the foot, but I wouldn’t make big trade-offs to protect people who are absolutely determined to shoot themselves :).

our ppf are quite fast

My point is that it depends on the distribution. If for illustration I give you the following distribution, it’s 200 milliseconds for a single call to ppf:


import timeit

from scipy import stats

prior = stats.lognorm(1, 1)
likelihood_func = lambda x: stats.norm.pdf(x, 2, 1)


class Posterior(stats.rv_continuous):
    def __init__(self, prior, likelihood):
        self.prior = prior
        self.likelihood = likelihood
        super().__init__()

    def _pdf(self, x):
        return self.prior.pdf(x) * self.likelihood(x)


posterior = Posterior(prior, likelihood_func)

timeit_samples = 50
t = timeit.timeit(lambda: posterior.ppf(0.123), number=timeit_samples) / timeit_samples
print(f"ppf: {t:.3f} seconds")

To be fair, probably 99% of cases just use SciPy’s built-in distributions, which as you pointed out have quite fast ppf.

A lot of users are using uniform distribution because they don’t care

If that’s the distribution they also use in their Monte Carlo simulation, it may be an empirically inappropriate choice, but I don’t see how we as library authors can prevent people from making those choices. If you mean that they use a different distribution in their main analysis and then in the sensitivity analysis they just put uniform(0,1) for everything, I fear the problem with people’s understanding is much more fundamental…

A lot! This is really important and drastically improve the convergence speed. Andreas and Sergei showed this a few times.

This is cool to hear, especially if it’s true in a wide range of realistic use cases. Where can I read more about this? It somewhat clashes with my intuition so I’m especially curious to learn more.

It’s a deliberate choice that func only takes a single argument which is x. x here is actually an array with all samples from a given matrix like A. So the user defined function can take advantage of vectorisation if wanted.

Is vectorisation the primary reason, then, or simplicity? I completely agree with vectorisation along the n-samples dimension, so the suggestion, just to be clear, was d arguments, each of which is a vector of length n. That feels to me like a better API with most or all of the potential performance benefit. In general I don’t find it simpler to pass around a single big object of d dimensions versus d objects of 1 dimension; the complexity is the same, and it’s just a question of which one is more semantically appropriate (I agree that allowing both args and kwargs is a bit more complexity, but that’s a separate issue). It seems like your decision about the signature for func has been carefully considered and won’t change, so I’m not trying to change your mind but rather trying to understand better what I’m missing and how you see the trade-off.

Sometimes maybe samples are all you have; you don’t know the mathematical specification of the distribution

This is the other argument, apart from performance, for passing in existing samples; what do you think of that one? Maybe the case where you don’t even know the specification is quite artificial or unlikely, but I could imagine cases where it’s quite a hassle or error-prone to translate the distributions into SciPy rv_continuous format. Like suppose they’re generated by an external program outside Python, or just some gnarly code you didn’t write and don’t understand…

So we are using a common theme which currently does not really allow nesting. I will raise that again on the theme side. This is something we have in a few places and we don’t have a good answer for that yet.

I assumed it was something like that :slight_smile:

Any doc improvement is welcomed, feel free to make a PR.

OK great! If I do get around to this, I will put up a PR for only what I think are uncontroversial improvements to the docs.

PR for docs: DOC: Sobol' indices: corrections, clarifications, spelling & grammar by tadamcz · Pull Request #18659 · scipy/scipy · GitHub


Hi @tupui, I think I found a bigger mistake in the docs:

First order indices sum to 1

This is an error, right? I’m not going crazy?

I’ll put up another commit in the same PR to fix it.

The issue is that you need a way to then split the matrix of runs into the matrices. I don’t see how you would be able to do this for the user. Runs must be constructed to strictly follow the requirements of the matrices. A and B (and the resulting f_A, f_B) are easy to get from any matrix if you assume the samples are independent (but unless you use QMC, convergence will be far from optimal). But then how do you extract AB (f_AB) from that? The only way to link the samples properly would be to also provide the matrices A and B so that you can find the intersections and get AB. That sounds like an expensive search.

SALib allows passing a single matrix, but it is strictly the concatenation of f_A, f_B, f_AB. The only thing is that we have seen countless issues with people just giving any matrix of runs without respecting anything, hence results that are very wrong.

The truth is that you cannot really reuse existing samples. Without having AB in mind, there is just no way you would actually find any sample belonging to AB in your sample set, especially as the dimensionality grows.

So the rationale for now is to say: you want more control and to use samples you created on the side? Fine, but provide the matrices separately so we know for sure that you generated them properly.

We have (very) fast methods for the ppf. Also, you should use an object for the definition of your distribution, and ppf supports passing an array for x.

import numpy as np
from scipy import stats
from scipy.stats.sampling import NumericalInversePolynomial


class Posterior:
    prior = stats.lognorm(s=1, loc=1)
    likelihood = stats.norm(loc=2, scale=1)

    def pdf(self, x):
        return self.prior.pdf(x) * self.likelihood.pdf(x)


dist = Posterior()
urng = np.random.default_rng()
rng = NumericalInversePolynomial(dist, center=1, random_state=urng)
rng.ppf(...)  # this is very fast

This is the general statement about QMC. A lot of papers from Andrea and Sergei, for instance, use the Sobol’ sequence to generate the samples. Why does this surprise you?

As the dimensionality grows, you will cover the space better with QMC, hence get a better chance at capturing the true variability of the function.
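A quick way to see this is to compare discrepancy, a measure of how unevenly points cover the unit hypercube (lower is better). This sketch uses scipy.stats.qmc.discrepancy with arbitrary seeds:

```python
import numpy as np
from scipy.stats import qmc

n, d = 64, 2
sobol = qmc.Sobol(d=d, seed=7).random(n)          # scrambled Sobol' points
pseudo = np.random.default_rng(7).random((n, d))  # plain pseudorandom points

disc_sobol = qmc.discrepancy(sobol)
disc_pseudo = qmc.discrepancy(pseudo)
assert disc_sobol < disc_pseudo  # QMC covers the space more evenly
```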

OK, I understand what you mean now. Yes, this was carefully thought through, and it’s more efficient to have a single matrix of size (d, n) (memory-wise, and to be able to vectorise along dimensions). The other close contestant was (n, d), which is what samplers like scipy.stats.qmc.Sobol would typically give. But for a user, and for how Python and NumPy actually work, it’s just better to be able to refer to a single dimension by doing x[1] instead of x[:, 1]. So it’s very clear and simpler for a user to decompose variables inside the function.
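The indexing difference in a nutshell (a tiny sketch, just to spell out the shapes):

```python
import numpy as np

n, d = 5, 3
x_dn = np.arange(d * n).reshape(d, n)  # shape (d, n): one row per parameter
x_nd = x_dn.T                          # shape (n, d): typical sampler output

# With (d, n), parameter 1 is a simple row lookup:
assert np.array_equal(x_dn[1], x_nd[:, 1])
```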

You should almost never do that… rv_continuous is really not user friendly, and we are actually thinking about making a new API for distributions (we have a grant for that, another maintainer and I; you can see the other post on this forum with the SciPy tag :slight_smile: ). This is why for this function we only require that each dist is any class which has a ppf method. I really want to keep the QMC part there, so I would ask people, at least until we have our new API, to use things like NumericalInversePolynomial. But if we do get multiple people asking for more flexibility around custom distributions, we will think about this further.

Thank you for opening a PR and spotting that mistake :rocket: . Yes, we rewrote this bit a few times, and it seems we just ended up with some confusing and contradictory statements :sweat_smile: