Alternative solution to modifications of __module__

Following NumPy and SciPy, pandas modified the __module__ values on various objects (mostly classes and functions) to point to the public API location rather than to where they are defined in the code. We then realized that this stopped many of our doctests from running: doctest discovery breaks when the __module__ of a class disagrees with the __module__ of its methods, and the methods were not modified. For more technical details, see the references at the bottom.
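
For instance, here is a minimal reproduction of the discovery failure (the class and method names are placeholders; run it as a script):

```python
import doctest
import sys

class A:
    def m(self):
        """
        >>> 1 + 1
        2
        """

mod = sys.modules[__name__]
finder = doctest.DocTestFinder()
print([t.name for t in finder.find(mod)])  # ['__main__.A.m']

A.__module__ = "some.public.api"  # what the __module__ rewrite effectively does
print([t.name for t in finder.find(mod)])  # []: A no longer appears to be
                                           # defined in this module, so its
                                           # methods are never visited
```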

Even though we were able to hack around doctest discovery, the modification of __module__ breaks tools that rely on this dunder to point to the code where the object is defined. For example, the source link in the pandas docs no longer works; e.g. DataFrame now points to pandas/__init__.py rather than pandas/core/frame.py. Various NumPy objects (such as ufunc) suffer from this as well. For the doctests themselves, the output on failure used to point to the location where the doctest is defined, but it no longer can - you must grep the codebase to locate the doctest.

Due to this, I think it would be good to seek an alternative solution to the modification of __module__. I think we’ll need to propose a change to Python itself, but I wanted to first make and refine the proposal here (including consideration of any other orthogonal proposals).

My proposal is to add two new optional dunders, __public_module__ and __public_qualname__. The behavior is as follows.

  • When not explicitly added, __public_module__ will use its parent’s __public_module__ (e.g. a class attribute falls back to the class) when one exists and will use __module__ when it does not.
  • When not explicitly added, __public_qualname__ will use the object’s __name__ in conjunction with its parent’s __public_qualname__ if one exists.
  • The determination of both __public_module__ and __public_qualname__ is done at runtime.
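
To make the intended fallback concrete, here is a rough Python-level sketch of the lookup (the helper names are mine and purely illustrative; the real logic would live in CPython):

```python
def public_module(obj, parent=None):
    # An explicitly set value wins.
    try:
        return obj.__dict__["__public_module__"]
    except (AttributeError, KeyError):
        pass
    # Otherwise fall back to the parent (e.g. the class, for a method)...
    if parent is not None:
        return public_module(parent)
    # ...and finally to plain __module__.
    return obj.__module__

def public_qualname(obj, parent=None):
    try:
        return obj.__dict__["__public_qualname__"]
    except (AttributeError, KeyError):
        pass
    if parent is not None:
        return f"{public_qualname(parent)}.{obj.__name__}"
    return obj.__qualname__

class Frame:
    __public_module__ = "mylib"  # actually defined in, say, mylib.core.frame
    def head(self): ...

public_module(Frame.head, parent=Frame)    # -> "mylib"
public_qualname(Frame.head, parent=Frame)  # -> "Frame.head"
```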

Tools that want to surface the user-facing location of the object (e.g. REPLs, docs) will use __public_module__ and __public_qualname__, whereas tools that need the location of the object in the code will continue to use __module__ and __qualname__.

The reason for using the parent's dunder is so that projects do not need to modify every attribute of a class in order to point to a public API location. I also think this should be done at runtime so that there is no impact on import time; I believe the performance cost in the cases where this is used (REPLs, docs) is negligible.

Part of this proposal is to not modify the behavior of the default __repr__. Specific tools will need to opt into using __public_module__ and __public_qualname__, which is additional effort, but I do not think we should change the behavior of __repr__ for this.

Though the original issue is fully solved with __public_module__, @mbussonn identified that there can be cases where the name of the object is also changed, which gave rise to the addition of __public_qualname__.

References:

If __public_qualname__ becomes a thing then, as I mentioned on the IPython issue, please use a different marker than . to split the module part from the object part. Here is an example of the issue and the object ambiguity it would lift.

As I mentioned in the IPython issue, even if this is just a pandas thing I'm happy to add support for it.

There already is an established symbol for this, and it is :, as used by the entry-point specification.
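
To spell out the ambiguity: a.b.c.d alone could mean object d in module a.b.c, or object b.c.d in module a. With the entry-point form there is exactly one reading, and resolution is trivial; a sketch:

```python
import importlib

def resolve(location):
    """Resolve an entry-point style "module:qualname" string to an object."""
    module_name, _, qualname = location.partition(":")
    obj = importlib.import_module(module_name)
    if qualname:
        for attr in qualname.split("."):
            obj = getattr(obj, attr)
    return obj

resolve("collections:OrderedDict.fromkeys")  # unambiguous split
```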

For some functions it actually still works, for IPython at least, I think (via __wrapped__ and __code__, which has co_* attributes with all the relevant information).
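
That is, roughly this kind of lookup, which works for pure-Python (and wrapped) functions regardless of what __module__ says:

```python
import functools
import inspect

def _original(x):  # imagine this living in a private module
    return x

@functools.wraps(_original)  # wraps() records the wrapped function in __wrapped__
def exported(x):
    return _original(x)

func = inspect.unwrap(exported)  # follow the __wrapped__ chain back
code = func.__code__
print(code.co_filename, code.co_firstlineno)  # true source location,
                                              # independent of __module__
```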

In general, __public_qualname__ seems the wrong way around to me. Rather, it would make sense to me to have a __code_location__ or so, to address the specific problem we want to solve (that tools don't find the code, because they look in the correct place for the symbol but not for the code itself).
__public_module__ makes sense from a minimal-change perspective, I admit. But since it is mainly tools that are confused by this (for users, __module__ pointing to the API entry point is better, as it plausibly is for the default pickling), I would say it is nicer to solve those tools' needs rather than to lie about the best definition for __module__.

CPython has defined __module__ as the place where the object is defined (reference). In my opinion, the tools are doing it right and we’re the ones who created the problem by modifying __module__.

If we were to instead add something akin to __code_location__, then in addition to breaking backwards compatibility, I think one needs to consider how large a change this would be in CPython itself; grepping through the code, it looks quite substantial. I feel quite negative about this, and I imagine CPython devs would as well.

If we were starting from scratch, I personally don’t see any reason to prefer something akin to __code_location__ / __module__ over __module__ / __public_module__.

I suppose we have a different opinion on which information is more important: the place where the code lives, or the place where the object should be found.

Updating __module__ is rare (although people have called it a bug and demanded it in NumPy), but I think it has advantages, too:

  • Using the public entrypoint for __module__ (i.e. setting it) means it is stable across versions. This is useful for two things:
    • Things like __array_function__ or torch dynamo/numba, which want to recognize e.g. the NumPy public API.
    • pickle defaults to __module__:__qualname__ for unpickling. Setting __module__ means that code will still unpickle if you refactor your code.

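One can see what pickle records with pickletools - a by-name reference that is re-imported at load time:

```python
import pickle
import pickletools

pickletools.dis(pickle.dumps(pickle.PickleError))
# The relevant part of the output:
#     SHORT_BINUNICODE 'pickle'
#     SHORT_BINUNICODE 'PickleError'
#     STACK_GLOBAL
# Unpickling does the equivalent of
# getattr(importlib.import_module("pickle"), "PickleError"), so old pickles
# keep loading across refactors only if that (module, qualname) pair stays
# importable - which is what setting __module__ to the public API buys.
```
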
While I understand it is the typical thing, the original module is useful only in the context of finding the original code. Examples are:

  • The sphinx [source] button.
  • IPython help, which can show the source code (which is nice, at least for power users).
  • Some other tooling (e.g. doctests) can benefit from it.

I suppose I just lean towards the first set of uses being the more interesting (e.g. due to being stable!) definition for __module__, while the second set should all go through inspect.getsourcelines()/inspect.getfile(), etc.
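
For what it's worth, the current asymmetry can be seen directly (a synthetic stand-in for the DataFrame situation):

```python
import collections
import inspect

class C:  # stand-in for e.g. DataFrame, defined in this file
    def m(self):
        pass

C.__module__ = "collections"  # simulate pointing __module__ at a public API

print(inspect.getfile(C.m))  # this file: function lookup goes through
                             # __code__.co_filename and is unaffected
print(inspect.getfile(C))    # .../collections/__init__.py: class lookup
                             # resolves sys.modules[C.__module__].__file__
```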

This one I truly don't really believe. I think that all of these tools should use inspect.<...> (or do the equivalent thing), which means the main place that needs changing is the inspect module, to prefer __code_location__ (or whatever makes sense).
I think all other places that use __module__ within Python would be better off using the actual public API location.

Admittedly, I am coming a bit from the NumPy perspective, where we have already set __module__ for many years and it didn't seem all that disruptive, while changing it back to __public_module__ would be disruptive, because it would need care not to break the first points.

I think I agree with @seberg that CPython historically did not pay much attention to extension modules and compiled code, hence they always come as an afterthought. Because of this, we always look like we are deliberately breaking the protocols of dunders and other established practices. But it very often stems from real needs (code refactoring, implementation-language changes, pushing things to Cython/C/Pythran…).

So if we were to change where things are defined, many tools would break and would often point to impossible places. To give an example, I would not know how to point to C code from __module__ as the definition location. Especially if it is declared in a C header in one place but the extension module includes it somewhere else. In that sense, a code location would indeed be easier.

But then, would it point to the source code or the binary object? What if it is a submodule that we use inside the binary? I don't think any of these granular issues we work with are addressed by the __module__ mechanism or expected to work.

I am not blaming CPython for not considering these, but I think that with compiled-code inclusion, a lot of conveniences go out the window.

Indeed these are useful tools, and it would have been very nice if they worked at least over the source code. But I'm not sure even inspect can help, other than for the ones that are thinly wrapped.

I don’t see “more important” as being even well-defined here. While I do agree __module__ vs __public_module__ is some indication that the former is more important, due to the lack of a prefix, I do find it quite weak. And in my opinion, all of this is trumped by backwards compatibility.

I don’t understand this, can you elaborate?

This is indeed a good point. However, from my personal experience, pickles are quite fragile, and I would not generally expect a pickle written with one version of a package to be loadable with another version. As such, I do not find it so compelling. However, I am open to adding to this proposal a change to pickle to use __public_module__.

I am also open to __module__ becoming the public location and adding some other dunder for tools to use as you are proposing, if it is well received by the community (CPython + tools). I don’t see that being the case, but I’m more than happy to be wrong here.

I do not have much experience looking at CPython internals, but this is at least not clear to me; some examples:

There are many occurrences like this. Looking through CPython, m_module is also used to store the __module__ value; I'm not sure if there are others.

I do not understand how this is relevant to this proposal. Can you explain more?

@seberg - you’ve proposed using __code_location__ above. I think that, to determine this value based on what’s available in CPython today, it would be __code_location__ = f"{__module__}:{__qualname__}" - is that right? So the full alternate proposal you’re suggesting is:

  • Add __code_location__ as described immediately above.
  • Make __module__ be what is referred to as __public_module__ in the OP.
  • Make __qualname__ be what is referred to as __public_qualname__ in the OP.
  • Have no equivalent of __module__ and __qualname__ as they exist in CPython today; these can be parsed from __code_location__ using the : separator when needed.

Is this correct?