How to format mathematical expressions?

tupui · July 7, 2021, 7:39am

I would like to propose a discussion about styling equations.

It is known that PEP8 and other established style documents lack guidelines for maths. Hence everyone comes up with their own interpretation and style. I believe I do not have to motivate the benefits of having a common coding style in general, as this is well established (easier to share, code is familiar and coherent across the ecosystem, etc.).

I think such a document is missing from the scientific community and my hope is that we can all agree on something.

To be transparent: there are heated discussions on SciPy around this topic and formatting tools. Indeed, having extensive guidelines (like PEP8) could allow such tools to implement a mathematical style. Opinions vary on the feasibility of, and the need for, styling maths. I still think this is a good idea and has value.

To kick-start things, here are some ideas. DISCLAIMER: I do not claim these are correct, this is just to start the discussion. Feel free to rewrite everything. I am just proposing an idea which I believe could help the community in general. I hope we have productive chats here.

Formatting Mathematical Expressions

To format mathematical expressions, the following rules must be followed. These rules respect and complement PEP8 (relevant sections include id20 and id28).

If operators with different priorities are used, add whitespace around the operators with the lowest priority(ies).
There is no space before and after **.
There is no space before and after the operators * and /. The only exception is if the expression consists of a single operator linking two groups.
There is a space before and after - and +, except if: (i) the operator is used to define the sign of the number; (ii) the operator is used in a group to mark higher priority.
When splitting an equation, new lines should start with the operator linking the previous and next logical block. A line holding only a single digit or only a bracket is forbidden. Use the available horizontal space as much as possible.

# Correct:
i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a+b) * (a-b)
dfdx = sign*(-2*x + 2*y + 2)
result = 2 * x**2 + 3 * x**(2/3)
y = 4*x**2 + 2*x + 1
c_i1j = (1./n**2.
         * np.prod(0.5*(2.+abs(z_ij[i1, :])
                        + abs(z_ij) - abs(z_ij[i1, :]-z_ij)), axis=1))

# Wrong:
i=i+1
submitted +=1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
dfdx = sign * (-2 * x + 2 * y + 2)
result = 2 * x ** 2 + 3 * x ** (2 / 3)
y = 4 * x ** 2 + 2 * x + 1
c_i1j = (1.
         / n ** 2.
         * np.prod(0.5 * (2. + abs(z_ij[i1, :])
                          + abs(z_ij) - abs(z_ij[i1, :] - z_ij)), axis=1))

rgommers · July 8, 2021, 7:35pm

Thanks for kicking this off @tupui. It’d be nice to have something a little more detailed than PEP 8 (like “no spaces around the ** operator” is missing from PEP 8), and hopefully that will help tool authors to improve how they format numerical code.

Most of your Correct/Wrong examples are clear. The exception is c_i1j. I think it’s very difficult to say which version is better, and also hard to create any rules for your “correct” version. For example, in

         1./n**2.
         * np.prod(...

the precedence for / and * is the same, but only one of the two operators has spaces around it.
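The point about equal precedence can be checked with the standard `ast` module. This is only a small sketch: `same_tree` is a hypothetical helper name, and `p` stands in for the `np.prod(...)` call. It shows that whitespace carries no meaning for the parser, and that `/` and `*` group left to right.

```python
import ast


def same_tree(a, b):
    """Compare two expressions by their parsed structure, ignoring layout."""
    return ast.dump(ast.parse(a, mode="eval")) == ast.dump(ast.parse(b, mode="eval"))


# Spacing never changes how the expression is parsed.
print(same_tree("1./n**2. * p", "1. / n ** 2. * p"))

# / and * share one precedence level and group left to right,
# so the division happens before the multiplication.
print(same_tree("1./n**2. * p", "(1./(n**2.)) * p"))
print(same_tree("1./n**2. * p", "1./(n**2. * p)"))
```

The first two comparisons hold and the third does not, so the tighter spacing around `/` in the "correct" version is purely a visual hint.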

stefanv · July 15, 2021, 5:30pm

Thanks for raising this issue @tupui!

What Ralf wrote makes me think that there’s some stylistic input into this, making it very hard to automatically format correctly unless you insert spaces everywhere like black does:

i = i + 1
submitted += 1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
dfdx = sign * (-2 * x + 2 * y + 2)
result = 2 * x ** 2 + 3 * x ** (2 / 3)
y = 4 * x ** 2 + 2 * x + 1
c_i1j = (
    1.0
    / n ** 2.0
    * np.prod(
        0.5 * (2.0 + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)), axis=1
    )
)

So, perhaps the ideal checker would do what @rkern mentioned on the SciPy issue: ensure that PEP8 (or some superset of that) is conformed to, but not make adjustments where that is already the case.
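A report-only check of that kind could look roughly like the sketch below. It is not an existing tool: the function names are made up for illustration, and it tests just one PEP 8 recommendation (the same amount of whitespace on both sides of a binary operator). It never rewrites anything, so code that already conforms is left alone.

```python
import io
import keyword
import tokenize

# Binary arithmetic and augmented-assignment operators covered by this sketch.
OPERATORS = {"+", "-", "*", "/", "//", "%", "@", "**",
             "+=", "-=", "*=", "/=", "//=", "%=", "@=", "**="}
OPERAND_TYPES = (tokenize.NAME, tokenize.NUMBER, tokenize.STRING)
SKIP_AFTER = (tokenize.NL, tokenize.NEWLINE, tokenize.COMMENT)


def ends_operand(tok):
    """True if `tok` can end an operand, so a following operator is binary."""
    if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
        return tok.string in ("None", "True", "False")
    return tok.type in OPERAND_TYPES or tok.string in (")", "]", "}")


def asymmetric_operators(source):
    """Return (row, col, operator) for operators with unequal spacing on each side."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    found = []
    for prev, tok, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if tok.type != tokenize.OP or tok.string not in OPERATORS:
            continue
        if not ends_operand(prev) or nxt.type in SKIP_AFTER:
            continue  # unary use, or operator at the end of a line
        if prev.end[0] != tok.start[0] or nxt.start[0] != tok.end[0]:
            continue  # neighbours sit on other lines
        before = tok.start[1] - prev.end[1]
        after = nxt.start[1] - tok.end[1]
        if before != after:
            found.append((tok.start[0], tok.start[1], tok.string))
    return found


print(asymmetric_operators("submitted +=1\n"))
print(asymmetric_operators("x = x*2 - 1\nhypot2 = x * x + y * y\n"))
```

Both `x*2 - 1` and `x * x + y * y` pass, since each operator is spaced evenly; only `submitted +=1` is reported.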

tupui · July 15, 2021, 5:54pm

Thank you @rgommers and @stefanv.

Indeed I should have written this

# Correct:
i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a+b) * (a-b)
dfdx = sign*(-2*x + 2*y + 2)
result = 2*x**2 + 3*x**(2/3)
y = 4*x**2 + 2*x + 1
c_i1j = (1./n**2.
         *np.prod(0.5*(2.+abs(z_ij[i1, :])
                       + abs(z_ij) - abs(z_ij[i1, :]-z_ij)), axis=1))

I agree that it would be difficult not just to code such a system, but also simply to write code following these rules. For this to work, it has to be simple. I guess I would personally be OK with what Black does, but without spaces around **, /, * and with spaces otherwise. This would be simpler to use, as you wouldn’t have to count and check which operator has the highest priority, etc.

i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a + b)*(a - b)
dfdx = sign*(-2*x + 2*y + 2)
result = 2*x**2 + 3*x**(2/3)
y = 4*x**2 + 2*x + 1
c_i1j = (
    1.0
    /n**2.0
    *np.prod(
        0.5*(2.0 + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)), axis=1
    )
)
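To show how little logic this simpler rule needs, here is a rough sketch built on the standard `ast` module (Python 3.9+ for `ast.unparse`). It is not how Black works; `format_expression` and the tables are hypothetical names, it handles only `+`, `-`, `*`, `/`, `**` and unary minus, and it does not descend into call arguments or subscripts.

```python
import ast

# Binding strength of the operators handled here (higher binds tighter).
PRECEDENCE = {ast.Add: 1, ast.Sub: 1, ast.Mult: 2, ast.Div: 2, ast.Pow: 4}
SYMBOL = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/", ast.Pow: "**"}
UNARY_PRECEDENCE = 3
TIGHT = (ast.Mult, ast.Div, ast.Pow)  # written without surrounding spaces
ATOMS = (ast.Name, ast.Constant, ast.Attribute, ast.Call, ast.Subscript)


def fmt(node, parent_prec=0):
    """Render `node`, adding parentheses only where precedence requires them."""
    if isinstance(node, ast.BinOp) and type(node.op) in PRECEDENCE:
        op = type(node.op)
        prec = PRECEDENCE[op]
        if op is ast.Pow:  # ** groups right to left
            left, right = fmt(node.left, prec + 1), fmt(node.right, prec)
        else:  # the other operators group left to right
            left, right = fmt(node.left, prec), fmt(node.right, prec + 1)
        sep = "" if op in TIGHT else " "
        text = f"{left}{sep}{SYMBOL[op]}{sep}{right}"
    elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        prec = UNARY_PRECEDENCE
        text = "-" + fmt(node.operand, prec)
    else:
        text = ast.unparse(node)
        # Atoms never need parentheses; anything else is wrapped conservatively.
        return text if isinstance(node, ATOMS) or parent_prec == 0 else f"({text})"
    return f"({text})" if prec < parent_prec else text


def format_expression(source):
    """Reformat a single expression given as a string."""
    return fmt(ast.parse(source, mode="eval").body)


for src in ("x * 2 - 1", "(a + b) * (a - b)", "2 * x ** 2 + 3 * x ** (2 / 3)"):
    print(format_expression(src))
```

The printed lines match the corresponding right-hand sides in the listing above. Line splitting, which is the harder part of the discussion, is left out entirely.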

stefanv · July 15, 2021, 8:42pm

Personally, the only operator that really irks me with a space around it is **.

Writing 2*x feels natural, but in the example above I’d expect * np.prod instead of *np.prod. So, it varies even in my own head on a case-by-case basis.

stefanv · November 10, 2021, 7:44pm

Output from yapf, starting from the top output:

i = i + 1
submitted += 1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
dfdx = sign * (-2 * x + 2 * y + 2)
result = 2 * x**2 + 3 * x**(2 / 3)
y = 4 * x**2 + 2 * x + 1
c_i1j = (1.0 / n**2.0 * np.prod(
    0.5 * (2.0 + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)),
    axis=1))

Slightly different than when starting from the input:

i = i + 1
submitted += 1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
dfdx = sign * (-2 * x + 2 * y + 2)
result = 2 * x**2 + 3 * x**(2 / 3)
y = 4 * x**2 + 2 * x + 1
c_i1j = (1. / n**2. *
         np.prod(0.5 *
                 (2. + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)),
                 axis=1))

yapf also allows formatting of diffs only, and they have a pre-commit hook (although the latter has not been updated to include the diff-only handling).

ev-br · November 10, 2021, 7:50pm

I’d suggest following PEP8 in calling the examples good/bad, not correct/wrong :-).

It’s going to vary somewhat anyway, so I really hope these are recommendations, not normative prescriptions.

stefanv · November 10, 2021, 7:57pm

I think we were looking at this from the angle of: is there a tool that will format the code reasonably well so we never have to think of this problem ever again. So far, yapf seems the most promising.

stefanv · November 10, 2021, 8:09pm

@mbussonn Do you have any experience with yapf? I see you’ve worked on darker.

mbussonn · November 11, 2021, 11:11pm

I’ve tried yapf a bit in the past but have not touched the codebase much. One of the issues is that yapf has (had?) many configuration knobs, and the question is then which value to use for each of them.

Personally I’ve also struggled with readability in formulas, and I think that many times the correct format is “it depends”, especially since sometimes intermediate variables and how you write an equation may affect performance.

tupui · February 10, 2022, 8:43am

FYI, black now tries to remove spaces around **. See The Black code style - Black 22.1.0 documentation

rossbar · February 10, 2022, 4:13pm

I noticed this the other day so thought I’d mention it here: IPython now auto-formats code with black by default if black is found in the current env: 8.x Series — IPython 8.0.1 documentation

tupui · February 10, 2022, 4:44pm

Good point!

(Even if there was some tension about this on Twitter coming from people doing training… I am personally happy about the change (and I support Matthias). Like, why would you ever care about teaching a style if it’s getting fixed? In Go and Rust the community uses the same formatter and there is just no discussion. Anyway.)

And the list of major projects using it is growing. And now that Black is officially not in beta anymore, it will just keep growing.

I also see a lot of PRs which are auto-formatted with Black. In the end it’s going to be more work for maintainers to actually adjust changes from Black if everyone has Black in their IDE.

mattip · August 10, 2022, 5:07pm

Over at NumPy, there is a PR to add a pre-commit hook that uses black’s formatting rules. Is there any interest in trying to approach the black dev team as a community and request some changes? I think we may have a voice together that can outweigh the “stick to PEP 8 rules” mindset I see in the math formatting issue opened in black a few years ago.

tupui · August 10, 2022, 5:29pm

Yes, so I have to write something up, but my plan is actually to propose a SPEC here. I already had some initial discussions with Łukasz Langa (Black’s author and Python core dev), who basically said we just need to agree as a whole scientific community.

stefanv · August 10, 2022, 6:26pm

I’ll be back September 1st. Would that be too late for us to meet to try and come up with some suggested rules? Perhaps we can do some upfront homework by then already.

tgross35 · August 10, 2022, 7:28pm

Since the issue linked by @tupui above (Allow no space between operator ** · Issue #538 · psf/black · GitHub), black’s formatting is better for the ** operator. It currently produces this:

i = i + 1
submitted += 1
x = x * 2 - 1
hypot2 = x * x + y * y
c = (a + b) * (a - b)
dfdx = sign * (-2 * x + 2 * y + 2)
result = 2 * x**2 + 3 * x ** (2 / 3)
y = 4 * x**2 + 2 * x + 1
c_i1j = (
    1.0
    / n**2.0
    * np.prod(
        0.5 * (2.0 + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)), axis=1
    )
)

So the remaining points of contention from the original issue are:

1. If operators with different priorities are used, add whitespace around the operators with the lowest priority(ies).

2. There is no space before and after the operators * and /. The only exception is if the expression consists of a single operator linking two groups.

3. When splitting an equation, new lines should start with the operator linking the previous and next logical block. A line holding only a single digit or only a bracket is forbidden. Use the available horizontal space as much as possible.

Could items 1 & 2 be summarized as “Remove spacing between * or / and ‘simple’ operands if there are lower-priority operators (+, -) in the expression”?

Assuming that a “simple” operand is defined the same way as it is in the ** spacing issue:

an operand is considered “simple” if it’s only a NAME, numeric CONSTANT, or attribute access (chained attribute access is allowed), with or without a preceding unary operator.

Though whether this is beneficial enough to be worth consideration is debatable, as mentioned by Stefan.
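The quoted definition of a “simple” operand translates fairly directly into a check on the parsed syntax tree. The sketch below follows the wording quoted here, not Black’s actual implementation; `is_simple_operand` and `simple` are hypothetical names.

```python
import ast


def is_simple_operand(node):
    """NAME, numeric CONSTANT or (chained) attribute access, optionally
    preceded by one unary operator."""
    if isinstance(node, ast.UnaryOp):
        node = node.operand  # allow a single leading unary operator
    if isinstance(node, ast.Constant):
        # Numeric constants only; bool is excluded although it subclasses int.
        return (isinstance(node.value, (int, float, complex))
                and not isinstance(node.value, bool))
    while isinstance(node, ast.Attribute):
        node = node.value  # walk down a chain such as a.b.c
    return isinstance(node, ast.Name)


def simple(source):
    """Convenience wrapper: parse one expression and test it."""
    return is_simple_operand(ast.parse(source, mode="eval").body)


for src in ("x", "-2.0", "np.pi", "(2 / 3)", "f(x)", "x[0]"):
    print(src, simple(src))
```

Under this definition `x`, `-2.0` and `np.pi` are simple, while `(2 / 3)`, `f(x)` and `x[0]` are not. That agrees with the output above, where `x**2` is hugged but `x ** (2 / 3)` keeps its spaces.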

Regarding issue 3, I would suggest instead deferring to the return & indent style that black currently uses, while attempting to keep numerical expressions within one line. Something like this:

c_i1j = 1.0 / n**2.0 * np.prod(
        0.5 * (2.0 + abs(z_ij[i1, :]) + abs(z_ij) - abs(z_ij[i1, :] - z_ij)),
        axis=1
    )

The main reasons for this are just consistency with the existing black format, and avoiding a lot of lost whitespace when an align point is in the middle of the screen.

tupui · August 11, 2022, 4:17pm

Hi all, I just created a summit proposal so that we can discuss this all together and try to find a solution: Mathematical expressions Summit
Cast a vote if you want this to happen!

rossbar · August 25, 2022, 4:22pm

Just a quick update - this topic has been a central discussion point in the last two NumPy triage meetings. There’s interest amongst the numpy developers in applying black to NumPy, but the mathematical expression formatting is one of the main blockers.

All that’s to say progress on formalizing a standard here would be a boon to moving the NumPy discussion forward.

jarrodmillman · October 16, 2023, 5:37pm

“The Ruff formatter is out as an Alpha release, starting from Ruff v0.0.289.”

This may be a good opportunity to influence a new formatter to be more aligned with the format scientific Python projects prefer. I made one comment that it would be great to get feedback / support on:

Also it may be worth raising the mathematical expression formatting issue there. In particular, if there are simple, small improvements to suggest. Also we could request a --scientific-python flag or something to toggle the formatting of mathematical expressions to one more aligned with scientific computing. But we would still need to agree on that format, so it would need some design discussion (perhaps as a SPEC).