Split type-checking into interface and implementation in parallel workers#21119

Open
ilevkivskyi wants to merge 16 commits into python:master from ilevkivskyi:intf-impl-parallel

Conversation

@ilevkivskyi
Member

The general idea is very straightforward: when doing type-checking, we first type-check only module top-levels and those functions/methods that define or infer externally visible variables. Then we write the cache and send the new interface hash back to the coordinator to unblock more SCCs early. This makes parallel type-checking ~25% faster.
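To illustrate the split on a hypothetical module (not code from the PR): the interface phase covers the module top level plus any definitions whose externally visible types must be inferred, while bodies of explicitly annotated functions can wait for the implementation phase.

```python
# Hypothetical module illustrating interface vs. implementation phases.

def compute() -> int:
    return 1

x = compute()      # inferred type of x is part of the module interface

def f() -> int:    # explicit signature: the body can be deferred to the
    return x + 1   # later implementation phase

def g():           # unannotated: its return type is externally visible,
    return x       # so it must be inferred during the interface phase

print(f(), g())
```

Here only `g`'s body blocks the interface hash; `f`'s body can be checked after downstream SCCs have already been unblocked.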

However, this simple idea surfaced multiple quirks and old hacks. I address some of them in this PR, but I decided to handle the rest in follow-up PR(s) to limit the size of this one.

First, important implementation details:

  • On each select() call, the coordinator collects all responses, both interface and implementation ones, and processes them as a single batch. This simplifies reasoning and shouldn't affect performance.
  • We need to write indirect dependencies to a separate cache file, since they are only known after processing function bodies. I combine them with error messages in files called foo.meta_ex.ff. Not 100% sure about the name, I couldn't find anything more meaningful.
  • Overload signatures are now processed as part of the top level in the type checker. This is a big change, but it is unavoidable, and it didn't cause any problems with the daemon.
  • Initializers (default values of function arguments) are now processed as part of the top levels (to match runtime semantics). Btw @hauntsaninja, you optimized them away in some cases; I am not sure this is safe in the presence of walrus, see e.g. testWalrus.
  • local_definitions() no longer yields methods of classes nested in functions. We add such methods both to the symbol table of their actual class and to the module top-level symbol table, so yielding them caused double-processing.
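On the walrus point, a minimal hypothetical example (not from the PR's test suite) of why default values can't always be optimized away: a walrus inside an initializer binds a name in the enclosing scope at function definition time, so skipping the default expression would change module-level state.

```python
# The default value is evaluated when the `def` statement runs, and the
# walrus target lands in the enclosing (module) scope.
def f(x: int = (y := 5)) -> int:
    return x

print(y)      # y is bound at module level as soon as f is defined
print(f())
```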

Now some smaller things I already fixed:

  • We used to have three scoping systems for tracking the current class in the type checker. One existed purely for TypeForm support. I think two are enough, so I deleted that one.
  • AwaitableGenerator return type wrapping used to happen during processing of the function body, which is obviously wrong.
  • Invalid function redefinitions sometimes caused duplicate errors in case of partial types/deferrals. Now they should not, as I explicitly skip them after emitting the first error.
  • Some generated methods were not marked as such. Now they are.

Finally, some remaining problems and how I propose to address them in followups:

  • Narrowing of final global variables is not preserved in functions anymore, see testNarrowingOfFinalPersistsInFunctions. Supporting this would be tricky/expensive: it would require preserving binder state at the point of each function definition and restoring it later. IMO this is a relatively niche edge case, and we can simply "un-support" it (there is a simple workaround: add an assert in the function body). To be clear, there are no problems with a much more common use of this feature: preserving narrowing in nested functions/lambdas.
  • Support for --disallow-incomplete-defs in plugins doesn't work, see testDisallowIncompleteDefsAttrsPartialAnnotations. I think this should not be hard to fix (with some dedicated cleaner support). I can do this in a follow-up PR soon.
  • Around a dozen incremental tests are skipped in parallel mode because the order of error messages is less stable now (which is expected). To be clear, we still group errors per module, but the order of modules is much less predictable now. If there are no objections, I am going to ignore the order of modules when comparing errors in incremental tests in a follow-up PR.
  • When inferred type variable variance is not ready, we fall back to covariance, see testPEP695InferVarianceNotReadyWhenNeeded. However, when processing function/method bodies in a later phase, variance is ready more often. Although this is an improvement, it creates an inconsistency between parallel mode and regular mode. I propose to address this by making the two-phase logic the default even without parallel checking, see below.
  • Finally, there are a few edge cases with --local-partial-types where behavior is different in parallel mode, see e.g. testLocalPartialTypesWithGlobalInitializedToNone. Again, the new behavior is IMO clearly better. However, it again creates an inconsistency with non-parallel mode. I propose to address this by enabling two-phase (interface then implementation) checking whenever --local-partial-types is enabled (globally, not per-file), even without parallel checking. Since --local-partial-types will be the default behavior soon (and hopefully the only behavior at some point), this will allow us to avoid discrepancies between parallel and regular checking. @JukkaL what do you think?
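For the first item in the list above, here is a minimal sketch (hypothetical code mirroring the described behavior, with a made-up `load` helper) of the narrowing edge case and the suggested workaround:

```python
from typing import Final, Optional

def load() -> Optional[int]:   # hypothetical helper
    return 42

x: Final[Optional[int]] = load()
assert x is not None   # module-level narrowing: x is int from here on

def f() -> int:
    # With split bodies, the module-level narrowing of a final global
    # may no longer be visible inside function bodies; the simple
    # workaround is to re-assert in the body:
    assert x is not None
    return x

print(f())
```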

@ilevkivskyi requested a review from JukkaL on March 31, 2026 at 18:34
@ilevkivskyi changed the title from "Split type-checking into interface and impplementation in parallel workers" to "Split type-checking into interface and implementation in parallel workers" on Mar 31, 2026
@ilevkivskyi
Member Author

Oh btw, @JukkaL I think there is a bug in misc/diff-cache.py that may cause spurious diffs, see a TODO I added.


@ilevkivskyi
Member Author

All things in (small) mypy_primer are either good or neutral.

@hauntsaninja
Collaborator

Could be worth adding a test for the discord.py improvement

@JukkaL
Collaborator

JukkaL commented Apr 1, 2026

I'm planning to test this on a big internal repo (probably tomorrow). I'll also try parallel checking again -- last time memory usage was too high to use many workers, but things should be better now.

@JukkaL
Collaborator

JukkaL commented Apr 2, 2026

I'm seeing mypy parallel run crashes with this PR when type checking the biggest internal codebase at work, but I'm not sure if they are caused by this -- this may just change the order of processing so that a pre-existing issue gets triggered. I will continue the investigation after the long weekend.

@ilevkivskyi
Member Author

@JukkaL can you post a traceback (and maybe a snippet of code where the crash happens)? It may well be that some implicit assumption breaks when type-checking functions after top levels.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Diff from mypy_primer, showing the effect of this PR on open source code:

discord.py (https://github.com/Rapptz/discord.py)
- discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "Literal[True]")  [assignment]
+ discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "T")  [assignment]
- discord/interactions.py:1109: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/interactions.py:1255: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/interactions.py:1645: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/webhook/async_.py:969: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]

cki-lib (https://gitlab.com/cki-project/cki-lib)
- cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" in typed context  [no-untyped-call]
+ cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" of "RefreshKerberosTicket" in typed context  [no-untyped-call]

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

The internal codebase generates some syntax errors because of an issue with the native parser. After working around the syntax errors, the parallel run completes, so the crashes may be related to the syntax errors. However, there are a handful of false positives. Also, this regresses performance: parallel checking with two workers is now about 10% slower than sequential checking on macOS. On master, parallel checking with two workers is about 13% faster (which is still not great).

When looking at top output while a parallel run is active, each worker process is at only about 80% to 90% CPU utilization. It's possible that the added communication/synchronization overhead slows things down, at least on macOS. I'll measure performance on Linux next. I will also try to reproduce the crashes and provide tracebacks.

@ilevkivskyi
Member Author

Also, this regresses performance -- now parallel checking with two workers is slower than sequential checking (about 10% slower), on macOS. On master parallel checking with two workers is about 13% faster (which is still not great).

TBH this is really weird. Can you try running with --dump-build-stats to understand why it is so slow? Even the master number is worse than what you mentioned in the very first PR #20280 (comment), even though self-check performance has improved ~50% since then.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

I used 2 workers above instead of 3 in my older comment. I can try using 3 workers as well, I think I should have enough RAM for it.

@ilevkivskyi
Member Author

Also you can check communication overhead using --num-workers=0 (in-process checking) vs --num-workers=1 (checking with one separate worker).

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

Ok, I will try these as well.

Here's the traceback I see on crash (full paths omitted but they don't seem relevant), using the PR with current master merged:

...
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.5ce97dd3ca3b954575311d0c1f361e97910ff04c
<...>/schemas.pyi: note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "mypy/build_worker/worker.py", line 105, in main
  File "mypy/build_worker/worker.py", line 177, in serve
  File "mypy/build.py", line 4785, in read
AssertionError

<...>/configuration/models.py: error: INTERNAL ERROR -- Please try using mypy master on GitHub:
https://mypy.readthedocs.io/en/stable/common_issues.html#using-a-development-mypy-build
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.5ce97dd3ca3b954575311d0c1f361e97910ff04c
<...>/configuration/models.py: note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "mypy/build_worker/worker.py", line 105, in main
  File "mypy/build_worker/worker.py", line 177, in serve
  File "mypy/build.py", line 4785, in read
AssertionError

There are two tracebacks, but I don't see the syntax errors mypy generates when doing sequential type checking.

The crash happens on this line:

class AckMessage(IPCMessage):
    ...
    @classmethod
    def read(cls, buf: ReadBuffer) -> AckMessage:
        assert read_tag(buf) == ACK_MESSAGE  # <<-- here
        return AckMessage()

@ilevkivskyi
Member Author

There are two tracebacks, but I don't see the syntax errors mypy generates when doing sequential type checking.

It is a bit surprising that this error happens because of a syntax error. Or does the error manifest during semantic analysis or later? Can you also post the output of sequential mypy with --native-parser?

Also, as a sanity check, can you check performance of self-check on Mac (compiled)? Be sure to run self-check from outside the mypy directory, otherwise workers will use the interpreted code that they find locally, i.e. use something like

MYPY_USE_MYPYC=1 pip install -U .
cd ..
rm -rf .mypy_cache/
time mypy --config-file mypy/mypy_self_check.ini mypy/mypy mypy/mypyc --num-workers=6 --native-parser --dump-build-stats

@ilevkivskyi
Member Author

ilevkivskyi commented Apr 8, 2026

I finally tried parallel checking on Mac, and yeah, it is disastrous. On my work laptop, on current master:

0 workers: 2.8 sec
1 worker: 7.6 sec
2 workers: 4.4 sec
4 workers: 2.9 sec
8 workers: 2.5 sec

I guess there is some fixed overhead per request on Mac or something.

Btw, this PR doesn't really change the amount of data sent (maybe a 1-2% increase max), but it makes twice as many requests.
I guess we can declare 2026 the year of Linux :-) Btw, just in case, do you have a personal Mac by any chance? Just to eliminate the possibility that this is caused by some firewall settings or some security software.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

Here are some more measurements on a M1 Max mac (using a huge internal repository).

First, baseline based on recent master (no split bodies) [mac]:

  • one process, old parser: 345s
  • one process, native parser: 326s
  • 1 worker: 378s
  • 2 workers: 288s
  • 3 workers: 244s

Second, with this PR (split bodies), with recent master merged [mac]:

  • one process, native parser: 321s (basically the same as master)
  • 1 worker: 421s
  • 2 workers: 350s
  • 3 workers: 295s

On Linux, using 1 worker was slower than zero workers with native parser by ~10%, compared to ~30% slowdown on macOS.

This codebase has plenty of parallelism available, so split bodies likely won't help much even if they didn't add any overhead.

Since the overhead for 1 worker when using split bodies is about twice as bad compared to master (at least on macOS), I assume it's related to the number of messages handled, and unrelated to the amount of data sent.

Ideas about how to make this better:

  • Send a batch of files at a time if many files are available in the coordinator (or are we doing this already?).
  • Only split bodies if file/SCC sizes are above some threshold to reduce the number of messages. The threshold could be different on macOS vs Linux.
  • Sort files in batches by size in the worker, and send one message per multiple files for tiny files.
  • Process/send message asynchronously if we aren't doing it (i.e. send message, immediately start processing next file in batch without waiting for response).
  • Use a more efficient IPC mechanism on macOS or micro-optimize the IPC somehow. Write an IPC microbenchmark and experiment with different options.
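On the last idea in the list above, a rough sketch of what such an IPC microbenchmark could look like (hypothetical code, not mypy's actual IPC layer; it uses a thread as the echo peer to measure raw pipe round-trip cost, whereas a real benchmark would use a worker process):

```python
import os
import threading
import time

def echo(rfd: int, wfd: int, n: int) -> None:
    # Echo peer: read each 64-byte request and send it straight back.
    for _ in range(n):
        data = os.read(rfd, 64)
        os.write(wfd, data)

def bench(n: int = 10_000) -> float:
    # One pipe per direction: request and reply.
    req_r, req_w = os.pipe()
    rep_r, rep_w = os.pipe()
    t = threading.Thread(target=echo, args=(req_r, rep_w, n))
    t.start()
    payload = b"x" * 64
    t0 = time.perf_counter()
    for _ in range(n):
        os.write(req_w, payload)
        os.read(rep_r, 64)
    elapsed = time.perf_counter() - t0
    t.join()
    for fd in (req_r, req_w, rep_r, rep_w):
        os.close(fd)
    return elapsed / n  # seconds per round trip

if __name__ == "__main__":
    print(f"{bench() * 1e6:.2f} us per round trip")
```

Running this on macOS vs Linux would give a baseline for how much of the per-message overhead is attributable to the OS rather than to mypy itself.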

@ilevkivskyi
Member Author

Ideas about how to make this better:

Some of these items we are already doing, and I am not going to do any of the rest. Performance on Linux seems good (btw, do you have numbers for Linux with multiple workers?). If you (or anyone else) wants to work on Mac, you can do it in your own time.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

I don't have full numbers for Linux, but here are the ones I have (for split bodies only):

  • one process (old parser): 698s
  • one process (new parser): 640s
  • 1 worker: 707s
  • 3 workers: 354s

The overhead from multiple processes is much smaller compared to macOS. It's likely faster than sequential on two workers already, which sounds like a reasonable baseline performance target.

I can continue working on macOS performance afterwards (doesn't block this PR). I have both personal and work mac laptops, so I can run measurements in a relatively clean environment without extra security software. I can measure parallel self check on my personal mac tomorrow (probably).

I'm also planning to create a parallel checking synthetic benchmark with many small files, to measure coordination overhead. We can also add separate benchmarks with larger files, but it looks like the small file one would be the most helpful at this point.

@ilevkivskyi
Member Author

A couple of observations:

  • I added some more performance stats locally, specifically around all send()/receive() calls. On both Linux and Mac the communication overhead is only 1-2% on self-check. So the problem may actually be with something else.
  • The very bad self-check performance I observed on Mac was actually caused by the fact that I have Python 3.14 there, while I normally use Python 3.12 on Linux. It turns out there were major changes in GC logic in Python 3.14, so that our GC freeze hack now works against us (more precisely, gc.freeze() calls get ~1000x slower if there are gc.unfreeze() calls in between them). By simply disabling the GC freeze hack I see a ~1.5x speed-up with 4 workers compared to sequential on Mac on current master (with no visible effect of the GC freeze hack on sequential cold runs on Python 3.14).
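For context, the freeze pattern in question looks roughly like this (a sketch of the general technique, not mypy's actual code); it is the repeated freeze/unfreeze cycle whose cost profile changed on Python 3.14:

```python
import gc

# Collect once, then move every surviving object into the permanent
# generation so later collections skip the long-lived object graph.
gc.collect()
gc.freeze()
print(gc.get_freeze_count() > 0)   # some objects are now frozen

# ... hot phase: freshly allocated objects are still tracked normally ...

# Returning frozen objects to the oldest generation is what interacts
# badly with subsequent gc.freeze() calls on Python 3.14.
gc.unfreeze()
print(gc.get_freeze_count())
```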

I can continue working on macOS performance afterwards (doesn't block this PR).

I think we should focus on landing the currently open PRs first, so that all building blocks are in place. In particular, this PR introduces (inevitable) semantic differences, and I think we should agree on what to do with them. I wrote a whole lot of discussion about this in the PR description. On the other hand, it is hard to keep so many moving parts in my head, so I am going to start merging soon.

I will make another couple of PRs: one for more performance stats, and one that tries to adjust the GC freeze hack so that it works on all Python versions for both sequential and parallel runs (this may be non-trivial).

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

I will focus on getting more information about the crash and the false positives that I mentioned above so that we can move forward with this PR soon.

Since we have ideas about addressing the macOS bottlenecks (even if we might not fully understand the problem yet), this can happen separately from this PR (and I can work on them).

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

Here's a simplification of one of the new false positives from the big internal codebase; it seems similar to the testLocalPartialTypesWithGlobalInitializedToNone case discussed in the PR description, and seems benign:

# mypy: local-partial-types

class C:
    x = None  # type: ignore[var-annotated]

    def f(self) -> object:
        if not C.x:
            C.x = 1  # New error here with split bodies
        return C.x

@ilevkivskyi
Member Author

@JukkaL Yeah, I think this one is fine. Btw, I get the same error with --allow-redefinition-new without split bodies. In some sense this split logic makes local partial types even a bit more "local".

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

Another simplified regression:

from typing import Any, Callable, overload

@overload
def option(*, callback: Callable[[str], object] = ...) -> Any: ...
@overload
def option(*, callback: Callable[[int], object] = ...) -> Any: ...
def option(**kwargs: object) -> None: pass

@option(callback=lambda x: [y for y in x])  # Error here
def f() -> None: pass

When using --num-workers=1, it generates this error (with --num-workers=0, there is no error):

t.py:9: error: "int" has no attribute "__iter__"; maybe "__int__"? (not iterable)  [attr-defined]

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

I have one more potential regression. I'll investigate it tomorrow.
