Split type-checking into interface and implementation in parallel workers#21119

Open
ilevkivskyi wants to merge 16 commits into python:master from ilevkivskyi:intf-impl-parallel

Conversation

@ilevkivskyi
Member

The general idea is very straightforward: when doing type-checking, we first type-check only module top-levels and those functions/methods that define or infer externally visible variables. Then we write the cache and send the new interface hash back to the coordinator to unblock more SCCs early. This makes parallel type-checking ~25% faster.
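To illustrate the split on a hypothetical module (not code from the PR): the interface phase covers the module top level plus any definitions whose externally visible types must be inferred, while bodies of explicitly annotated functions can wait for the implementation phase.

```python
# Hypothetical module illustrating interface vs. implementation phases.

def compute() -> int:
    return 1

x = compute()      # inferred type of x is part of the module interface

def f() -> int:    # explicit signature: the body can be deferred to the
    return x + 1   # later implementation phase

def g():           # unannotated: its return type is externally visible,
    return x       # so it must be inferred during the interface phase

print(f(), g())
```

Here only `g`'s body blocks the interface hash; `f`'s body can be checked after downstream SCCs have already been unblocked.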

However, this simple idea surfaced multiple quirks and old hacks. I address some of them in this PR, but I decided to handle the rest in follow-up PR(s) to limit the size of this one.

First, important implementation details:

  • On each select() call, the coordinator collects all responses, both interface and implementation ones, and processes them as a single batch. This simplifies reasoning and shouldn't affect performance.
  • We need to write indirect dependencies to a separate cache file, since they are only known after processing function bodies. I combine them with error messages in files called foo.meta_ex.ff. Not 100% sure about the name, I couldn't find anything more meaningful.
  • Overload signatures are now processed as part of the top level in the type checker. This is a big change, but it is unavoidable, and it didn't cause any problems with the daemon.
  • Initializers (default values of function arguments) are now processed as part of the top levels (to match runtime semantics). Btw @hauntsaninja, you optimized them away in some cases; I am not sure this is safe in the presence of walrus, see e.g. testWalrus.
  • local_definitions() no longer yields methods of classes nested in functions. We add such methods both to the symbol table of their actual class and to the module top-level symbol table, so yielding them caused double-processing.
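On the walrus point, a minimal hypothetical example (not from the PR's test suite) of why default values can't always be optimized away: a walrus inside an initializer binds a name in the enclosing scope at function definition time, so skipping the default expression would change module-level state.

```python
# The default value is evaluated when the `def` statement runs, and the
# walrus target lands in the enclosing (module) scope.
def f(x: int = (y := 5)) -> int:
    return x

print(y)      # y is bound at module level as soon as f is defined
print(f())
```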

Now some smaller things I already fixed:

  • We used to have three scoping systems for tracking the current class in the type checker. One existed purely for TypeForm support. I think two are enough, so I deleted that one.
  • AwaitableGenerator return type wrapping used to happen during processing of the function body, which is obviously wrong.
  • Invalid function redefinitions sometimes caused duplicate errors in case of partial types/deferrals. Now they should not, as I explicitly skip them after emitting the first error.
  • Some generated methods were not marked as such. Now they are.

Finally, some remaining problems and how I propose to address them in followups:

  • Narrowing of final global variables is not preserved in functions anymore, see testNarrowingOfFinalPersistsInFunctions. Supporting this would be tricky/expensive: it would require preserving binder state at the point of each function definition and restoring it later. IMO this is a relatively niche edge case, and we can simply "un-support" it (there is a simple workaround: add an assert in the function body). To be clear, there are no problems with a much more common use of this feature: preserving narrowing in nested functions/lambdas.
  • Support for --disallow-incomplete-defs in plugins doesn't work, see testDisallowIncompleteDefsAttrsPartialAnnotations. I think this should not be hard to fix (with some dedicated cleaner support). I can do this in a follow-up PR soon.
  • Around a dozen incremental tests are skipped in parallel mode because the order of error messages is less stable now (which is expected). To be clear, we still group errors per module, but the order of modules is much less predictable now. If there are no objections, I am going to ignore the order of modules when comparing errors in incremental tests in a follow-up PR.
  • When inferred type variable variance is not ready, we fall back to covariance, see testPEP695InferVarianceNotReadyWhenNeeded. However, when processing function/method bodies in a later phase, variance is ready more often. Although this is an improvement, it creates an inconsistency between parallel mode and regular mode. I propose to address this by making the two-phase logic the default even without parallel checking, see below.
  • Finally, there are a few edge cases with --local-partial-types where behavior is different in parallel mode, see e.g. testLocalPartialTypesWithGlobalInitializedToNone. Again, the new behavior is IMO clearly better. However, it again creates an inconsistency with non-parallel mode. I propose to address this by enabling two-phase (interface then implementation) checking whenever --local-partial-types is enabled (globally, not per-file), even without parallel checking. Since --local-partial-types will be the default behavior soon (and hopefully the only behavior at some point), this will allow us to avoid discrepancies between parallel and regular checking. @JukkaL what do you think?
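For the first item in the list above, here is a minimal sketch (hypothetical code mirroring the described behavior, with a made-up `load` helper) of the narrowing edge case and the suggested workaround:

```python
from typing import Final, Optional

def load() -> Optional[int]:   # hypothetical helper
    return 42

x: Final[Optional[int]] = load()
assert x is not None   # module-level narrowing: x is int from here on

def f() -> int:
    # With split bodies, the module-level narrowing of a final global
    # may no longer be visible inside function bodies; the simple
    # workaround is to re-assert in the body:
    assert x is not None
    return x

print(f())
```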

@ilevkivskyi requested a review from JukkaL on March 31, 2026 at 18:34
@ilevkivskyi changed the title from "Split type-checking into interface and impplementation in parallel workers" to "Split type-checking into interface and implementation in parallel workers" on Mar 31, 2026
@ilevkivskyi
Member Author

Oh btw, @JukkaL I think there is a bug in misc/diff-cache.py that may cause spurious diffs, see a TODO I added.


@ilevkivskyi
Member Author

All things in (small) mypy_primer are either good or neutral.

@hauntsaninja
Collaborator

Could be worth adding a test for the discord.py improvement

@JukkaL
Collaborator

JukkaL commented Apr 1, 2026

I'm planning to test this on a big internal repo (probably tomorrow). I'll also try parallel checking again -- last time memory usage was too high to use many workers, but things should be better now.

@JukkaL
Collaborator

JukkaL commented Apr 2, 2026

I'm seeing mypy parallel run crashes with this PR when type checking the biggest internal codebase at work, but I'm not sure if they are caused by this -- this may just change the order of processing so that a pre-existing issue gets triggered. I will continue the investigation after the long weekend.

@ilevkivskyi
Member Author

@JukkaL can you post a traceback (and maybe a snippet of code where the crash happens)? It may well be that some implicit assumption breaks when type-checking functions after top levels.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Diff from mypy_primer, showing the effect of this PR on open source code:

discord.py (https://github.com/Rapptz/discord.py)
- discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "Literal[True]")  [assignment]
+ discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "T")  [assignment]
- discord/interactions.py:1109: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/interactions.py:1255: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/interactions.py:1645: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]
- discord/webhook/async_.py:969: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float")  [assignment]

cki-lib (https://gitlab.com/cki-project/cki-lib)
- cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" in typed context  [no-untyped-call]
+ cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" of "RefreshKerberosTicket" in typed context  [no-untyped-call]

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

The internal codebase generates some syntax errors because of an issue with the native parser. After working around the syntax errors, the parallel run completes, so the crashes may be related to the syntax errors. However, there are a handful of false positives. Also, this regresses performance: parallel checking with two workers is now about 10% slower than sequential checking on macOS. On master, parallel checking with two workers is about 13% faster (which is still not great).

When looking at top output while a parallel run is active, each worker process is at only about 80% to 90% CPU utilization. It's possible that the added communication/synchronization overhead slows things down, at least on macOS. I'll measure performance on Linux next. I will also try to reproduce the crashes and provide tracebacks.

@ilevkivskyi
Member Author

Also, this regresses performance -- now parallel checking with two workers is slower than sequential checking (about 10% slower), on macOS. On master parallel checking with two workers is about 13% faster (which is still not great).

TBH this is really weird. Can you try running with --dump-build-stats to understand why it is so slow? Even the master number is worse than what you mentioned in the very first PR #20280 (comment), even though self-check performance has improved ~50% since then.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

I used 2 workers above instead of 3 in my older comment. I can try using 3 workers as well, I think I should have enough RAM for it.

@ilevkivskyi
Member Author

Also you can check communication overhead using --num-workers=0 (in-process checking) vs --num-workers=1 (checking with one separate worker).

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

Ok, I will try these as well.

Here's the traceback I see on crash (full paths omitted but they don't seem relevant), using the PR with current master merged:

...
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.5ce97dd3ca3b954575311d0c1f361e97910ff04c
<...>/schemas.pyi: note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "mypy/build_worker/worker.py", line 105, in main
  File "mypy/build_worker/worker.py", line 177, in serve
  File "mypy/build.py", line 4785, in read
AssertionError

<...>/configuration/models.py: error: INTERNAL ERROR -- Please try using mypy master on GitHub:
https://mypy.readthedocs.io/en/stable/common_issues.html#using-a-development-mypy-build
Please report a bug at https://github.com/python/mypy/issues
version: 2.0.0+dev.5ce97dd3ca3b954575311d0c1f361e97910ff04c
<...>/configuration/models.py: note: use --pdb to drop into pdb
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "mypy/build_worker/worker.py", line 105, in main
  File "mypy/build_worker/worker.py", line 177, in serve
  File "mypy/build.py", line 4785, in read
AssertionError

There are two tracebacks, but I don't see the syntax errors mypy generates when doing sequential type checking.

The crash happens on this line:

class AckMessage(IPCMessage):
    ...
    @classmethod
    def read(cls, buf: ReadBuffer) -> AckMessage:
        assert read_tag(buf) == ACK_MESSAGE  # <<-- here
        return AckMessage()

@ilevkivskyi
Member Author

There are two tracebacks, but I don't see the syntax errors mypy generates when doing sequential type checking.

It is a bit surprising that this error happens because of a syntax error. Or does the error manifest during semantic analysis or later? Can you also post the output of sequential mypy with --native-parser?

Also, as a sanity check, can you check performance of self-check on Mac (compiled)? Be sure to run self-check from outside the mypy directory, otherwise workers will use the interpreted code that they find locally, i.e. use something like

MYPY_USE_MYPYC=1 pip install -U .
cd ..
rm -rf .mypy_cache/
time mypy --config-file mypy/mypy_self_check.ini mypy/mypy mypy/mypyc --num-workers=6 --native-parser --dump-build-stats

@ilevkivskyi
Member Author

ilevkivskyi commented Apr 8, 2026

I finally tried parallel checking on Mac, and yeah, it is disastrous. On my work laptop, on current master:

0 workers: 2.8 sec
1 worker: 7.6 sec
2 workers: 4.4 sec
4 workers: 2.9 sec
8 workers: 2.5 sec

I guess there is some fixed overhead per request on Mac or something.

Btw, this PR doesn't really change the amount of data sent (maybe a 1-2% increase max), but it makes twice as many requests.
I guess we can declare 2026 the year of Linux :-) Btw, just in case, do you have a personal Mac by any chance? Just to eliminate the possibility that this is caused by some firewall settings or some security software.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

Here are some more measurements on a M1 Max mac (using a huge internal repository).

First, baseline based on recent master (no split bodies) [mac]:

  • one process, old parser: 345s
  • one process, native parser: 326s
  • 1 worker: 378s
  • 2 workers: 288s
  • 3 workers: 244s

Second, with this PR (split bodies), with recent master merged [mac]:

  • one process, native parser: 321s (basically the same as master)
  • 1 worker: 421s
  • 2 workers: 350s
  • 3 workers: 295s

On Linux, using 1 worker was slower than zero workers with native parser by ~10%, compared to ~30% slowdown on macOS.

This codebase has plenty of parallelism available, so split bodies likely won't help much even if they didn't add any overhead.

Since the overhead for 1 worker when using split bodies is about twice as bad compared to master (at least on macOS), I assume it's related to the number of messages handled, and unrelated to the amount of data sent.

Ideas about how to make this better:

  • Send a batch of files at a time if many files are available in the coordinator (or are we doing this already?).
  • Only split bodies if file/SCC sizes are above some threshold to reduce the number of messages. The threshold could be different on macOS vs Linux.
  • Sort files in batches by size in the worker, and send one message per multiple files for tiny files.
  • Process/send message asynchronously if we aren't doing it (i.e. send message, immediately start processing next file in batch without waiting for response).
  • Use a more efficient IPC mechanism on macOS or micro-optimize the IPC somehow. Write an IPC microbenchmark and experiment with different options.
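On the last idea in the list above, a rough sketch of what such an IPC microbenchmark could look like (hypothetical code, not mypy's actual IPC layer; it uses a thread as the echo peer to measure raw pipe round-trip cost, whereas a real benchmark would use a worker process):

```python
import os
import threading
import time

def echo(rfd: int, wfd: int, n: int) -> None:
    # Echo peer: read each 64-byte request and send it straight back.
    for _ in range(n):
        data = os.read(rfd, 64)
        os.write(wfd, data)

def bench(n: int = 10_000) -> float:
    # One pipe per direction: request and reply.
    req_r, req_w = os.pipe()
    rep_r, rep_w = os.pipe()
    t = threading.Thread(target=echo, args=(req_r, rep_w, n))
    t.start()
    payload = b"x" * 64
    t0 = time.perf_counter()
    for _ in range(n):
        os.write(req_w, payload)
        os.read(rep_r, 64)
    elapsed = time.perf_counter() - t0
    t.join()
    for fd in (req_r, req_w, rep_r, rep_w):
        os.close(fd)
    return elapsed / n  # seconds per round trip

if __name__ == "__main__":
    print(f"{bench() * 1e6:.2f} us per round trip")
```

Running this on macOS vs Linux would give a baseline for how much of the per-message overhead is attributable to the OS rather than to mypy itself.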

@ilevkivskyi
Member Author

Ideas about how to make this better:

Some of these items we are already doing, and I am not going to do any of the rest. Performance on Linux seems good (btw, do you have numbers for Linux with multiple workers?). If you (or anyone else) wants to work on Mac, you can do it in your own time.

@JukkaL
Collaborator

JukkaL commented Apr 8, 2026

I don't have full numbers for Linux, but here are the ones I have (for split bodies only):

  • one process (old parser): 698s
  • one process (new parser): 640s
  • 1 worker: 707s
  • 3 workers: 354s

The overhead from multiple processes is much smaller compared to macOS. It's likely faster than sequential on two workers already, which sounds like a reasonable baseline performance target.

I can continue working on macOS performance afterwards (doesn't block this PR). I have both personal and work mac laptops, so I can run measurements in a relatively clean environment without extra security software. I can measure parallel self check on my personal mac tomorrow (probably).

I'm also planning to create a parallel checking synthetic benchmark with many small files, to measure coordination overhead. We can also add separate benchmarks with larger files, but it looks like the small file one would be the most helpful at this point.

@ilevkivskyi
Member Author

A couple of observations:

  • I added some more performance stats locally, specifically around all send()/receive() calls. On both Linux and Mac the communication overhead is only 1-2% on self-check. So the problem may actually be with something else.
  • The very bad self-check performance I observed on Mac was actually caused by the fact that I have Python 3.14 there, while I normally use Python 3.12 on Linux. It turns out there were major changes in GC logic in Python 3.14, so that our GC freeze hack now works against us (more precisely, gc.freeze() calls get ~1000x slower if there are gc.unfreeze() calls in between them). By simply disabling the GC freeze hack I see a ~1.5x speed-up with 4 workers compared to sequential on Mac on current master (with no visible effect of the GC freeze hack on sequential cold runs on Python 3.14).
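For context, the freeze pattern in question looks roughly like this (a sketch of the general technique, not mypy's actual code); it is the repeated freeze/unfreeze cycle whose cost profile changed on Python 3.14:

```python
import gc

# Collect once, then move every surviving object into the permanent
# generation so later collections skip the long-lived object graph.
gc.collect()
gc.freeze()
print(gc.get_freeze_count() > 0)   # some objects are now frozen

# ... hot phase: freshly allocated objects are still tracked normally ...

# Returning frozen objects to the oldest generation is what interacts
# badly with subsequent gc.freeze() calls on Python 3.14.
gc.unfreeze()
print(gc.get_freeze_count())
```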

I can continue working on macOS performance afterwards (doesn't block this PR).

I think we should focus on landing the currently open PRs first, so that all building blocks are in place. In particular, this PR introduces (inevitable) semantic differences, and I think we should agree on what to do with them. I wrote a whole lot of discussion about this in the PR description. On the other hand, it is hard to keep so many moving parts in my head, so I am going to start merging soon.

I will make another couple of PRs: one for more performance stats, and one that tries to adjust the GC freeze hack so that it works on all Python versions for both sequential and parallel runs (this may be non-trivial).

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

I will focus on getting more information about the crash and the false positives that I mentioned above so that we can move forward with this PR soon.

Since we have ideas about addressing the macOS bottlenecks (even if we might not fully understand the problem yet), this can happen separately from this PR (and I can work on them).

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

Here's a simplification of one of the new false positives from the big internal codebase; it seems similar to the testLocalPartialTypesWithGlobalInitializedToNone case discussed in the PR description, and seems benign:

# mypy: local-partial-types

class C:
    x = None  # type: ignore[var-annotated]

    def f(self) -> object:
        if not C.x:
            C.x = 1  # New error here with split bodies
        return C.x

@ilevkivskyi
Member Author

@JukkaL Yeah, I think this one is fine. Btw, I get the same error with --allow-redefinition-new without split bodies. In some sense this split logic makes local partial types even a bit more "local".

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

Another simplified regression:

from typing import Any, Callable, overload

@overload
def option(*, callback: Callable[[str], object] = ...) -> Any: ...
@overload
def option(*, callback: Callable[[int], object] = ...) -> Any: ...
def option(**kwargs: object) -> None: pass

@option(callback=lambda x: [y for y in x])  # Error here
def f() -> None: pass

When using --num-workers=1, it generates this error (with --num-workers=0, there is no error):

t.py:9: error: "int" has no attribute "__iter__"; maybe "__int__"? (not iterable)  [attr-defined]

@JukkaL
Collaborator

JukkaL commented Apr 9, 2026

I have one more potential regression. I'll investigate it tomorrow.
