Split type-checking into interface and implementation in parallel workers #21119
ilevkivskyi wants to merge 16 commits into python:master
Conversation
|
Oh btw, @JukkaL I think there is a bug in
|
All things in (small)
|
Could be worth adding a test for the discord.py improvement.
|
I'm planning to test this on a big internal repo (probably tomorrow). I'll also try parallel checking again -- last time memory usage was too high to use many workers, but things should be better now.
|
I'm seeing mypy parallel run crashes with this PR when type checking the biggest internal codebase at work, but I'm not sure if they are caused by this -- this may just change the order of processing so that a pre-existing issue gets triggered. I will continue the investigation after the long weekend.
|
@JukkaL can you post a traceback (and maybe a snippet of code where the crash happens)? It may well be that some implicit assumption breaks when type-checking functions after top-levels.
|
Diff from mypy_primer, showing the effect of this PR on open source code:

```diff
discord.py (https://github.com/Rapptz/discord.py)
- discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "Literal[True]") [assignment]
+ discord/backoff.py:63: error: Incompatible default for parameter "integral" (default has type "Literal[False]", parameter has type "T") [assignment]
- discord/interactions.py:1109: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float") [assignment]
- discord/interactions.py:1255: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float") [assignment]
- discord/interactions.py:1645: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float") [assignment]
- discord/webhook/async_.py:969: error: Incompatible default for parameter "delay" (default has type "float | None", parameter has type "float") [assignment]
cki-lib (https://gitlab.com/cki-project/cki-lib)
- cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" in typed context [no-untyped-call]
+ cki_lib/krb_ticket_refresher.py:26: error: Call to untyped function "_close_to_expire_ticket" of "RefreshKerberosTicket" in typed context [no-untyped-call]
```
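For context, here is a minimal sketch of the pattern behind the changed discord.py message (my reconstruction, not the actual discord.py code): a boolean `TypeVar` parameter with a literal default, where the error now reports the declared type `T` instead of a solved literal.

```python
from typing import TypeVar

T = TypeVar("T", bound=bool)

# mypy flags the default here: "Literal[False]" is not compatible with "T",
# since a caller could solve T to Literal[True]. The PR changes the message
# to name "T" rather than a solved literal type.
def backoff(integral: T = False) -> T:  # error: Incompatible default [assignment]
    return integral
```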
|
|
The internal codebase generates some syntax errors because of an issue with the native parser. After working around the syntax errors, the parallel run completes, so the crashes may be related to the syntax errors. However, there are a handful of false positives. This also regresses performance -- on macOS, parallel checking with two workers is now about 10% slower than sequential checking, whereas on master parallel checking with two workers is about 13% faster (which is still not great). When looking at
TBH this is really weird. Can you try running with
|
I used 2 workers above instead of 3 in my older comment. I can try using 3 workers as well; I think I should have enough RAM for it.
|
Also, you can check communication overhead using
|
OK, I will try these as well. Here's the traceback I see on crash (full paths omitted but they don't seem relevant), using the PR with current master merged. There are two tracebacks, but I don't see the syntax errors mypy generates when doing sequential type checking. The crash happens on this line:

```python
class AckMessage(IPCMessage):
    ...

    @classmethod
    def read(cls, buf: ReadBuffer) -> AckMessage:
        assert read_tag(buf) == ACK_MESSAGE  # <<-- here
        return AckMessage()
```
It is a bit surprising that this error happens because of a syntax error. Or is it an error that manifests during semantic analysis or later? Can you also post the output of a sequential run? Also, as a sanity check, can you check performance of self-check on Mac (compiled)? Be sure to run self-check not from inside the mypy directory, otherwise workers will use the interpreted code that they find locally, i.e. use something like
|
I finally tried parallel checking on Mac, and yeah, it is disastrous. On my work laptop, on current master:

0 workers: 2.8 sec

I guess there is some fixed overhead per request on Mac or something. Btw, this PR doesn't really change the amount of data sent (maybe a 1-2% increase at most), but it makes twice as many requests.
|
Here are some more measurements on a M1 Max mac (using a huge internal repository). First, baseline based on recent master (no split bodies) [mac]:
Second, with this PR (split bodies), with recent master merged [mac]:
On Linux, using 1 worker was ~10% slower than zero workers with the native parser, compared to a ~30% slowdown on macOS. In this codebase we can have plenty of parallelism, so split bodies likely won't help much even if they didn't add any overhead. Since the 1-worker overhead with split bodies is about twice as bad as on master (at least on macOS), I assume it's related to the number of messages handled, not the amount of data sent. Ideas about how to make this better:
|
Some of these items we are already doing, and I am not going to do any of the rest. Performance on Linux seems good (btw, do you have numbers for Linux with multiple workers?). If you (or anyone else) wants to work on Mac performance, you can do it in your own time.
|
I don't have full numbers for Linux, but here are the ones I have (for split bodies only):
The overhead from multiple processes is much smaller than on macOS. It's likely already faster than sequential checking with two workers, which sounds like a reasonable baseline performance target. I can continue working on macOS performance afterwards (it doesn't block this PR). I have both personal and work Mac laptops, so I can run measurements in a relatively clean environment without extra security software. I can measure parallel self-check on my personal Mac tomorrow (probably). I'm also planning to create a synthetic parallel-checking benchmark with many small files, to measure coordination overhead. We can also add separate benchmarks with larger files, but the small-file one looks like it would be the most helpful at this point.
|
A couple of observations:
I think we should focus on landing the currently open PRs first, so that all the building blocks are in place. In particular, this PR introduces (inevitable) semantic differences, and I think we should agree on what to do with them; I wrote a whole lot of discussion about this in the PR description. On the other hand, it is hard to keep so many moving parts in my head, so I am going to start merging soon. I will make another couple of PRs, one for more performance stats, and will try to adjust the GC freeze hack so that it works on all Python versions for both sequential and parallel runs (this may be non-trivial).
|
I will focus on getting more information about the crash and the false positives that I mentioned above, so that we can move forward with this PR soon. Since we have ideas about addressing the macOS bottlenecks (even if we might not fully understand the problem yet), that work can happen separately from this PR (and I can work on it).
|
Here's a simplification of one of the new false positives from the big internal codebase, which seems similar to an earlier case:

```python
# mypy: local-partial-types

class C:
    x = None  # type: ignore[var-annotated]

    def f(self) -> object:
        if not C.x:
            C.x = 1  # New error here with split bodies
        return C.x
```
|
@JukkaL Yeah, I think this one is fine. Btw, I get the same error with
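For reference, a sketch (my own example, not from the thread) of how such code can be made independent of checking order: giving the attribute an explicit annotation avoids partial-type inference entirely, so the assignment is accepted in both modes.

```python
from typing import Optional

class C:
    # An explicit annotation (instead of a bare `x = None` plus an ignore)
    # means mypy never has to infer a partial None type for the attribute.
    x: Optional[int] = None

    def f(self) -> object:
        if not C.x:
            C.x = 1  # no error: the declared type already allows int
        return C.x
```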
|
Another simplified regression:

```python
from typing import Any, Callable, overload

@overload
def option(*, callback: Callable[[str], object] = ...) -> Any: ...
@overload
def option(*, callback: Callable[[int], object] = ...) -> Any: ...
def option(**kwargs: object) -> None: pass

@option(callback=lambda x: [y for y in x])  # Error here
def f() -> None: pass
```

When using
|
I have one more potential regression. I'll investigate it tomorrow.
The general idea is very straightforward: when type-checking, we first check only module top-levels and those functions/methods that define/infer externally visible variables. We then write the cache and send the new interface hash back to the coordinator to unblock more SCCs early. This makes parallel type-checking ~25% faster.
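The flow above can be sketched as follows. This is a toy illustration of the two-phase idea, not mypy's actual API: all names (`Module`, `Coordinator`, `check_module`, etc.) are made up for the sketch, and "type-check" steps are stubbed out.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    top_level: str      # source of the module top level (the interface)
    bodies: list[str]   # sources of function/method bodies

def interface_hash(mod: Module) -> str:
    # Hash only what other modules can observe.
    return hashlib.sha256(mod.top_level.encode()).hexdigest()

@dataclass
class Coordinator:
    ready: list[str] = field(default_factory=list)

    def on_interface_done(self, name: str, iface_hash: str) -> None:
        # Dependent SCCs can be scheduled as soon as the interface is
        # known, before this module's bodies are checked.
        self.ready.append(name)

def check_module(mod: Module, coord: Coordinator) -> None:
    # Phase 1: top level plus inference of externally visible variables.
    iface = interface_hash(mod)
    coord.on_interface_done(mod.name, iface)  # unblocks dependents early
    # Phase 2: function/method bodies (errors here are purely local).
    for _body in mod.bodies:
        pass  # type-check the body
```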
However, this simple idea surfaced multiple quirks and old hacks. I address some of them in this PR, but decided to handle the rest in follow-up PR(s) to limit the size of this one.
First, important implementation details:
- On each `select()` call, the coordinator collects all responses, both interface and implementation ones, and processes them as a single batch. This simplifies reasoning and shouldn't affect performance.
- `foo.meta_ex.ff`. Not 100% sure about the name, couldn't find anything more meaningful.
- `testWalrus`.
- `local_definitions()` now does not yield methods of classes nested in functions. We add such methods to both the symbol table of their actual class and the module top-level symbol table, thus causing double-processing.

Now some smaller things I already fixed:
- `TypeForm` support. I think two is enough, so I deleted the last one.
- `AwaitableGenerator` return type wrapping used to happen during processing of the function body, which is obviously wrong.

Finally, some remaining problems and how I propose to address them in followups:
- `testNarrowingOfFinalPersistsInFunctions`. Supporting this will be tricky/expensive: it would require preserving the binder state at the point of each function definition and restoring it later. IMO this is a relatively niche edge case, and we can simply "un-support" it (there is a simple workaround: add an assert in the function body). To be clear, there are no problems with the much more common use of this feature: preserving narrowing in nested functions/lambdas.
- `--disallow-incomplete-defs` in plugins doesn't work, see `testDisallowIncompleteDefsAttrsPartialAnnotations`. I think this should not be hard to fix (with some dedicated cleaner support). I can do this in a follow-up PR soon.
- `testPEP695InferVarianceNotReadyWhenNeeded`. However, when processing function/method bodies in a later phase, variance is ready more often. Although this is an improvement, it creates an inconsistency between parallel mode and regular mode. I propose to address this by making the two-phase logic the default even without parallel checking, see below.
- `--local-partial-types` behavior is different in parallel mode, see e.g. `testLocalPartialTypesWithGlobalInitializedToNone`. Again, the new behavior is IMO clearly better. However, it again creates an inconsistency with non-parallel mode. I propose to address this by enabling two-phase (interface then implementation) checking whenever `--local-partial-types` is enabled (globally, not per-file), even without parallel checking. Since `--local-partial-types` will be the default behavior soon (and hopefully the only behavior at some point), this will allow us to avoid discrepancies between parallel and regular checking. @JukkaL what do you think?
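To illustrate the niche case in the first bullet, here is a small sketch (my own example, not the PR's actual test): module-level narrowing of a `Final` no longer persists into function bodies checked in the later phase, and repeating the assert inside the body is the suggested workaround.

```python
from typing import Final, Optional

x: Final[Optional[int]] = 42
assert x is not None  # narrows x to int at the top level

def f() -> int:
    # With two-phase checking, the top-level narrowing may no longer be
    # visible when this body is checked. The simple workaround mentioned
    # above: repeat the assert inside the function body.
    assert x is not None
    return x + 1
```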