Conversation
Greptile Summary

This PR introduces a
Confidence Score: 4/5

Safe to merge after fixing the silent-host-ignore bug in Dimos.connect(); remaining findings are P2 style. One P1 defect: Dimos.connect(host="x") silently ignores the host and connects to localhost instead, which could cause confusing misdirected connections in production. The rest are P2 style/quality concerns (private API access in LocalModuleSource, lock held during network I/O).

Important files changed: dimos/porcelain/dimos.py (P1 logic bug in connect()), dimos/porcelain/local_module_source.py (P2 style)
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Dimos
    participant LocalModuleSource
    participant RemoteModuleSource
    participant ModuleCoordinator
    participant RpycServer
    participant Worker as PythonWorker (subprocess)
    Note over User,Worker: Local path (app.run)
    User->>Dimos: run(target)
    Dimos->>ModuleCoordinator: build(blueprint)
    ModuleCoordinator->>Worker: deploy / start modules
    ModuleCoordinator->>RpycServer: start() - coordinator port saved
    Dimos->>LocalModuleSource: created with coordinator ref
    User->>Dimos: app.skills.ping()
    Dimos->>LocalModuleSource: get_rpyc_module("StressTestModule")
    LocalModuleSource->>Worker: actor.start_rpyc() - worker port
    LocalModuleSource->>Worker: rpyc.connect(localhost, port)
    Worker-->>LocalModuleSource: module proxy
    LocalModuleSource-->>Dimos: module proxy
    Dimos-->>User: skill result
    Note over User,Worker: Remote path (Dimos.connect)
    User->>Dimos: connect(host, port)
    Dimos->>RemoteModuleSource: connect to coordinator RPyC port
    RemoteModuleSource->>RpycServer: list_modules() / get_module_endpoint(name)
    RpycServer->>ModuleCoordinator: delegate
    RemoteModuleSource->>Worker: rpyc.connect(host, worker_port)
    Worker-->>RemoteModuleSource: module proxy
    RemoteModuleSource-->>User: skill result
```
Reviews (1): Last reviewed commit: "feat(api): add porcelain api with connec..."
```python
if host is not None and port is not None:
    source: ModuleSource = RemoteModuleSource(host, port)
else:
    rpyc_port = get_most_recent_rpyc_port(run_id=run_id)
    source = RemoteModuleSource("localhost", rpyc_port)
```
host silently ignored when port is omitted
When only host is provided, the if host is not None and port is not None guard is False, so the code falls through to the registry lookup and connects to "localhost" — the caller-supplied host is silently dropped. Dimos.connect(host="robot.local") will connect to localhost, not robot.local, with no error raised.
Suggested change:

```diff
+if (host is None) != (port is None):
+    raise ValueError(
+        "host and port must both be provided together, or both omitted"
+    )
 if host is not None and port is not None:
     source: ModuleSource = RemoteModuleSource(host, port)
 else:
     rpyc_port = get_most_recent_rpyc_port(run_id=run_id)
     source = RemoteModuleSource("localhost", rpyc_port)
```
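The suggested guard can be exercised in isolation. Here's a minimal runnable sketch where `resolve_source` stands in for the body of `connect()` and the two source kinds are stubbed as tuples — the names are illustrative, not the actual dimos API:

```python
def resolve_source(host=None, port=None):
    """Reject a lone host or lone port instead of silently
    falling back to localhost (the P1 bug above)."""
    if (host is None) != (port is None):
        raise ValueError(
            "host and port must both be provided together, or both omitted"
        )
    if host is not None:
        # stands in for RemoteModuleSource(host, port)
        return ("remote", host, port)
    # stands in for the registry lookup + localhost fallback
    return ("registry", "localhost", None)
```

With this, `resolve_source(host="robot.local")` raises immediately instead of quietly connecting to localhost.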
```python
def get_rpyc_module(self, name: str) -> Any:
    with self._lock:
        cached = self._cache.get(name)
        if cached is not None and not cached[0].closed:
            return cached[1]

        with self._coordinator._modules_lock:
            for cls, proxy in self._coordinator._deployed_modules.items():
                if cls.__name__ == name:
                    actor = proxy.actor_instance  # type: ignore[attr-defined]
                    port = actor.start_rpyc()
                    module_id = actor._module_id
                    break
            else:
                raise KeyError(name)

        conn = rpyc.connect("localhost", port, config={"sync_request_timeout": 30})
        module = conn.root.get_module(module_id)
        self._cache[name] = (conn, module)
        return module
```
Lock held across network I/O; private API access
Two concerns here: (1) self._lock is held while rpyc.connect() runs, serializing all concurrent get_rpyc_module calls even for different module names. (2) ModuleCoordinator.get_module_endpoint() was added in this PR to expose exactly this lookup; calling it would remove the dependency on _modules_lock, _deployed_modules, and proxy.actor_instance.
Consider restructuring so the coordinator call and the RPyC connect happen before re-acquiring self._lock to write to the cache.
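One way to restructure is a check / connect / re-check pattern: hold the lock only for cache reads and writes, and do the slow work in between. A minimal runnable sketch, with `connect_fn` standing in for the coordinator lookup plus `rpyc.connect` (names are illustrative):

```python
import threading

class ModuleCache:
    """Release the lock during the (slow) connect; re-acquire to publish.
    Concurrent lookups for different module names no longer serialize."""

    def __init__(self, connect_fn):
        self._connect_fn = connect_fn
        self._lock = threading.Lock()
        self._cache = {}

    def get(self, name):
        with self._lock:
            cached = self._cache.get(name)
            if cached is not None:
                return cached
        # Network I/O happens outside the lock.
        conn = self._connect_fn(name)
        with self._lock:
            # Another thread may have won the race; keep the first
            # connection and discard ours.
            existing = self._cache.get(name)
            if existing is not None:
                conn.close()
                return existing
            self._cache[name] = conn
            return conn
```

The trade-off is that two threads racing on the *same* name may both connect, with one connection immediately closed; a per-name lock would avoid even that, at the cost of more bookkeeping.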
Force-pushed d3da17c to 889de47
Force-pushed 889de47 to cb78739
```python
if state.rpyc_server is not None:
    return WorkerResponse(result=state.rpyc_server.port)
WorkerRpycService._instances = state.instances
state.rpyc_server = ThreadedServer(
```
You listen on 0.0.0.0, which I don't actually mind (we assume the physical layer is our boundary), but we keep getting these security PRs.
It also worries me to have a host setting per feature, since exposing dimos to the network becomes super complicated with 50 host settings.
I'd propose we define a global setting for the "dimos listen host", defaulting to 127.0.0.1.
Your listen address is then that global setting, unless explicitly overridden by a config setting for your feature.
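The proposed precedence (feature override, then global setting, then a safe localhost default) could be sketched roughly like this — `DIMOS_LISTEN_HOST` and `resolve_listen_host` are hypothetical names, not an existing dimos config:

```python
import os

# Hypothetical global default; the actual setting name is up for discussion.
DEFAULT_LISTEN_HOST = "127.0.0.1"

def resolve_listen_host(feature_override=None):
    """Per-feature override wins; then the global 'dimos listen host'
    setting (modeled here as an env var); then the localhost default."""
    if feature_override is not None:
        return feature_override
    return os.environ.get("DIMOS_LISTEN_HOST", DEFAULT_LISTEN_HOST)
```

A feature that genuinely needs to be network-exposed would pass `feature_override="0.0.0.0"` explicitly; everything else inherits the one global knob.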
```python
self._module_transports: dict[type[ModuleBase], dict[str, PubSubTransport[Any]]] = {}
self._started = False
self._modules_lock = threading.RLock()
self._rpyc = RpycServer(self)
```
This creates competing communication channels for calling skills etc. We already have @rpc and that rpc client shim; I think that's an OK boundary for what you can access on remote modules, and it will keep working as we transition to zenoh, switch to remote, etc.
Now we have a totally separate new protocol API for calling skills/modules. Maybe RpycServer should only be on the coordinator, which holds those RPC module shims?
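The single-boundary shape the comment suggests can be sketched without rpyc at all: one dispatch entry point on the coordinator that forwards to the existing per-module shims. The names here (`register`, `call`) are illustrative, not dimos API:

```python
class CoordinatorDispatch:
    """One remote boundary: every call goes through the coordinator,
    which already holds an rpc client shim per deployed module."""

    def __init__(self):
        self._shims = {}  # module name -> rpc client shim

    def register(self, name, shim):
        self._shims[name] = shim

    def call(self, module, method, *args, **kwargs):
        shim = self._shims.get(module)
        if shim is None:
            raise KeyError(module)
        # Forward to the existing @rpc shim rather than opening a
        # separate RPyC channel per worker.
        return getattr(shim, method)(*args, **kwargs)
```

Workers then never expose their own listener; only the coordinator's endpoint is reachable, and the transport under the shims can change (e.g. to zenoh) without touching callers.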
leshy left a comment
Big comment on the actual protocol for talking to workers, but I'm approving as we iterate. The protocol is replaceable later, too.
Another
run() and restart() are only available in local mode. On a connected instance they raise NotImplementedError.
IMO we especially want those in remote mode:
- you start dimos, then claude iterates on the code on individual module, it wants to restart that module
- openclaw controls dimos, wants to deploy a new module
I'm not totally sure why these aren't "basically for free" and instead require extra work.
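The shape being asked for is routing rather than new machinery: restart() goes through whichever module source the instance holds, so a connected instance forwards the request instead of raising NotImplementedError. A sketch with a stubbed source — `restart_module` is a hypothetical method, not current dimos API:

```python
class DimosFacade:
    """restart() delegates to the source; a RemoteModuleSource would
    forward the request to the coordinator's endpoint, a local source
    would call the coordinator directly."""

    def __init__(self, source):
        self._source = source

    def restart(self, module_name):
        return self._source.restart_module(module_name)

class RecordingSource:
    """Stand-in for a source that forwards restart requests remotely."""

    def __init__(self):
        self.requests = []

    def restart_module(self, name):
        self.requests.append(name)
        return f"restarted {name}"
```

With this structure, an agent iterating on one module's code (the Claude / openclaw scenario above) calls the same restart() whether it is attached locally or over the network.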
Problem
Closes DIM-730
Solution
See the docs for how it works.
Breaking Changes
How to Test
Included in the docs.
Contributor License Agreement