Conversation
Greptile SummaryThis PR adds content-hash-based file change detection (
Confidence Score: 4/5Not safe to merge as-is: the cache-poisoning bug in One clear P1 defect in the primary feature path (
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[NativeModule.start] --> B[_maybe_build]
B --> C{rebuild_on_change\nset AND exe exists?}
C -- No --> D{exe exists?}
C -- Yes --> E["did_change(update=True ⚠️)\nWrites new hash to cache\nBEFORE build runs"]
E -- False\nno change --> D
E -- True\nchanged --> F[needs_rebuild = True]
F --> G[Run build_command]
D -- Yes, no rebuild --> H[return early / no build]
D -- No, or needs rebuild --> G
G -- returncode != 0 --> I["raise RuntimeError\n(cache already updated ⚠️)"]
G -- success --> J[exe verified exists]
J --> K["did_change again\n(redundant – cache already current)"]
K --> L[return]
I --> M["Next _maybe_build call:\ndid_change → False\n→ skips rebuild forever ⚠️"]
style E fill:#ffcccc,stroke:#cc0000
style I fill:#ffcccc,stroke:#cc0000
style M fill:#ffcccc,stroke:#cc0000
Reviews (1): Last reviewed commit: "native rebuild changes" | Re-trigger Greptile |
| try: | ||
| # Include the path so additions/deletions/renames are detected | ||
| h.update(str(fpath).encode()) | ||
| h.update(fpath.read_bytes()) |
There was a problem hiding this comment.
this will load a full file into memory, ok for source files, but will for sure cause a weird issue for someone down the line. maybe easy hack for this is to check the file size and simply not hash if larger then 500k-1Mb or something
There was a problem hiding this comment.
Switched to streaming to reduce ram overhead, still runs fast (still 100x or 1000x nix's change checking)
Added a max_file_size arg, files greater than max_file_size will switch to using the last modified date instead of actually hashing the file. For rebuild_on_change the user is always expected to only mention source code so probably a non-problem but max_file_size should help this tool be more general.
|
so not sure, haven't looked into details, but is there a reason we aren't relying nix build before each module start, since nix essentially does the above by itself? |
| extra_env: dict[str, str] = Field(default_factory=dict) | ||
| shutdown_timeout: float = 10.0 | ||
| log_format: LogFormat = LogFormat.TEXT | ||
| rebuild_on_change: list[PathEntry] | None = None |
There was a problem hiding this comment.
btw this is the first actual setting that's different between prod and dev, we def don't want this on deployments, so probably if toggled True should be done not in code but the local config file level or via local env vars.
There was a problem hiding this comment.
Whats the difference between prod and dev? I don't follow
There was a problem hiding this comment.
On production the code doesn't change so there's no need to rebuild. (I assume this is what Ivan meant.)
There was a problem hiding this comment.
yeah this hashes all the files in the dir which can be quite costly for larger files, def don't want to run in prod
There was a problem hiding this comment.
yes hashing all files is computationally expensive and we don't want prod deployments to do this during the startup
There was a problem hiding this comment.
all the files in the dir
No, it doesn't hash all the files in the dir
quite costly for larger files
It has the "if large use last edit time" feature, so no its not costly on large files (and why would people point it at large files?)
On my mac rebuild_on_change runs in under 1 millisecond for the largest nav native module. Absolutely unnoticeable.
I feel like you guys are just going off theory
Unfortunately yeah. I don't care much about speed, but rebuild_on_change is 1980x faster than nix's check on reboot, and merely 280x faster once nix has a "warm" cache. And that's for the mid360, which is absolutely tiny. For a native module like PathFollower being checked onboard the G1, nix takes 1037.89ms just to confirm nothing changed ( I'm not thrilled about Now that I think about it though, we should probably have it always rebuild unless there is a |
Twelve fixes from Paul-style code review of #1780: native_module.py: - _maybe_build: use did_change(update=False) + post-success update_cache so failed builds never poison the rebuild cache - delete stray "Source files changed" log line that fired on every call - stop(): per-instance _stop_lock; capture proc/watchdog refs under the lock, do signal/wait/join outside the lock to avoid the watchdog-self- join deadlock; second caller no-ops via _stopping - _read_log_stream: receive pid as a parameter instead of reaching into self._process.pid (TOCTOU with stop()) - _child_preexec_linux + ctypes hoisted to module scope under a sys.platform guard; no more re-import per start() - _clear_nix_executable: gracefully handle cwd=None (skip the walk entirely), use .resolve() comparison so a symlinked cwd still terminates the walk, refuse to walk past cwd's tree change_detect.py: - fold max_file_size into _hash_files digest so different thresholds against the same cache_name can't corrupt each other - new _locked_cache_file context manager — flock the .hash file directly in "a+" mode; no more orphan .lock sidecars accumulating in the cache dir; did_change/update_cache/clear_cache all share the helper Tests: - new test_should_rebuild_true_bypasses_change_check for the explicit "always rebuild" path - new test_failed_build_does_not_mark_cache_clean for the update=False retry semantics Build system: - tiledb darwin overlay: filter patches by name instead of dropping all of them, so a future security patch from nixpkgs survives - livox_sdk_config.hpp: honor $TMPDIR on Darwin instead of hardcoding /tmp Perf script: - compute cache_name inline (using the same inspect-based source_file pattern) instead of calling the inlined _build_cache_name method All 26 tests across test_change_detect.py + test_native_module.py + test_native_rebuild.py pass. ruff + mypy clean. mid360_native and fastlio2_native still build end-to-end on aarch64-darwin. Linux drvPaths for libpqxx/tiledb/pcl/vtk verified unchanged by the overlay.
I feel like I'm missing something since 1 second doesn't actually sound like an issue to me. This runs only in dev envs right? Build systems often take way longer then this. Why was this important for g1 dev? Change detection in the first place is a fancy feature and I don't mind this code in - since it's a generic file change detector that we might want elsewhere as well (I'd probably rely on modification times though) I'm not pushing back against this, just trying to understand this work. btw you can also use linux inotify api and autobuild while inotifywait -r -e modify,create,delete src; do make; doneor to piggyback on nativemodule internal rebuild step, just clean up the build artifacts while inotifywait -r -e modify,create,delete src; do echo "rebuild scheduled"; rm bin/result; done
Builds can get quite expensive so I wouldn't do this by default but I guess there are cases in which you want this yeah, in my mid360 dev I was doing |
Personally, I don't like the idea of adding code under the justification that it could be useful later. Code always has maintenance costs, and less code is always better. I've said this before, but I don't understand the need for all this change detecting code. Build systems should already do this. It could be justifiable if it was intolerably slow, but nix taking 1037.89ms doesn't seem bad. |
yeah I agree actually
yeah I'm also confused, seems like 1500 lines to shave off one second in dev, introduces dev/prod env config split etc |
how can these two things be true at the same time? |
it's prod vs dev in dev - if rebuild in dev takes 1 second to start is a non issue in prod deployments we are not rebuilding stuff, it's important that we don't re-compute a hash of every file in every native module for no reason (I'm assuming native modules can get pretty large and we'll lose startup time and compute every single time dimos runs) |
This PR already has another place where the tooling is used |
|
At the time this PR was created (and still an open question) not every build command is guareteed to use nix. So I figured it'd be nice to both: not call the build command (whatever it might be) merely never or always -- but rather when something changed. My timeline:
I'm just trying to get nav merged, I'm doing everything I can to be as generic/efficient/modular/small-pr's as possible we can revert this in 20min a week from now, I don't care about the feature itself. If you guys want a PROD=true env var that bypasses everything, that sounds fine to me. |
| # modules locally while the coordinator tries to send stop() RPCs, causing | ||
| # BrokenPipeErrors. | ||
| signal.signal(signal.SIGINT, signal.SIG_IGN) | ||
| instances: dict[int, Any] = {} |
There was a problem hiding this comment.
this is a problem that showed up more in native modules, but it is technically not related to native module rebuild. Let me know if this should be a one file PR
|
|
||
| # ctypes is only needed for the Linux child-preexec helper below. Hoisting | ||
| # the import out of the inner function avoids re-importing on every start(). | ||
| if sys.platform.startswith("linux"): |
There was a problem hiding this comment.
this is for better native module cleanup on shutdown
| // mkstemp replaces the 6 X's in place — e.g. livox_sdk_config.aB3xY9. | ||
| // Honor $TMPDIR when set (sandboxed macOS apps and CI runners point | ||
| // it at a per-process scratch dir); fall back to /tmp. | ||
| const char* tmpdir = std::getenv("TMPDIR"); |
There was a problem hiding this comment.
bad transfer, this should be in the rosnav merge, will remove (its macos support for livox)
Pull request was closed
Problem
Editing source code doesnt cause native modules to rebuild. Very easy to forget and not friendly for ai edits.
Solution
Some generic utils:
dimos/utils/change_detect.py: Content-hash-based file change detection usingxxhash.NativeModuleConfig.rebuild_on_change: Optionallist[str|Path|Glob]Breaking Changes
None
How to Test
Contributor License Agreement