Skip to content

native rebuild changes#1780

Closed
jeff-hykin wants to merge 20 commits intodevfrom
jeff/feat/native_rebuild2
Closed

native rebuild changes#1780
jeff-hykin wants to merge 20 commits intodevfrom
jeff/feat/native_rebuild2

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

Problem

Editing source code doesnt cause native modules to rebuild. Very easy to forget and not friendly for ai edits.

Solution

Some generic utils:

  • dimos/utils/change_detect.py: Content-hash-based file change detection using xxhash.
  • NativeModuleConfig.rebuild_on_change: Optional list[str|Path|Glob]

Breaking Changes

None

How to Test

# Change detection tests
pytest dimos/utils/test_change_detect.py -v

# Native module rebuild tests
pytest dimos/core/test_native_rebuild.py -v

# Native module crash/thread leak test
pytest dimos/core/test_native_module.py::test_process_crash_triggers_stop -v

# LCM isolation tests
pytest dimos/protocol/pubsub/test_pattern_sub.py -v
pytest dimos/protocol/pubsub/impl/test_lcmpubsub.py -v

# Full fast suite
pytest -m 'not (tool or slow or mujoco)' dimos/

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 12, 2026

Greptile Summary

This PR adds content-hash-based file change detection (dimos/utils/change_detect.py) and wires it into NativeModuleConfig.rebuild_on_change so that native modules automatically rebuild when their source files change.

  • _maybe_build cache-poisoning bug (native_module.py L400–407): did_change is called with the default update=True, writing the new hash to the cache before the build runs. If the build fails with RuntimeError, the cache already reflects the current source hash, so every subsequent _maybe_build() call sees no change and silently skips the rebuild — indefinitely. The fix is update=False on the pre-build check and update_cache(...) after confirmed success.
  • hash_paths None in f-string (mesh_utils.py L87): hash_paths returns str | None; using it directly in the f-string produces a literal "v3_None_…" cache directory for missing/unreadable URDF files.

Confidence Score: 4/5

Not safe to merge as-is: the cache-poisoning bug in _maybe_build means a single failed build silently suppresses all future rebuild attempts, which is the exact problem this PR is designed to solve.

One clear P1 defect in the primary feature path (_maybe_build updating the cache before build success) and one P1 in mesh_utils.py (None-in-f-string cache key). The change_detect utility itself is well-designed; the bug is in how it's consumed.

dimos/core/native_module.py (cache update ordering) and dimos/manipulation/planning/utils/mesh_utils.py (None guard for hash_paths).

Important Files Changed

Filename Overview
dimos/core/native_module.py Adds rebuild_on_change to NativeModuleConfig and _maybe_build change-detection logic; contains a P1 bug where did_change is called with update=True before the build runs, meaning a failed build permanently poisons the cache and suppresses all future rebuild attempts.
dimos/utils/change_detect.py New change-detection utility using xxhash with thread + flock cross-process locking; well-designed API with update=False / update_cache pattern; minor issue: clear_cache doesn't acquire locks before deleting the hash file.
dimos/manipulation/planning/utils/mesh_utils.py Migrates to hash_dict/hash_paths from change_detect; hash_paths return value is used directly in an f-string without a None guard, which silently produces a "v3_None_…" cache key for missing URDF files.
dimos/utils/test_change_detect.py Thorough tests for change_detect covering glob expansion, cache isolation, update=False semantics, relative paths, and the build workflow pattern.
dimos/core/test_native_rebuild.py Tests rebuild-on-change happy paths; does not cover the failed-build scenario that exposes the cache-poisoning bug in _maybe_build.
pyproject.toml Adds xxhash>=3.0.0 as a core dependency; appropriate since change_detect is used in core modules.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NativeModule.start] --> B[_maybe_build]
    B --> C{rebuild_on_change\nset AND exe exists?}
    C -- No --> D{exe exists?}
    C -- Yes --> E["did_change(update=True ⚠️)\nWrites new hash to cache\nBEFORE build runs"]
    E -- False\nno change --> D
    E -- True\nchanged --> F[needs_rebuild = True]
    F --> G[Run build_command]
    D -- Yes, no rebuild --> H[return early / no build]
    D -- No, or needs rebuild --> G
    G -- returncode != 0 --> I["raise RuntimeError\n(cache already updated ⚠️)"]
    G -- success --> J[exe verified exists]
    J --> K["did_change again\n(redundant – cache already current)"]
    K --> L[return]
    I --> M["Next _maybe_build call:\ndid_change → False\n→ skips rebuild forever ⚠️"]

    style E fill:#ffcccc,stroke:#cc0000
    style I fill:#ffcccc,stroke:#cc0000
    style M fill:#ffcccc,stroke:#cc0000
Loading

Reviews (1): Last reviewed commit: "native rebuild changes" | Re-trigger Greptile

Comment thread dimos/core/native_module.py Outdated
Comment thread dimos/manipulation/planning/utils/mesh_utils.py Outdated
Comment thread dimos/utils/change_detect.py Outdated
Comment thread dimos/core/native_module.py Outdated
Comment thread dimos/utils/change_detect.py Outdated
try:
# Include the path so additions/deletions/renames are detected
h.update(str(fpath).encode())
h.update(fpath.read_bytes())
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will load a full file into memory, ok for source files, but will for sure cause a weird issue for someone down the line. maybe easy hack for this is to check the file size and simply not hash if larger then 500k-1Mb or something

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to streaming to reduce ram overhead, still runs fast (still 100x or 1000x nix's change checking)

Added a max_file_size arg, files greater than max_file_size will switch to using the last modified date instead of actually hashing the file. For rebuild_on_change the user is always expected to only mention source code so probably a non-problem but max_file_size should help this tool be more general.

@leshy
Copy link
Copy Markdown
Contributor

leshy commented Apr 14, 2026

so not sure, haven't looked into details, but is there a reason we aren't relying nix build before each module start, since nix essentially does the above by itself?

Comment thread dimos/core/native_module.py Outdated
Comment thread dimos/core/native_module.py
Comment thread dimos/core/native_module.py Outdated
Comment thread dimos/core/native_module.py Outdated
Comment thread dimos/core/native_module.py Outdated
extra_env: dict[str, str] = Field(default_factory=dict)
shutdown_timeout: float = 10.0
log_format: LogFormat = LogFormat.TEXT
rebuild_on_change: list[PathEntry] | None = None
Copy link
Copy Markdown
Contributor

@leshy leshy Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw this is the first actual setting that's different between prod and dev, we def don't want this on deployments, so probably if toggled True should be done not in code but the local config file level or via local env vars.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the difference between prod and dev? I don't follow

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On production the code doesn't change so there's no need to rebuild. (I assume this is what Ivan meant.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this hashes all the files in the dir which can be quite costly for larger files, def don't want to run in prod

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes hashing all files is computationally expensive and we don't want prod deployments to do this during the startup

Copy link
Copy Markdown
Member Author

@jeff-hykin jeff-hykin Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the files in the dir

No, it doesn't hash all the files in the dir

quite costly for larger files

It has the "if large use last edit time" feature, so no its not costly on large files (and why would people point it at large files?)

On my mac rebuild_on_change runs in under 1 millisecond for the largest nav native module. Absolutely unnoticeable.

I feel like you guys are just going off theory

@jeff-hykin
Copy link
Copy Markdown
Member Author

jeff-hykin commented Apr 14, 2026

so not sure, haven't looked into details, but is there a reason we aren't relying nix build before each module start, since nix essentially does the above by itself?

Unfortunately yeah. I don't care much about speed, but rebuild_on_change is 1980x faster than nix's check on reboot, and merely 280x faster once nix has a "warm" cache. And that's for the mid360, which is absolutely tiny. For a native module like PathFollower being checked onboard the G1, nix takes 1037.89ms just to confirm nothing changed (rebuild_on_change runs in 5ms on the G1).

I'm not thrilled about rebuild_on_change, but it was important for development on the G1.

Now that I think about it though, we should probably have it always rebuild unless there is a rebuild_on_change, since its more or less an optimization

Twelve fixes from Paul-style code review of #1780:

native_module.py:
- _maybe_build: use did_change(update=False) + post-success update_cache so
  failed builds never poison the rebuild cache
- delete stray "Source files changed" log line that fired on every call
- stop(): per-instance _stop_lock; capture proc/watchdog refs under the
  lock, do signal/wait/join outside the lock to avoid the watchdog-self-
  join deadlock; second caller no-ops via _stopping
- _read_log_stream: receive pid as a parameter instead of reaching into
  self._process.pid (TOCTOU with stop())
- _child_preexec_linux + ctypes hoisted to module scope under a
  sys.platform guard; no more re-import per start()
- _clear_nix_executable: gracefully handle cwd=None (skip the walk
  entirely), use .resolve() comparison so a symlinked cwd still
  terminates the walk, refuse to walk past cwd's tree

change_detect.py:
- fold max_file_size into _hash_files digest so different thresholds
  against the same cache_name can't corrupt each other
- new _locked_cache_file context manager — flock the .hash file directly
  in "a+" mode; no more orphan .lock sidecars accumulating in the cache
  dir; did_change/update_cache/clear_cache all share the helper

Tests:
- new test_should_rebuild_true_bypasses_change_check for the explicit
  "always rebuild" path
- new test_failed_build_does_not_mark_cache_clean for the update=False
  retry semantics

Build system:
- tiledb darwin overlay: filter patches by name instead of dropping all
  of them, so a future security patch from nixpkgs survives
- livox_sdk_config.hpp: honor $TMPDIR on Darwin instead of hardcoding /tmp

Perf script:
- compute cache_name inline (using the same inspect-based source_file
  pattern) instead of calling the inlined _build_cache_name method

All 26 tests across test_change_detect.py + test_native_module.py +
test_native_rebuild.py pass. ruff + mypy clean. mid360_native and
fastlio2_native still build end-to-end on aarch64-darwin. Linux drvPaths
for libpqxx/tiledb/pcl/vtk verified unchanged by the overlay.
@leshy
Copy link
Copy Markdown
Contributor

leshy commented Apr 16, 2026

I don't care much about speed, but rebuild_on_change is 1980x faster than nix's check on reboot, and merely 280x faster once nix has a "warm" cache. And that's for the mid360, which is absolutely tiny. For a native module like PathFollower being checked onboard the G1, nix takes 1037.89ms just to confirm nothing changed (rebuild_on_change runs in 5ms on the G1).

I'm not thrilled about rebuild_on_change, but it was important for development on the G1.

I feel like I'm missing something since 1 second doesn't actually sound like an issue to me. This runs only in dev envs right? Build systems often take way longer then this. Why was this important for g1 dev?

Change detection in the first place is a fancy feature and ssh g1:cd dimos/bla && make seems ok to me as well

I don't mind this code in - since it's a generic file change detector that we might want elsewhere as well (I'd probably rely on modification times though) I'm not pushing back against this, just trying to understand this work.

btw you can also use linux inotify api and autobuild

while inotifywait -r -e modify,create,delete src; do make; done

or to piggyback on nativemodule internal rebuild step, just clean up the build artifacts

while inotifywait -r -e modify,create,delete src; do echo "rebuild scheduled"; rm bin/result; done

Now that I think about it though, we should probably have it always rebuild unless there is a rebuild_on_change, since its more or less an optimization

Builds can get quite expensive so I wouldn't do this by default but I guess there are cases in which you want this yeah, in my mid360 dev I was doing rm ./bin/result && build by hand after cpp changes (to see compiler output), would do inotifywait if I wanted autobuilds

@paul-nechifor
Copy link
Copy Markdown
Contributor

I don't mind this code in - since it's a generic file change detector that we might want elsewhere as well

Personally, I don't like the idea of adding code under the justification that it could be useful later. Code always has maintenance costs, and less code is always better.

I've said this before, but I don't understand the need for all this change detecting code. Build systems should already do this. It could be justifiable if it was intolerably slow, but nix taking 1037.89ms doesn't seem bad.

@leshy
Copy link
Copy Markdown
Contributor

leshy commented Apr 17, 2026

I don't mind this code in - since it's a generic file change detector that we might want elsewhere as well

Personally, I don't like the idea of adding code under the justification that it could be useful later. Code always has maintenance costs, and less code is always better.

yeah I agree actually

I've said this before, but I don't understand the need for all this change detecting code. Build systems should already do this. It could be justifiable if it was intolerably slow, but nix taking 1037.89ms doesn't seem bad.

yeah I'm also confused, seems like 1500 lines to shave off one second in dev, introduces dev/prod env config split etc

@jeff-hykin
Copy link
Copy Markdown
Member Author

yes hashing all files is computationally expensive and we don't want prod deployments to do this during the startup

I feel like I'm missing something since 1 second doesn't actually sound like an issue to me.

how can these two things be true at the same time?

@leshy
Copy link
Copy Markdown
Contributor

leshy commented Apr 17, 2026

yes hashing all files is computationally expensive and we don't want prod deployments to do this during the startup

I feel like I'm missing something since 1 second doesn't actually sound like an issue to me.

how can these two things be true at the same time?

it's prod vs dev

in dev - if rebuild in dev takes 1 second to start is a non issue

in prod deployments we are not rebuilding stuff, it's important that we don't re-compute a hash of every file in every native module for no reason (I'm assuming native modules can get pretty large and we'll lose startup time and compute every single time dimos runs)

@jeff-hykin
Copy link
Copy Markdown
Member Author

it could be useful later

This PR already has another place where the tooling is usedmesh_utils.py

@jeff-hykin
Copy link
Copy Markdown
Member Author

jeff-hykin commented Apr 17, 2026

At the time this PR was created (and still an open question) not every build command is guareteed to use nix. So I figured it'd be nice to both: not call the build command (whatever it might be) merely never or always -- but rather when something changed.

My timeline:

  1. manual rebuild is painful on G1 nav stack
  2. I add attempt rebuild every time
  3. (just for PR) I expect you guys would say rebuild every time is too slow, so I add rebuild_on_change
  4. (just for PR) I expect you guys to say "file hashing is not specific to native modules please put it in a util" so I put it in a util
  5. (just for PR) I expect the util to not be used, so I change mesh-whatever to use it
  6. (just for PR) I want to make this easier to review so I break it out into a separate PR
  7. (just for PR) wait a week between feedback

I'm just trying to get nav merged, I'm doing everything I can to be as generic/efficient/modular/small-pr's as possible

we can revert this in 20min a week from now, I don't care about the feature itself.

If you guys want a PROD=true env var that bypasses everything, that sounds fine to me.

# modules locally while the coordinator tries to send stop() RPCs, causing
# BrokenPipeErrors.
signal.signal(signal.SIGINT, signal.SIG_IGN)
instances: dict[int, Any] = {}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a problem that showed up more in native modules, but it is technically not related to native module rebuild. Let me know if this should be a one file PR


# ctypes is only needed for the Linux child-preexec helper below. Hoisting
# the import out of the inner function avoids re-importing on every start().
if sys.platform.startswith("linux"):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for better native module cleanup on shutdown

// mkstemp replaces the 6 X's in place — e.g. livox_sdk_config.aB3xY9.
// Honor $TMPDIR when set (sandboxed macOS apps and CI runners point
// it at a per-process scratch dir); fall back to /tmp.
const char* tmpdir = std::getenv("TMPDIR");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad transfer, this should be in the rosnav merge, will remove (its macos support for livox)

@jeff-hykin jeff-hykin closed this Apr 17, 2026
auto-merge was automatically disabled April 17, 2026 05:01

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants