Add program caches (in-memory, sqlite, filestream)#1912
cpcloud wants to merge 17 commits into NVIDIA:main from
Conversation
Generated with the help of Cursor GPT-5.4 Extra High Fast High:
Thanks, Phillip! I have this PR in my review backlog 🙏 The most important question: are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress-test. In CuPy, our on-disk cache has been stress-tested on DOE supercomputers.
Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note). High: source-directory include. Medium: over-eviction race. Low: star-import. Added a note in
@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:
Cross-process coverage in
One concurrency bug this review shook out (over-eviction after a suppressed FileNotFoundError)
Convert cuda.core.utils to a package and add ObjectCode caches for
artifacts produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager. Path-backed ObjectCode is rejected at write
time (would store only the path, not the bytes).
* InMemoryProgramCache -- in-process OrderedDict backend that
stores entries by reference (no pickling). Optional max_entries
and max_size_bytes caps with LRU eviction. __getitem__ promotes
LRU; __contains__ is read-only. threading.RLock serialises every
method.
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction and an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the cap bounds real on-disk usage. __contains__ is
read-only. __len__ prunes corrupt rows. Schema-mismatch on open
drops tables and rebuilds; corrupt / non-SQLite files reinitialise
empty; transient OperationalError propagates without nuking the
file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. blake2b(32)
hashed filenames so arbitrary-length keys never overflow
filesystem limits. Reader pruning, clear(), and _enforce_size_cap
are all stat-guarded (inode/size/mtime snapshot; refuse unlink on
mismatch) so a concurrent writer's os.replace is preserved.
_enforce_size_cap also decrements its running ``total`` when a
concurrent deleter wins the unlink race, so a suppressed
FileNotFoundError cannot over-evict newly committed entries.
Stale temp files swept on open; live temps count toward the size
cap. Windows ERROR_SHARING_VIOLATION (32) and ERROR_LOCK_VIOLATION
(33) on os.replace are retried with bounded backoff (~185ms)
before being treated as a non-fatal cache miss; other
PermissionError and all POSIX failures propagate.
* make_program_cache_key -- stable 32-byte blake2b digest over code,
code_type, ProgramOptions, target_type, name expressions, and
environment probes: cuda-core version, NVRTC version, NVVM lib+IR
version, linker backend+version for PTX inputs (driver version
only on the cuLink path). Backend-specific gates mirror
Program/Linker:
- code_type lower-cased to match Program_init.
- code_type/target_type validated against Program's
SUPPORTED_TARGETS matrix.
- NVRTC side-effect options (create_pch, time,
fdevice_time_trace) and external-content options
(include_path, pre_include, pch, use_pch, pch_dir) require
an extra_digest. NVVM use_libdevice=True likewise. NVRTC
options.name with a directory component (e.g. '/abs/k.cu')
also requires extra_digest (or no_source_include=True) because
NVRTC searches that directory for #include "..." lookups;
bare labels fall back to CWD and stay accepted.
- extra_sources rejected for non-NVVM; bytes-like ``code``
rejected for non-NVVM.
- PTX (Linker) options pass through per-field gates that match
_prepare_nvjitlink_options / _prepare_driver_options;
ptxas_options canonicalised across str/list/tuple/empty
shapes; driver-linker hard rejections (time, ptxas_options,
split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma
collapse under the driver linker.
- name_expressions gated on backend == "nvrtc".
- Failed environment probes mix the exception class name into a
*_probe_failed label so broken environments never collide
with working ones while staying stable across processes and
repeated calls.
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
not pull in any cache backend. The cache classes and
make_program_cache_key are exposed via module __getattr__.
_LAZY_CACHE_ATTRS is a single ordered tuple spliced into __all__ via
``*_LAZY_CACHE_ATTRS`` so the two lists cannot drift; star-import
still walks __all__ and therefore resolves every lazy attribute,
which is expected given star-imports are discouraged anyway.
sqlite3 is imported lazily inside SQLiteProgramCache.__init__ so the
package is usable on interpreters built without libsqlite3.
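The lazy-attribute mechanism described above follows the PEP 562 module-`__getattr__` pattern. A self-contained sketch of that pattern (the module body, `demo_utils` name, and attribute names below are stand-ins, not the actual cuda.core.utils source):

```python
import sys
import types

# Sketch of the PEP 562 module-__getattr__ pattern described above.
# The module contents and names here are illustrative stand-ins.
lazy_src = '''
_LAZY_CACHE_ATTRS = ("ExpensiveCache",)
__all__ = ["cheap_attr", *_LAZY_CACHE_ATTRS]  # spliced so the lists cannot drift
cheap_attr = "always available"

def __getattr__(name):
    if name in _LAZY_CACHE_ATTRS:
        value = type(name, (), {})  # stand-in for a deferred heavy import
        globals()[name] = value     # cache it: __getattr__ fires at most once
        return value
    raise AttributeError(name)
'''
mod = types.ModuleType("demo_utils")
exec(lazy_src, mod.__dict__)
sys.modules["demo_utils"] = mod

import demo_utils

print(demo_utils.cheap_attr)               # lazy machinery untouched
print(demo_utils.ExpensiveCache.__name__)  # resolved on first attribute access
```

Because the resolved value is written back into the module's globals, the `__getattr__` hook only pays the import cost once per attribute.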
Tests: ~200 cache tests covering single-process CRUD for all three
backends; LRU/size-cap (logical and on-disk, including stat-guarded
race scenarios); over-eviction race (monkeypatched Path.unlink);
InMemory combined caps, overwrite-updates-size, LRU-touch-on-read,
contains-does-not-bump, degenerate caps (single entry > cap,
max_entries=0); NVRTC source-directory path-name guard with
POSIX/Windows separators and both accept paths; corruption +
__len__ pruning; schema-mismatch table-DROP; threaded SQLite and
InMemory (4 writers + 4 readers x 200 ops); cross-process
FileStream stress (writer/reader race exercising the stat-guard
prune; clear/eviction race injection via generator cleanup);
Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow
+ retry, others propagate; partial-conn close on OperationalError);
lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity
test that parses _program.pyx via tokenize + ast.literal_eval; and
end-to-end real CUDA C++ compile -> store -> reopen -> get_kernel
roundtrip parametrized over the two persistent backends.
Closes NVIDIA#177
Closes NVIDIA#178
Closes NVIDIA#179
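The key-derivation scheme above can be sketched in isolation. This is a minimal illustration of a stable blake2b(32) digest over named fields plus the failed-probe labelling; the field serialisation (length-prefixing, sorted names) and the helper names `stable_digest`/`probe`/`missing_nvrtc` are assumptions for the sketch, not the real make_program_cache_key internals:

```python
import hashlib

def stable_digest(fields: dict[str, str]) -> bytes:
    # Length-prefix every component so ("ab", "c") and ("a", "bc") cannot
    # collide, and sort field names so insertion order never changes the key.
    h = hashlib.blake2b(digest_size=32)
    for name in sorted(fields):
        for part in (name, fields[name]):
            data = part.encode()
            h.update(len(data).to_bytes(8, "little"))
            h.update(data)
    return h.digest()

def probe(label: str, fn) -> str:
    # On failure, mix the exception class name into a "*_probe_failed" value:
    # broken environments never collide with working ones, yet the label is
    # stable across processes and repeated calls.
    try:
        return str(fn())
    except Exception as exc:
        return f"{label}_probe_failed:{type(exc).__name__}"

def missing_nvrtc():
    raise OSError("libnvrtc not found")  # simulated broken environment

key = stable_digest({
    "code": 'extern "C" __global__ void k() {}',
    "code_type": "c++",
    "nvrtc_version": probe("nvrtc", missing_nvrtc),
})
print(len(key))                          # 32
print(probe("nvrtc", missing_nvrtc))     # nvrtc_probe_failed:OSError
```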
Summary
- Convert `cuda.core.utils` from a module to a package; expose cache APIs lazily via `__getattr__` so `from cuda.core.utils import StridedMemoryView` stays lightweight. `_LAZY_CACHE_ATTRS` is a single ordered tuple spliced into `__all__` via `*_LAZY_CACHE_ATTRS`, and the module docstring notes that the laziness guarantee is for explicit imports only (star-import walks `__all__` and therefore resolves every lazy attribute).
- `ProgramCacheResource` ABC with `bytes | str` keys, a context manager, a pickle-safety warning, and rejection of path-backed `ObjectCode` at write time.
- `make_program_cache_key()`: blake2b(32) digest with backend-specific gates that mirror `Program`/`Linker`:
  - Validates `code_type`/`target_type` against `Program.compile`'s `SUPPORTED_TARGETS`; rejects bytes-like `code` and `extra_sources` for non-NVVM.
  - NVRTC side-effect (`create_pch`, `time`, `fdevice_time_trace`) and external-content (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`) options require `extra_digest`; NVVM `use_libdevice=True` likewise. `options.name` with a directory component (e.g. `/path/to/kernel.cu`) also requires `extra_digest` because NVRTC searches that directory for `#include "..."` lookups; bare labels (`"default_program"`, `"kernel-a"`) fall back to CWD and stay accepted. `no_source_include=True` disables the search and the guard.
  - PTX (Linker) options pass through per-field gates that match `_prepare_nvjitlink_options`/`_prepare_driver_options`; `ptxas_options` is canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (`time`, `ptxas_options`, `split_compile`) raise at key time; `ftz`/`prec_div`/`prec_sqrt`/`fma` collapse under the driver linker.
  - Failed environment probes mix the exception class name into a `*_probe_failed` label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
- `InMemoryProgramCache`, `SQLiteProgramCache`, `FileStreamProgramCache`, all of which implement `ProgramCacheResource`. See Backends below for design, benefits, and tradeoffs of each.
- `Program.compile(cache=...)` as the one-line convenience entry point: derives the key via `make_program_cache_key`, returns the cached `ObjectCode` on hit, compiles + stores on miss.
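The hit/miss control flow of the `cache=` entry point can be sketched with stand-in stubs. Everything below (`make_key`, `expensive_compile`, the dict-backed cache) is hypothetical scaffolding; only the derive-key / return-on-hit / compile-and-store-on-miss shape mirrors the description above:

```python
import hashlib

def make_key(code: str, code_type: str) -> bytes:
    # Stand-in for make_program_cache_key: any stable digest works here.
    return hashlib.blake2b(f"{code_type}\x00{code}".encode(), digest_size=32).digest()

compile_calls = 0

def expensive_compile(code: str) -> bytes:
    global compile_calls
    compile_calls += 1
    return b"CUBIN:" + code.encode()  # stand-in for a real compilation artifact

def compile_with_cache(code: str, code_type: str, cache: dict) -> bytes:
    key = make_key(code, code_type)
    hit = cache.get(key)
    if hit is not None:          # hit: return the cached artifact
        return hit
    obj = expensive_compile(code)
    cache[key] = obj             # miss: compile, then store
    return obj

cache: dict = {}
a = compile_with_cache("__global__ void k() {}", "c++", cache)
b = compile_with_cache("__global__ void k() {}", "c++", cache)
print(compile_calls, a == b)  # 1 True
```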
Options that require `extra_digest` (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`, NVVM `use_libdevice=True`, NVRTC `options.name` with a directory component) raise a clear `ValueError` pointing callers at the manual `make_program_cache_key(...)` pattern for those cases.

Backends
All three implement `ProgramCacheResource` and share the key schema. The two persistent backends pickle `ObjectCode` at `pickle.HIGHEST_PROTOCOL`; the in-memory backend stores it by reference. They differ in storage, concurrency model, and eviction policy.

| Backend | Storage | Concurrency | Eviction |
| --- | --- | --- | --- |
| `InMemoryProgramCache` | `OrderedDict` (no pickling) | single process, thread-safe (`RLock`) | `max_entries` + `max_size_bytes` |
| `SQLiteProgramCache` | single SQLite file | thread-safe (`RLock`); multi-process possible but not the recommended shape | read-aware LRU (`accessed_at` updated on reads); hard `max_size_bytes` at quiescent points |
| `FileStreamProgramCache` | directory of entry files | cross-process (atomic `os.replace`; stat-guarded prunes) | write-order LRU by `mtime` (oldest written); soft `max_size_bytes` |

InMemoryProgramCache

Design
- A `collections.OrderedDict` mapping key-digest → `(ObjectCode, size)`. Insertion order encodes LRU: oldest at the front, newest at the back. Values are stored by reference (no pickle round-trip), which is why lookups are the fastest of the three.
- `__getitem__` moves the entry to the back to promote it. `__contains__` is read-only, so a membership probe doesn't shift LRU order.
- `__setitem__` updates the entry and then calls `_evict_to_caps()`, which pops from the front until both optional caps (`max_entries`, `max_size_bytes`) are satisfied.
- A `threading.RLock` serialises every method, so a reader's LRU bump and a writer's eviction can't interleave.

Benefits
Tradeoffs
- Stored by reference: mutating a returned `ObjectCode` mutates the cached entry.

Use when artifacts only need to live for the lifetime of the process.
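The LRU mechanics described above can be condensed into a standalone sketch (a toy class, not the real `InMemoryProgramCache`: `TinyLRUCache` and its internals are illustrative):

```python
import threading
from collections import OrderedDict

class TinyLRUCache:
    """Toy sketch of the mechanics described above: oldest entries sit at
    the front, __getitem__ promotes, __contains__ does not, and writes
    evict from the front until both optional caps are satisfied."""

    def __init__(self, max_entries=None, max_size_bytes=None):
        self._d = OrderedDict()          # key -> (value, size)
        self._total = 0
        self._max_entries = max_entries
        self._max_size = max_size_bytes
        self._lock = threading.RLock()   # serialise every method

    def __contains__(self, key):
        with self._lock:
            return key in self._d        # read-only: no LRU bump

    def __getitem__(self, key):
        with self._lock:
            value, _ = self._d[key]
            self._d.move_to_end(key)     # promote to most-recently-used
            return value

    def __setitem__(self, key, pair):
        value, size = pair
        with self._lock:
            if key in self._d:
                self._total -= self._d[key][1]   # overwrite updates size
            self._d[key] = (value, size)
            self._total += size
            self._evict_to_caps()

    def _evict_to_caps(self):
        while self._d and (
            (self._max_entries is not None and len(self._d) > self._max_entries)
            or (self._max_size is not None and self._total > self._max_size)
        ):
            _, (_, size) = self._d.popitem(last=False)  # pop oldest
            self._total -= size

c = TinyLRUCache(max_entries=2)
c["a"] = ("A", 1)
c["b"] = ("B", 1)
_ = c["a"]            # bump "a", so "b" is now the oldest
c["c"] = ("C", 1)     # evicts "b"
print(sorted(k for k in ("a", "b", "c") if k in c))  # ['a', 'c']
```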
SQLiteProgramCache

Design
- `entries` table: blake2b key-digest PK (BLOB), pickled `ObjectCode` payload (BLOB), `size_bytes`, `created_at`, `accessed_at` (REAL), with an index on `accessed_at` for LRU scans. A `schema_meta` table records `_SQLITE_SCHEMA_VERSION`.
- Reads update `accessed_at`, so eviction always removes the genuinely least-recently-used row.
- When `max_size_bytes` is set, delete from the head of `ORDER BY accessed_at ASC` until the running sum is under the cap, then run `wal_checkpoint(TRUNCATE)` + `VACUUM` to reclaim disk.
- A `threading.RLock` serialises connection use; `check_same_thread=False` lets one cache move between threads.
- `DatabaseError` (corruption-shaped) wipes the DB plus its `-wal`/`-shm` companions and reinitialises empty; `OperationalError` (lock/busy) propagates without nuking the file and closes any partial connection.

Benefits
- `wal_checkpoint(TRUNCATE)` + `VACUUM` bounds real on-disk size after evictions.

Tradeoffs
- `VACUUM`/`wal_checkpoint(TRUNCATE)` are skipped while any reader or writer is active, so on-disk size drifts above `max_size_bytes` until activity settles. For strict on-disk bounds under concurrent load, `FileStreamProgramCache` is the right backend.
- Pickle round-trips on every access (slower than `InMemoryProgramCache`).

Use when you want single-process persistent caching under a hard size cap where eviction should reflect actual access frequency rather than write order. The unique win over `FileStreamProgramCache` is read-aware LRU.

FileStreamProgramCache

Design
- One file per entry at `<root>/entries/<blake2b-digest>`, holding a pickled `(schema, stored_key, payload, created_at)` record where `payload` is the pickled `ObjectCode`. A sibling `SCHEMA_VERSION` file records `_FILESTREAM_SCHEMA_VERSION`; a mismatch wipes incompatible entries on open.
- Reads verify `stored_key` against the requested key, so a hash collision surfaces as a key mismatch, not silent corruption.
- Writes go to `<root>/tmp/<uuid>`, `fsync`, then `os.replace` into place. Readers never observe a partial entry. On Windows, `os.replace` retries with bounded backoff (~185 ms) on `ERROR_SHARING_VIOLATION`/`ERROR_LOCK_VIOLATION` before dropping to a non-fatal cache miss.
- `_enforce_size_cap()` lists entries with a stat snapshot, sorts by `mtime`, and unlinks oldest-first. Each unlink is stat-guarded: `_prune_if_stat_unchanged()` compares `(ino, size, mtime_ns)` against the snapshot and refuses if they differ, so a fresh entry a peer just committed via `os.replace` survives eviction. The running `total` decrements whenever a peer wins the unlink race, so over-eviction after a suppressed `FileNotFoundError` can't cascade.

Benefits
Tradeoffs
- Eviction is write-order LRU by `mtime`, so under heavy read reuse a hot entry can be dropped because it was written earliest.
- The `max_size_bytes` cap is soft; concurrent writers may briefly exceed it.
- Entries are `fsync`ed only; the containing directory is not `fsync`ed, so a host crash between the write and the next directory commit may lose recently added entries. Surviving entries remain consistent.

Use when multiple processes may hit the cache: parallel build workers, pytest-xdist, distributed training launchers, or any setup with several writers against one cache.
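The atomic-commit and stat-guarded prune core can be sketched as follows. This is an illustrative reduction, assuming simplified helpers (`write_entry`, `snapshot`, `prune_if_stat_unchanged`); the real backend adds key hashing, schema records, Windows retries, and size accounting on top:

```python
import os
import tempfile
import uuid
from pathlib import Path

# Toy cache directory with the tmp/ + entries/ layout described above.
root = Path(tempfile.mkdtemp())
(root / "tmp").mkdir()
(root / "entries").mkdir()

def write_entry(name: str, payload: bytes) -> None:
    tmp = root / "tmp" / uuid.uuid4().hex
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())                  # durability for the entry's bytes
    os.replace(tmp, root / "entries" / name)  # atomic: readers never see partials

def snapshot(path: Path):
    st = path.stat()
    return (st.st_ino, st.st_size, st.st_mtime_ns)

def prune_if_stat_unchanged(path: Path, snap) -> bool:
    # Refuse to unlink if the file changed since the eviction was planned:
    # a peer's os.replace swapped in a fresh entry that must not be deleted.
    try:
        if snapshot(path) != snap:
            return False
        path.unlink()
        return True
    except FileNotFoundError:
        return False  # a concurrent deleter won the race

write_entry("k1", b"old")
entry = root / "entries" / "k1"
snap = snapshot(entry)
write_entry("k1", b"fresh-from-a-peer")      # a "peer" replaces the entry
print(prune_if_stat_unchanged(entry, snap))  # False: the fresh entry survives
print(entry.read_bytes())
```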
Examples
`Program.compile` accepts a `cache=` keyword that integrates with any `ProgramCacheResource`, so the typical pattern is a single call that transparently handles key derivation, lookup, and store on miss. The call shape is identical for all three backends.

Options that require `extra_digest` (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`, NVVM `use_libdevice=True`, NVRTC `options.name` with a directory component) raise `ValueError` from `make_program_cache_key`; those compiles need the manual `make_program_cache_key(..., extra_digest=...)` + `cache.get` / `cache[key] = ...` pattern.

The differences between the backends are in how each is constructed and what guarantees it offers.
In-process hot loop: `InMemoryProgramCache`

Notebook or REPL compiling many kernel variants (parameter sweeps, autotuning). Fastest, lives for the process.
Per-user persistent cache: `SQLiteProgramCache`

Single-user CLI tool or long-running service on one machine. One file on disk, reopen across runs, read-aware LRU so hot entries survive eviction.
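The read-aware eviction that makes this backend suit per-user persistence can be sketched with a bare sqlite3 schema (a deliberately simplified, in-memory illustration; the real `SQLiteProgramCache` schema, locking, and checkpointing are richer):

```python
import sqlite3

# Minimal read-aware LRU sketch: reads bump accessed_at, eviction walks
# ORDER BY accessed_at ASC until the size cap holds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (key BLOB PRIMARY KEY, payload BLOB,"
    " size_bytes INTEGER, accessed_at REAL)"
)
conn.execute("CREATE INDEX idx_accessed ON entries (accessed_at)")

def put(key, payload, now):
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)",
        (key, payload, len(payload), now),
    )

def get(key, now):
    row = conn.execute("SELECT payload FROM entries WHERE key = ?", (key,)).fetchone()
    if row is not None:
        # Reads update accessed_at, so eviction removes the genuinely
        # least-recently-used row rather than the oldest-written one.
        conn.execute("UPDATE entries SET accessed_at = ? WHERE key = ?", (now, key))
    return None if row is None else row[0]

def evict_to_cap(max_size_bytes):
    total = conn.execute("SELECT COALESCE(SUM(size_bytes), 0) FROM entries").fetchone()[0]
    for key, size in conn.execute(
        "SELECT key, size_bytes FROM entries ORDER BY accessed_at ASC"
    ).fetchall():
        if total <= max_size_bytes:
            break
        conn.execute("DELETE FROM entries WHERE key = ?", (key,))
        total -= size

put(b"hot", b"x" * 10, now=1.0)
put(b"cold", b"y" * 10, now=2.0)
get(b"hot", now=3.0)                # the re-read keeps "hot" alive under pressure
evict_to_cap(max_size_bytes=10)
print(get(b"hot", now=4.0) is not None, get(b"cold", now=4.0))  # True None
```

Under write-order eviction, `hot` (written first) would have been dropped instead; the `accessed_at` bump is what keeps it alive.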
Parallel workers: `FileStreamProgramCache`

pytest-xdist, CI matrix, or any multi-process build system. Every worker opens the same directory; atomic `os.replace` commits keep concurrent writers safe.

Read-aware vs write-order LRU
The two persistent backends diverge when `max_size_bytes` is tight and one entry is being re-read while others are being written. For read-heavy single-process workloads, `SQLiteProgramCache` keeps the hot entry alive. For multi-process workloads, the lack of cross-process LRU coordination is what makes `FileStreamProgramCache` safe under concurrent writers; the tradeoff usually goes that way.

Test plan
~200 cache tests total, grouped as:
- `InMemoryProgramCache`: combined caps, overwrite-updates-size, LRU-touch-on-read, contains-does-not-bump-LRU, degenerate caps (single entry > cap, `max_entries=0`)
- Corruption handling and `__len__` pruning of bad rows/files
- Schema-mismatch table DROP on `SQLiteProgramCache` open
- Threaded `SQLiteProgramCache` stress (4 writers + 4 readers × 200 ops)
- Threaded `InMemoryProgramCache` stress
- Cross-process `FileStreamProgramCache` stress: writer/reader race exercising the stat-guard prune; `clear()` / eviction race injection via generator cleanup
- Over-eviction race: a monkeypatched `Path.unlink` simulates a concurrent deleter winning exactly once; asserts the fresh entry survives
- Windows vs POSIX `PermissionError` narrowing: winerror 32/33 swallow + retry, all other codes propagate; partial-connection close on `OperationalError`
- Lazy-import subprocess test: `from cuda.core.utils import StridedMemoryView` doesn't pull in the cache modules
- `_SUPPORTED_TARGETS_BY_CODE_TYPE` parity test parses `_program.pyx` via `tokenize` + `ast.literal_eval` to keep the cache-key validator in sync with `Program.compile`'s supported-target map
- End-to-end real CUDA C++ compile → store → reopen → `get_kernel` on the deserialised `ObjectCode`, parametrized over the two persistent backends

Closes #176
Closes #177
Closes #178
Closes #179