@ndgrigorian @vlad-perevezentsev @antonwolfy @alvarovm
Just wanted to reach out about a setup we have been working on while porting an application from cupy to dpnp.
App: Heavily uses cupy and hand-written CUDA kernels. These were ported to DPNP and SYCL kernels for Intel PVC (on Aurora).
Intent: The app relies on `stream = cupy.cuda.get_current_stream()`, `stream.synchronize()`, and passing the obtained stream down to user CUDA kernels: a pipeline methodology where a given instance/process is tied to a single stream, from the upstream cupy calls through to the downstream user CUDA kernels.
With the port to DPNP & SYCL (currently), we have relied on a single queue for the user kernels, which the dpnp/dpctl ops weren't aware of. Since all queues were created with the same default context, running with `ZE_SERIALIZE=2` and `OverrideImmediateCmdListSynchronousMode=1` helped with the initial correctness testing: everything got synchronized, and the default context being shared by any number of queues wasn't a problem.
Goal: The intent now is to move away from the above design and explore whether upstream dpctl can provide us a single master (in-order) queue per device, `q = dpctl.SyclQueue(devs[device_id], property="in_order")`, which we can also pass down to the user SYCL kernels. Since everything is in-order, we expect FIFO behavior similar to a `cudaStream`.
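For context, the per-device master-queue idea we have in mind looks roughly like the sketch below. This is a minimal illustration: `get_master_queue`, the cache, and the injectable `queue_factory` are our own names (the factory injection and deferred import are only there so the caching logic is self-contained), while `dpctl.SyclQueue(dev, property="in_order")` is the documented way to request an in-order queue.

```python
_queue_cache = {}

def get_master_queue(device, queue_factory=None):
    """Return one cached in-order (FIFO) queue per device, cudaStream-style."""
    if queue_factory is None:
        import dpctl  # deferred import; real use would just call dpctl directly
        queue_factory = lambda d: dpctl.SyclQueue(d, property="in_order")
    if device not in _queue_cache:
        _queue_cache[device] = queue_factory(device)
    return _queue_cache[device]
```

Both the dpnp ops (via their `sycl_queue=` keyword) and the user SYCL kernels would then be handed the same queue object, so the in-order property itself provides the FIFO guarantee.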
A few questions, if you can provide any pointers:
0. Would the above-mentioned design be feasible?
- Since there is no equivalent of `stream = cupy.cuda.get_current_stream()`, we tried something like `from ._sycl_queue_manager import get_device_cached_queue`, but the queue obtained from `get_device_cached_queue` is inherently an out-of-order queue, which doesn't really suit our broader needs in the user app, since we are trying to mimic the `cudaStream` FIFO infrastructure. So we had to start using `q = dpctl.SyclQueue(devs[device_id], property="in_order")`.
- Does `SequentialOrderManager` track externally-submitted work on a dpctl-obtained `sycl::queue`?
More details:
Nothing in `SequentialOrderManager` observes the queue itself. So if a user extension does:
```python
q = dpctl.SyclQueue()
a = dpt.arange(..., sycl_queue=q)         # dpctl registers events
my_ext.launch_kernel(q, a.usm_data, ...)  # user submits directly, NOT registered
b = dpt.sum(a, ...)                       # dpctl uses its tracked deps only
```
then the `my_ext.launch_kernel` event is invisible to the manager. A subsequent `dpt.sum` submission's `depends_on` will not include it, and `SequentialOrderManager[q].wait()` will return before the user kernel completes.
Is the following characterization correct?
In-order queue: user-submitted kernels are still correctly ordered against later dpctl submissions because the queue itself serializes, so the data dependency is safe. But `SequentialOrderManager[q].wait()` may return early because it only waits on events it knows about.
Out-of-order queue: nothing orders the user's kernel against later dpctl submissions, so without the user explicitly passing `som.submitted_events` as deps and registering their own event back, this is a genuine race on shared USM.
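The early-return behavior described for the in-order case can be modeled with a toy tracker in pure Python (a stand-in illustration, not dpctl/dpnp code; `ToyOrderManager` and the thread-based "kernels" are our own invention): the manager waits only on events it was told about, so unregistered work is still in flight when `wait()` returns.

```python
import threading
import time

class ToyOrderManager:
    """Stand-in for SequentialOrderManager: waits only on registered events."""
    def __init__(self):
        self._events = []

    def register(self, ev):
        self._events.append(ev)

    def wait(self):
        for ev in self._events:
            ev.wait()

mgr = ToyOrderManager()

# Work submitted through the library: its completion event gets registered.
tracked = threading.Event()
threading.Thread(target=lambda: (time.sleep(0.05), tracked.set())).start()
mgr.register(tracked)

# "External" user kernel submitted directly to the queue, never registered.
external = threading.Event()
threading.Thread(target=lambda: (time.sleep(0.5), external.set())).start()

mgr.wait()  # returns once the tracked work finishes...
external_done_at_wait = external.is_set()
print(external_done_at_wait)  # False: the unregistered work is still in flight
```

In real code the fix would be the cooperation pattern the question above sketches: the extension passes the manager's `submitted_events` as its `depends_on` list and registers its own completion event back with the manager.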
I understand this is a lot, and I would appreciate any feedback. There is also a rich set of DPC++ extensions that could help make this robust and lightweight without explicit bookkeeping; we would be more than happy to contribute if there is sufficient interest.