
(Question) Does SequentialOrderManager track externally-submitted work on a shared sycl::queue #2293

@abagusetty

Description


@ndgrigorian @vlad-perevezentsev @antonwolfy @alvarovm

Just wanted to reach out about a setup we have been working with: porting an application that uses cupy to dpnp.

App: Heavily uses cupy and hand-written CUDA kernels. These were ported to DPNP and SYCL kernels for Intel PVC (on Aurora).

Intent: The app relies on stream = cupy.cuda.get_current_stream() and stream.synchronize(), and passes the obtained stream down to user CUDA kernels. It is a pipeline methodology where a given instance/process is tied to a single stream, from the upstream cupy calls through to the downstream user CUDA kernels.

With the port to DPNP & SYCL (currently), we have relied on a single queue written for the user kernels, and the dpnp/dpctl ops were not aware of this queue. Since all queues were created with the same default context, setting ZE_SERIALIZE=2 and OverrideImmediateCmdListSynchronousMode=1 helped with initial correctness testing: everything got synchronized, and sharing the default context across any number of queues was not a problem.

Goal: The intent now is to move away from the above design and explore whether upstream dpctl can provide us a single master (in-order) queue per device, q = dpctl.SyclQueue(devs[device_id], property="in_order"), which we can also pass down to the user SYCL kernels. Since everything is in-order, we expect FIFO behavior similar to a cudaStream.
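To make the FIFO expectation concrete, here is a minimal toy sketch in plain Python (no dpctl, no SYCL hardware required): a single worker thread draining a queue plays the role of an in-order queue, so submissions complete strictly in submission order, analogous to a cudaStream.

```python
import queue
import threading


class ToyInOrderQueue:
    """Toy stand-in for an in-order queue: one worker thread drains
    submissions strictly in FIFO order, like a cudaStream."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: shut the worker down
                break
            task()

    def submit(self, fn):
        self._tasks.put(fn)

    def synchronize(self):
        # Enqueue the sentinel and wait until everything submitted
        # before it has run (one-shot, sufficient for this toy).
        self._tasks.put(None)
        self._worker.join()


# Submissions complete in exactly the order they were enqueued.
order = []
q = ToyInOrderQueue()
for i in range(5):
    q.submit(lambda i=i: order.append(i))
q.synchronize()
print(order)  # [0, 1, 2, 3, 4]
```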

A few questions, if you can provide any pointers:

0. Would the above-mentioned design be feasible?
1. Since there is no equivalent of stream = cupy.cuda.get_current_stream(), we have tried something like from ._sycl_queue_manager import get_device_cached_queue, but the queue obtained from get_device_cached_queue is inherently an out-of-order (OoO) queue, which doesn't really suit our broader needs in the user app, since we are trying to mimic infrastructure based on cudaStream FIFO semantics. So we had to start using q = dpctl.SyclQueue(devs[device_id], property="in_order").
2. Does SequentialOrderManager track externally-submitted work on a dpctl-obtained sycl::queue?

More details:
Nothing in SequentialOrderManager observes the queue itself. So if a user extension does:

```python
q = dpctl.SyclQueue()
a = dpt.arange(..., sycl_queue=q)          # dpctl registers events
my_ext.launch_kernel(q, a.usm_data, ...)   # user submits directly, NOT registered
b = dpt.sum(a, ...)                        # dpctl uses its tracked deps only
```

then the my_ext.launch_kernel event is invisible to the manager. A subsequent dpt.sum submission's depends_on will not include it, and SequentialOrderManager[q].wait() will return before the user kernel completes.
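This bookkeeping gap can be modeled with a toy sketch in plain Python (no dpctl): the tracker below plays the role of SequentialOrderManager, and the "external" event stands in for the my_ext.launch_kernel submission it never sees.

```python
import threading


class ToyOrderManager:
    """Toy stand-in for SequentialOrderManager: wait() can only wait
    on events that were explicitly registered with it."""

    def __init__(self):
        self.submitted_events = []

    def register(self, event):
        self.submitted_events.append(event)

    def wait(self):
        for ev in self.submitted_events:
            ev.wait()


manager = ToyOrderManager()

tracked = threading.Event()    # models a dpctl-submitted kernel's event
external = threading.Event()   # models my_ext.launch_kernel: never registered

manager.register(tracked)
tracked.set()                  # the tracked kernel completes

manager.wait()                 # returns immediately...
print(external.is_set())       # False: the external kernel is still "running"
```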
Is the following characterization correct?

- In-order queue: user-submitted kernels are still correctly ordered against later dpctl submissions because the queue itself serializes, so the data dependency is safe. But SequentialOrderManager[q].wait() may return early because it only waits on events it knows about.
- Out-of-order queue: nothing orders the user's kernel against later dpctl submissions, so without the user explicitly passing som.submitted_events as deps and registering their own event back, this is a genuine race on shared USM.
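For the out-of-order case, the explicit bookkeeping described above can be sketched with the same kind of toy model in plain Python: the user extension snapshots the manager's tracked events as dependencies and registers its own completion event back, after which wait() covers it. (Only submitted_events and wait() are names from the discussion; register() and everything else here is a made-up toy, not the real dpctl/dpnp API.)

```python
import threading


class ToyOrderManager:
    """Toy stand-in for SequentialOrderManager (see discussion above)."""

    def __init__(self):
        self.submitted_events = []

    def register(self, event):
        self.submitted_events.append(event)

    def wait(self):
        for ev in self.submitted_events:
            ev.wait()


def launch_external_kernel(manager):
    """Toy user extension doing the explicit bookkeeping."""
    deps = list(manager.submitted_events)  # snapshot current deps
    done = threading.Event()

    def kernel():
        for dep in deps:   # honor the manager's known ordering
            dep.wait()
        done.set()         # signal our own completion

    manager.register(done)  # register our event back with the manager
    threading.Thread(target=kernel, daemon=True).start()
    return done


manager = ToyOrderManager()

prior = threading.Event()   # models an earlier dpctl submission
manager.register(prior)
prior.set()

done = launch_external_kernel(manager)
manager.wait()              # now also waits for the external kernel
print(done.is_set())        # True
```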

I understand this is a lot, and I would appreciate any feedback. There is also a rich set of DPC++ extensions that could help make this robust and lightweight without explicit bookkeeping, and we would be more than happy to contribute if there is sufficient interest.
