
benchdnn: add GatedMLP driver #4951

Draft
kwieloch-intel wants to merge 3 commits into uxlfoundation:main from kwieloch-intel:benchdnn-gated-mlp

Conversation

@kwieloch-intel
Contributor

🚧 WIP — not ready for review

This PR adds a new benchdnn driver for the Gated MLP primitive, enabling correctness validation and performance benchmarking of the fused GatedMLP GPU kernel directly from the benchdnn application.

JIRA: MFDNN-14716


🔍 Problem description

oneDNN includes a fused GPU GatedMLP kernel (ocl:ref:any) but has no dedicated benchdnn driver to test it. Other primitives (matmul, softmax, eltwise, etc.) all have benchdnn drivers. This means:

  • No performance benchmarking through benchdnn's --mode=F/P infrastructure.
  • No correctness validation against a reference (only internal gtests).
  • No CI coverage for GatedMLP across data types, activations, and LLM-scale shapes.

💡 Proposed Solution

Implement a new benchdnn driver --gated_mlp that calls the dnnl_gated_mlp_primitive_desc_create() API and validates GPU output against a CPU reference. The driver implements the full GatedMLP operation:

$$\text{DST} = \left(\text{activation}(\text{SRC} \times W_{\text{gate}}) \odot (\text{SRC} \times W_{\text{up}})\right) \times W_{\text{down}}$$
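
The formula can be sketched in plain Python to make the dataflow concrete. This mirrors the step order of the reference (gate matmul, up matmul, activation, elementwise product, down matmul) but uses naive matmuls instead of oneDNN primitives; the swish definition x·sigmoid(x) (i.e. alpha = 1) is an assumption here, not taken from the PR:

```python
import math

def matmul(a, b):
    """Naive dense matmul: a is MxK, b is KxN."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def swish(x):
    """Assumed default activation: x * sigmoid(x) with alpha = 1."""
    return x * (1.0 / (1.0 + math.exp(-x)))

def gated_mlp_ref(src, w_gate, w_up, w_down, act=swish):
    """DST = (act(SRC x W_gate) * (SRC x W_up)) x W_down."""
    gate = matmul(src, w_gate)                # SRC x W_gate
    up = matmul(src, w_up)                    # SRC x W_up
    hidden = [[act(g) * u for g, u in zip(gr, ur)]
              for gr, ur in zip(gate, up)]    # elementwise gating
    return matmul(hidden, w_down)             # project back down
```

With identity weights, a 1x2 input [1, 0] produces [swish(1), 0], which is an easy sanity check against any reference implementation.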

  • Feature coverage:
    • Data types: f32, f16, bf16 with per-tensor broadcast (--dt=f16 or --dt=f16:f16:f16:f16:f32 for SRC, W_GATE, W_UP, W_DOWN, DST).
    • Activations: swish (default), gelu_erf, gelu_tanh.
    • Memory tags: --stag, --wtag (shared for all 3 weight tensors), --dtag.
    • Problem descriptor: MBxICxOC — all tensor shapes derived from 3 dimensions.
  • Reference (ref_gated_mlp.cpp) generates gold data by composing existing oneDNN primitives on the CPU:
    up   = matmul(SRC, W_up)           — oneDNN matmul primitive
    gate = matmul(SRC, W_gate)         — oneDNN matmul primitive
    gate = activation(gate)            — oneDNN eltwise primitive
    gate = gate ⊙ up                   — element-wise multiply
    DST  = matmul(gate, W_down)        — oneDNN matmul primitive
    
  • CI test suites (limited to single-shape per file due to known GPU implementation bug, see below):
    shapes_basic          — 6 shapes: 32x32x32 to 64x896x4864
    shapes_llm            — 4 real-world LLM shapes (e.g. 1024x3584x18944)
    test_gated_mlp_smoke  — f16 single shape
    test_gated_mlp_ci     — f32 single shape
    test_gated_mlp_gpu    — f16 LLM-scale shape
    
  • Scope is limited to tests/benchdnn; zero changes to src/. New files under tests/benchdnn/gated_mlp:
    gated_mlp.hpp        — Problem descriptor (prb_t), settings, enums, cfg, perf report
    gated_mlp.cpp        — Core driver: init_pd, fill_data, init_ref_memory_args, doit, setup_cmp
    ref_gated_mlp.cpp    — Reference via oneDNN matmul + eltwise primitives on CPU
    gated_mlp_aux.cpp    — String conversions, get_md, get_dt, set_repro_line
    bench_gated_mlp.cpp  — CLI parser with --activation knob
    cfg.cpp              — Fill ranges for SRC, WEI, DST data kinds
    
  • Documentation and integration updates in tests/benchdnn:
    benchdnn.cpp         — #include + --gated_mlp dispatcher
    parser.cpp           — Help text entry for --gated_mlp
    README.md            — Driver list entry
    doc/driver_gated_mlp.md — Full driver documentation
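
Since the MBxICxOC descriptor determines every tensor shape, the derivation can be sketched as below. The convention assumed here (IC as the model/input dimension, OC as the hidden dimension, with W_down projecting back to IC) is the usual LLM layout and makes the matmul chain in the formula type-check; it is an illustration, not confirmed by the PR text:

```python
def gated_mlp_shapes(mb, ic, oc):
    """Derive all five tensor shapes from the MBxICxOC problem descriptor.

    Assumed convention (hypothetical, for illustration): IC is the model
    dimension, OC the hidden dimension, and W_down maps OC back to IC.
    """
    return {
        "SRC":    (mb, ic),
        "W_GATE": (ic, oc),
        "W_UP":   (ic, oc),
        "W_DOWN": (oc, ic),
        "DST":    (mb, ic),
    }
```

For example, the CI shape 64x896x4864 yields a 64x896 SRC, two 896x4864 gate/up weights, a 4864x896 down weight, and a 64x896 DST under this convention.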
    

⚠️ Known GPU implementation bug

The GPU ocl:ref:any gated_mlp implementation has a bug: executing multiple primitives with different shapes in the same process produces incorrect results (NaN, garbage) or CL_INVALID_KERNEL_ARGS (errcode -52). Each shape works correctly in isolation. This is a bug in src/gpu/intel/gated_mlp/ref.hpp, not in this driver.

The CI test files are limited to one shape per file as a workaround. Full multi-shape test configurations are included as comments with a TODO to uncomment once the GPU implementation is fixed.


📈 Results

DG2 (Intel Arc A770)

Each test suite is run as a separate process (one shape per invocation) to avoid the GPU sequential execution bug.

| Test Suite | Total | Passed | Skipped | Failed |
|---|---|---|---|---|
| test_gated_mlp_smoke (f16 32x32x32) | 1 | 1 | 0 | 0 |
| test_gated_mlp_ci (f32 128x256x512) | 1 | 1 | 0 | 0 |
| test_gated_mlp_gpu (f16 64x896x4864) | 1 | 1 | 0 | 0 |
| Manual: f32 all activations | 3 | 3 | 0 | 0 |
| Manual: f16 all activations | 3 | 3 | 0 | 0 |
| Manual: bf16 all activations | 3 | 3 | 0 | 0 |
| Manual: f16 LLM shapes | 4 | 4 | 0 | 0 |
| Manual: bf16 LLM 1024x896x4864 | 1 | 1 | 0 | 0 |
Example repro commands:
# CI test suites
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_smoke
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_ci
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_gpu

# Manual single-shape tests (all pass)
benchdnn --gated_mlp --engine=gpu --dt=f32 --activation=swish 128x256x512
benchdnn --gated_mlp --engine=gpu --dt=f16 --activation=gelu_erf 64x128x256
benchdnn --gated_mlp --engine=gpu --dt=bf16 --activation=gelu_tanh 1024x896x4864

# Performance benchmark
benchdnn --mode=F --gated_mlp --engine=gpu --dt=f16 1024x896x4864

Pull Request Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

New features

  • Have you published an RFC for the new feature? (N/A — benchdnn driver only)
  • Was the RFC approved? (N/A)
  • Have you added relevant tests?
  • Have you added relevant documentation? (doc/driver_gated_mlp.md)

- Introduced Gated MLP driver with functionality for correctness checking and input verification.
- Implemented configuration handling for Gated MLP, including data type and activation function settings.
- Developed reference implementation for Gated MLP using existing oneDNN primitives (matmul, eltwise).
- Added auxiliary functions for activation string conversion and memory management.
- Created test cases for various input shapes, including basic, LLM-scale, and CI configurations.
@github-actions github-actions bot added the documentation and component:tests labels on Apr 3, 2026