
benchdnn: add GatedMLP driver #4951

Draft
kwieloch-intel wants to merge 3 commits into uxlfoundation:main from kwieloch-intel:benchdnn-gated-mlp

Conversation

@kwieloch-intel
Contributor

🚧 WIP — not ready for review

This PR adds a new benchdnn driver for the Gated MLP primitive, enabling correctness validation and performance benchmarking of the fused GatedMLP GPU kernel directly from the benchdnn application.

JIRA: MFDNN-14716


🔍 Problem description

oneDNN includes a fused GPU GatedMLP kernel (ocl:ref:any) but has no dedicated benchdnn driver to test it. Other primitives (matmul, softmax, eltwise, etc.) all have benchdnn drivers. This means:

  • No performance benchmarking through benchdnn's --mode=F/P infrastructure.
  • No correctness validation against a reference (only internal gtests).
  • No CI coverage for GatedMLP across data types, activations, and LLM-scale shapes.

💡 Proposed Solution

Implement a new benchdnn driver --gated_mlp that calls the dnnl_gated_mlp_primitive_desc_create() API and validates GPU output against a CPU reference. The driver implements the full GatedMLP operation:

$$\text{DST} = \left(\text{activation}(\text{SRC} \times W_{\text{gate}}) \odot (\text{SRC} \times W_{\text{up}})\right) \times W_{\text{down}}$$
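
The formula can be sketched in plain Python to make the dataflow concrete. This mirrors the step order of the reference (gate matmul, up matmul, activation, elementwise product, down matmul) but uses naive matmuls instead of oneDNN primitives; the swish definition x·sigmoid(x) (i.e. alpha = 1) is an assumption here, not taken from the PR:

```python
import math

def matmul(a, b):
    """Naive dense matmul: a is MxK, b is KxN."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def swish(x):
    """Assumed default activation: x * sigmoid(x) with alpha = 1."""
    return x * (1.0 / (1.0 + math.exp(-x)))

def gated_mlp_ref(src, w_gate, w_up, w_down, act=swish):
    """DST = (act(SRC x W_gate) * (SRC x W_up)) x W_down."""
    gate = matmul(src, w_gate)                # SRC x W_gate
    up = matmul(src, w_up)                    # SRC x W_up
    hidden = [[act(g) * u for g, u in zip(gr, ur)]
              for gr, ur in zip(gate, up)]    # elementwise gating
    return matmul(hidden, w_down)             # project back down
```

With identity weights, a 1x2 input [1, 0] produces [swish(1), 0], which is an easy sanity check against any reference implementation.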

  • Feature coverage:
    • Data types: f32, f16, bf16 with per-tensor broadcast (--dt=f16 or --dt=f16:f16:f16:f16:f32 for SRC, W_GATE, W_UP, W_DOWN, DST).
    • Activations: swish (default), gelu_erf, gelu_tanh.
    • Memory tags: --stag, --wtag (shared for all 3 weight tensors), --dtag.
    • Problem descriptor: MBxICxOC — all tensor shapes derived from 3 dimensions.
  • Reference (ref_gated_mlp.cpp) generates gold data by composing existing oneDNN primitives on the CPU:
    up   = matmul(SRC, W_up)           — oneDNN matmul primitive
    gate = matmul(SRC, W_gate)         — oneDNN matmul primitive
    gate = activation(gate)            — oneDNN eltwise primitive
    gate = gate ⊙ up                   — element-wise multiply
    DST  = matmul(gate, W_down)        — oneDNN matmul primitive
    
  • CI test suites (limited to single-shape per file due to known GPU implementation bug, see below):
    shapes_basic          — 6 shapes: 32x32x32 to 64x896x4864
    shapes_llm            — 4 real-world LLM shapes (e.g. 1024x3584x18944)
    test_gated_mlp_smoke  — f16 single shape
    test_gated_mlp_ci     — f32 single shape
    test_gated_mlp_gpu    — f16 LLM-scale shape
    
  • Scope is limited to tests/benchdnn; zero changes to src/. New files under tests/benchdnn/gated_mlp:
    gated_mlp.hpp        — Problem descriptor (prb_t), settings, enums, cfg, perf report
    gated_mlp.cpp        — Core driver: init_pd, fill_data, init_ref_memory_args, doit, setup_cmp
    ref_gated_mlp.cpp    — Reference via oneDNN matmul + eltwise primitives on CPU
    gated_mlp_aux.cpp    — String conversions, get_md, get_dt, set_repro_line
    bench_gated_mlp.cpp  — CLI parser with --activation knob
    cfg.cpp              — Fill ranges for SRC, WEI, DST data kinds
    
  • Documentation and integration updates in tests/benchdnn:
    benchdnn.cpp         — #include + --gated_mlp dispatcher
    parser.cpp           — Help text entry for --gated_mlp
    README.md            — Driver list entry
    doc/driver_gated_mlp.md — Full driver documentation
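
Since the MBxICxOC descriptor determines every tensor shape, the derivation can be sketched as below. The convention assumed here (IC as the model/input dimension, OC as the hidden dimension, with W_down projecting back to IC) is the usual LLM layout and makes the matmul chain in the formula type-check; it is an illustration, not confirmed by the PR text:

```python
def gated_mlp_shapes(mb, ic, oc):
    """Derive all five tensor shapes from the MBxICxOC problem descriptor.

    Assumed convention (hypothetical, for illustration): IC is the model
    dimension, OC the hidden dimension, and W_down maps OC back to IC.
    """
    return {
        "SRC":    (mb, ic),
        "W_GATE": (ic, oc),
        "W_UP":   (ic, oc),
        "W_DOWN": (oc, ic),
        "DST":    (mb, ic),
    }
```

For example, the CI shape 64x896x4864 yields a 64x896 SRC, two 896x4864 gate/up weights, a 4864x896 down weight, and a 64x896 DST under this convention.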
    

⚠️ Known GPU implementation bug

The GPU ocl:ref:any gated_mlp implementation has a bug: executing multiple primitives with different shapes in the same process produces incorrect results (NaN, garbage) or CL_INVALID_KERNEL_ARGS (errcode -52). Each shape works correctly in isolation. This is a bug in src/gpu/intel/gated_mlp/ref.hpp, not in this driver.

The CI test files are limited to one shape per file as a workaround. Full multi-shape test configurations are included as comments with a TODO to uncomment once the GPU implementation is fixed.


📈 Results

DG2 (Intel Arc A770)

Each test suite is run as a separate process (one shape per invocation) to avoid the GPU sequential execution bug.

| Test Suite | Total | Passed | Skipped | Failed |
|---|---|---|---|---|
| test_gated_mlp_smoke (f16 32x32x32) | 1 | 1 | 0 | 0 |
| test_gated_mlp_ci (f32 128x256x512) | 1 | 1 | 0 | 0 |
| test_gated_mlp_gpu (f16 64x896x4864) | 1 | 1 | 0 | 0 |
| Manual: f32 all activations | 3 | 3 | 0 | 0 |
| Manual: f16 all activations | 3 | 3 | 0 | 0 |
| Manual: bf16 all activations | 3 | 3 | 0 | 0 |
| Manual: f16 LLM shapes | 4 | 4 | 0 | 0 |
| Manual: bf16 LLM 1024x896x4864 | 1 | 1 | 0 | 0 |
Example repro commands:
# CI test suites
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_smoke
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_ci
benchdnn --gated_mlp --engine=gpu --batch=inputs/gated_mlp/test_gated_mlp_gpu

# Manual single-shape tests (all pass)
benchdnn --gated_mlp --engine=gpu --dt=f32 --activation=swish 128x256x512
benchdnn --gated_mlp --engine=gpu --dt=f16 --activation=gelu_erf 64x128x256
benchdnn --gated_mlp --engine=gpu --dt=bf16 --activation=gelu_tanh 1024x896x4864

# Performance benchmark
benchdnn --mode=F --gated_mlp --engine=gpu --dt=f16 1024x896x4864

Pull Request Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

New features

  • Have you published an RFC for the new feature? (N/A — benchdnn driver only)
  • Was the RFC approved? (N/A)
  • Have you added relevant tests?
  • Have you added relevant documentation? (doc/driver_gated_mlp.md)

- Introduced Gated MLP driver with functionality for correctness checking and input verification.
- Implemented configuration handling for Gated MLP, including data type and activation function settings.
- Developed reference implementation for Gated MLP using existing oneDNN primitives (matmul, eltwise).
- Added auxiliary functions for activation string conversion and memory management.
- Created test cases for various input shapes, including basic, LLM-scale, and CI configurations.
@github-actions github-actions bot added the documentation and component:tests labels on Apr 3, 2026