Add "32x1 transposed" variant to MXFP8 3D quantization kernel#4356
Add "32x1 transposed" variant to MXFP8 3D quantization kernel#4356alexsamardzic wants to merge 3 commits intogh/alexsamardzic/1/basefrom
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4356
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit e84da90 with merge base 6367fd6.
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Benchmarking results:
@alexsamardzic can you benchmark this against the 2-stage approach we do here: ao/torchao/prototype/moe_training/mxfp8_grouped_mm.py, lines 555 to 558 in 9052ece?
Here is an adapted benchmarking script to compare the two: bench_quantize_3d_vs_triton.py. And here are the results:
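For illustration only (not part of the PR): a minimal sketch of a CUDA-event timing harness along the lines of such a comparison. The callables `quantize_3d_cutedsl` and `quantize_3d_two_stage` are hypothetical placeholders for the two paths; the actual comparison lives in bench_quantize_3d_vs_triton.py.

```python
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    # Warm up first so compilation / autotuning does not pollute the timing.
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per call

E, N, K = 8, 8192, 5120
weight = torch.randn(E, N, K, dtype=torch.bfloat16, device="cuda")
# Hypothetical placeholders -- substitute the real entry points being compared:
# ms_cutedsl = bench_ms(quantize_3d_cutedsl, weight)
# ms_two_stage = bench_ms(quantize_3d_two_stage, weight)
```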
danielvegamyhre left a comment:
LGTM with some minor comments/questions
    x_clone = x.clone().requires_grad_(True)
    w_t_clone = w_t.clone().requires_grad_(True)

    fn = torch.compile(_to_mxfp8_then_scaled_grouped_mm, fullgraph=True)
Why was compile removed here?
Wrong edit, reverted.
    input_act = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
    weight = torch.randn(num_experts, N, K, dtype=torch.bfloat16, device="cuda")
    mat2 = weight.transpose(-2, -1)
    scale_block_k = 1
As an aside, in the future we should probably refactor to not use the "scale_block_n" / "scale_block_k" naming everywhere, since it leads to cases like this where we are quantizing the pre-transposed input shape (E, K, N) along K, yet setting scale_block_k=1 for API consistency.
Maybe in a follow-up we can rename them to scale_block_dim1, scale_block_dim2 or something?
Indeed, these are better names; I did the rename now.
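As a side illustration (hypothetical names only, not the actual torchao API): with per-dimension naming, the same call reads unambiguously whether the (E, N, K) weight or its (E, K, N) transpose is passed, e.g. the "32x1 transposed" variant from the PR title becomes a 32-element scale block along dim1 of the transposed tensor.

```python
import torch

def mx_scale_shape(x, scale_block_dim1, scale_block_dim2):
    # Hypothetical helper, not the torchao API: shape of the per-block scales
    # when every (scale_block_dim1 x scale_block_dim2) tile of the last two
    # dimensions of an (E, dim1, dim2) tensor shares a single scale.
    E, d1, d2 = x.shape
    return (E, d1 // scale_block_dim1, d2 // scale_block_dim2)

weight = torch.randn(4, 8192, 5120, dtype=torch.bfloat16)  # (E, N, K)
mat2 = weight.transpose(-2, -1)                            # (E, K, N)
# "32x1 transposed" variant: 32-element scale blocks along K of the transpose.
print(mx_scale_shape(mat2, scale_block_dim1=32, scale_block_dim2=1))
```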
        * x_scale.unsqueeze(-1).to(torch.bfloat16)
    ).reshape(M, K)

    input_scale_ref = input_scale.repeat_interleave(block_size, dim=1)
Why is the LHS input activation (scaled with 1x32) being repeat-interleaved here? I would think repeat_interleave would only be necessary for replicating the scale in a 32x32 weight-scaling reference impl?
This is just dequantization for the BF16 reference: the scales have shape (M, dim // 32), so we expand them to (M, dim).
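For illustration (placeholder names, not the PR's test code): a minimal sketch of how 1x32 block scales of shape (M, K // 32) are expanded back to (M, K) with repeat_interleave to build a bfloat16 dequantization reference.

```python
import torch

M, K, block_size = 128, 256, 32
x_q = torch.randint(-127, 128, (M, K), dtype=torch.int8)  # stand-in for the quantized data
x_scale = torch.rand(M, K // block_size)                  # one scale per 1x32 block

# Replicate each scale across its 32-element block, then dequantize.
scale_full = x_scale.repeat_interleave(block_size, dim=1)  # (M, K // 32) -> (M, K)
x_ref = x_q.to(torch.bfloat16) * scale_full.to(torch.bfloat16)
assert x_ref.shape == (M, K)
```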
    if cutlass.const_expr(INPUT_TRANSPOSED_VALUE):
        staged_layout_in = cute.make_layout(
            (STAGE_COUNT_VALUE, 1, TILE_N, TILE_K),
            stride=(STAGE_ELEMS, STAGE_ELEMS, 1, TILE_N),
CuTeDSL question: why are the "num stages" instances of TILE_N x TILE_K tiles represented as a 4D tensor of shape (stages, 1, tile_n, tile_k), rather than a 3D tensor of shape (stages, tile_n, tile_k)?
I mechanically used (stages, 1, tile_n, tile_k) to mirror the per-stage TMA tile shape (1, tile_n, tile_k). The singleton dim is indeed not needed at all, so it's removed now.
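Side note, for illustration only: a plain-Python check (independent of CuTeDSL) that the singleton dimension is redundant, since with the strides shown in the diff above both layouts map a (stage, n, k) coordinate to the same linear offset.

```python
# 4D layout: shape (stages, 1, tile_n, tile_k), stride (STAGE_ELEMS, STAGE_ELEMS, 1, TILE_N).
# The singleton dim's index is always 0, so it contributes nothing to the offset,
# and a 3D layout (stages, tile_n, tile_k) with stride (STAGE_ELEMS, 1, TILE_N)
# addresses exactly the same elements.
STAGE_COUNT, TILE_N, TILE_K = 4, 32, 128
STAGE_ELEMS = TILE_N * TILE_K

def offset_4d(s, n, k):
    return s * STAGE_ELEMS + 0 * STAGE_ELEMS + n * 1 + k * TILE_N

def offset_3d(s, n, k):
    return s * STAGE_ELEMS + n * 1 + k * TILE_N

assert all(
    offset_4d(s, n, k) == offset_3d(s, n, k)
    for s in range(STAGE_COUNT) for n in range(TILE_N) for k in range(TILE_K)
)
```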
        stride=(tile_n * tile_k, tile_k, 1),
    )
    def _make_tile_smem_layouts(
        cute,
Why does the cute package need to be a param here? Is it because the import is done with guards elsewhere instead of at the top? Not a huge deal, but this feels a bit awkward.
This is a remnant of the first attempt to make the code work when the CuTeDSL package is not installed; it's now replaced with a simple import within the body (the same change was also made for the analogous methods of the 2D kernels).
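For illustration only, a minimal sketch of the local-import pattern described above, assuming CuTeDSL (`cutlass.cute`) is an optional dependency; the helper name and layout arguments are placeholders reusing values from the diff.

```python
def _make_tile_smem_layouts_sketch(tile_n, tile_k, stage_count):
    # Importing inside the body keeps this module importable even when the
    # optional CuTeDSL package is not installed; the import only runs when a
    # kernel that actually needs these layouts is built.
    import cutlass.cute as cute

    return cute.make_layout(
        (stage_count, tile_n, tile_k),
        stride=(tile_n * tile_k, tile_k, 1),
    )
```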
Stack from ghstack (oldest at bottom):