Skip to content

[Refactor] Phaseout vmap for Tile Operators#1334

Merged
LeiWang1999 merged 20 commits intotile-ai:mainfrom
LeiWang1999:remove_vmap_1125
Nov 26, 2025
Merged

[Refactor] Phaseout vmap for Tile Operators#1334
LeiWang1999 merged 20 commits intotile-ai:mainfrom
LeiWang1999:remove_vmap_1125

Conversation

@LeiWang1999
Copy link
Copy Markdown
Member

@LeiWang1999 LeiWang1999 commented Nov 25, 2025

This pull request refactors several operator constructors in the codebase to simplify their interfaces and improve robustness. The main change is the removal of the BufferMap vmap parameter from constructors for operators such as AtomicAdd, Copy, Fill, Gemm, GemmPy, GemmSP, Conv2DIm2ColOp, and FinalizeReducerOp. Instead, buffer extraction now consistently uses the new NormalizeToBufferRegion utility, which accepts region-compatible expressions (like BufferLoad and BufferRegion) directly. This reduces ambiguity and makes operator construction more straightforward and reliable. Additionally, some legacy or unused code paths have been removed, and debug logging has been added to aid development.

Operator constructor refactoring

  • Removed the BufferMap vmap parameter from constructors for AtomicAdd, Copy, Fill, Gemm, GemmPy, GemmSP, Conv2DIm2ColOp, and FinalizeReducerOp, updating all usages to extract buffers and regions using NormalizeToBufferRegion and accepting only region-compatible expressions. [1] [2] [3] [4] [5] [6] [7] [8]
  • Updated header files to reflect the new constructor signatures without the BufferMap vmap parameter. [1] [2] [3] [4] [5] [6] [7] [8]

Code simplification and robustness

  • Removed legacy and unused code paths in Fill and other operators, enforcing stricter argument validation and error handling for unsupported argument types. [1] [2] [3]
  • Updated buffer extraction logic in operators to consistently use region-compatible expressions, eliminating ambiguous or indirect buffer lookups. [1] [2] [3] [4] [5] [6]

Utility and include changes

  • Replaced includes of region.h with utils.h in all relevant operator source files, reflecting the shift to the new utility functions for region normalization. [1] [2] [3] [4] [5] [6] [7]

Debugging improvements

  • Added additional debug logging to the Copy operator to print region and loop variable information, aiding development and troubleshooting. [1] [2]

These changes make operator construction more reliable, easier to understand, and less error-prone by standardizing buffer and region extraction and removing legacy code paths.

Summary by CodeRabbit

  • Refactor

    • Unified operators and TileLang intrinsics to a single buffer-region argument flow, simplifying region handling and operator parsing.
    • Operator construction and intrinsic argument handling now rely on normalized region inputs for consistent behavior.
  • New Features

    • Exposes an optional mbar/barrier property for GEMM paths and adds region-aware support for several intrinsics.
  • Chores

    • Improved validation/error messages for region cases; minor example and test updates (CUDA test guards, benchmarking flag rename, import formatting).

✏️ Tip: You can customize this high-level summary in your review settings.

… and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.
…izeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 25, 2025

Warning

Rate limit exceeded

@LeiWang1999 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 7 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 099aee6 and 83ff79b.

📒 Files selected for processing (1)
  • src/op/gemm_py.cc (2 hunks)

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Removed BufferMap/vmap from operator constructors and ParseOperator APIs; replaced vmap-based RegionOp lookups with NormalizeToBufferRegion(args[i]) and Python to_buffer_region/get_buffer_region_from_load flows; tl.region repurposed as a lightweight transport; many operators, transforms, intrinsics, and language helpers updated to use BufferRegion normalization.

Changes

Cohort / File(s) Summary
Operator constructors & parsing infra
src/op/operator.h, src/op/operator.cc
Removed BufferMap vmap from ParseOperator overloads and OpBuilderFunc; op builders now accept only Array<PrimExpr> args.
Region normalization & utils
src/op/utils.h, src/op/utils.cc, src/op/region.h, src/op/region.cc
NormalizeToBufferRegion loses vmap param; tl.region becomes lightweight transport with arg-packing (indices/access_mask/extents); tvm_access_ptr path removed; RegionOp registration/ctor updated.
Operator implementations (many)
src/op/*.{h,cc} — e.g. atomic_add.{h,cc}, copy.{h,cc}, fill.{h,cc}, reduce.{h,cc}, finalize_reducer.{h,cc}, gemm*.{h,cc}, copy.h/cc
Constructors now take only Array<PrimExpr> args; buffers/regions derived via NormalizeToBufferRegion(args[i]); new region fields (e.g., srcRegion_, dstRegion_, mbarRegion_, aRegion_, ...) added; removed vmap usage and region.h includes; some scalar members renamed with trailing underscores.
Transforms & lowering
src/transform/*.cclayout_inference.cc, layout_reducer.cc, lower_tile_op.cc
ParseOperator calls updated to omit vmap; buffer resolution extended to consult RegionOp/BufferLoad paths; reducer finalization and T.fill support BufferLoad/RegionOp-derived reducers.
TileLang Python language & utils
tilelang/utils/language.py, tilelang/language/utils.py, tilelang/language/{atomic,copy,fill,reduce,gemm,experimental/gemm_sp}.py
Consolidated region helpers to to_buffer_region / get_buffer_region_from_load; removed older helpers; many language ops now emit/consume tl.region or region-based args.
Intrinsics legalization
tilelang/intrinsics/*_macro_generator.py
Added _legalize_to_buffer_region helpers to convert Buffer/BufferLoad/BufferRegion to BufferRegion; replaced direct to_buffer_region calls in matrix-core emitters.
Examples & tests
examples/*, testing/python/issue/test_tilelang_issue_830.py
Minor example/test updates (bench flag rename, import edits) and added CUDA requirement + device setup in some tests.

Sequence Diagram(s)

sequenceDiagram
  participant Caller as CallSite(args)
  participant Parser as ParseOperator
  participant OpNode as OperatorNode
  participant Normalize as NormalizeToBufferRegion
  participant Region as tl.region / BufferRegion

  rect rgb(240,248,255)
  Note over Caller,Parser: Old flow (vmap-based)
  Caller->>Parser: call(args, vmap)
  Parser->>OpNode: Entry(args, vmap)
  OpNode->>Region: RegionOp(call->args, vmap)
  Region-->>OpNode: BufferRegion (via vmap)
  end

  rect rgb(245,255,240)
  Note over Caller,Parser: New flow (region-normalization)
  Caller->>Parser: call(args)
  Parser->>OpNode: Entry(args)
  OpNode->>Normalize: NormalizeToBufferRegion(arg_i)
  Normalize->>Region: tl.region(...) or return BufferLoad/BufferRegion
  Region-->>OpNode: BufferRegion (buffer + ranges)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas to focus review on:
    • NormalizeToBufferRegion semantics and edge cases (Ramp, extents alignment) and its C++ ↔ Python contract.
    • RegionOp arg-packing, Clone semantics, and new registration path.
    • Gemm/GemmPy/GemmSP mbar migration (mbarRegion_/mbar_) and lowering changes.
    • Transform passes: layout_inference, layout_reducer, lower_tile_op for correct buffer collection and reducer finalization.
    • Intrinsics legalization helpers and language helper interactions (to_buffer_region/get_buffer_region_from_load).

Possibly related PRs

Suggested reviewers

  • XuehaiPan
  • chengyupku

Poem

🐇 I hop from map to region bright,
Normalize finds ranges just right.
vmap left behind, I bound and steer,
tl.region carries buffers near.
Hopping on, TileLang sings — hooray, cheer! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.73% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Refactor] Phaseout vmap for Tile Operators' clearly and concisely summarizes the main objective of the changeset—removing the BufferMap vmap parameter from tile operator constructors.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/utils/language.py (1)

162-193: Ramp + extents assertion is inverted, and Buffer+extents lacks length validation.

There are two issues in the new region helpers:

  1. Ramp with extents will always assert (logic vs message mismatch).
if isinstance(indice, tir.Ramp):
    assert extents is None, "extents should be provided for BufferLoad with Ramp indices"
    regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
    found_ramp = True
  • The condition (extents is None) contradicts the message (“extents should be provided…”).
  • With the new API, callers that pass extents for vectorized loads (indices including tir.Ramp) will hit this assert even though the usage is reasonable.
  • Since indice.lanes already encodes the extent for a Ramp axis, the usual options are:
    • allow extents and ignore extents[i] for that axis, or
    • assert that extents[i] matches indice.lanes.

Consider something along these lines instead:

-    if isinstance(indice, tir.Ramp):
-        assert extents is None, "extents should be provided for BufferLoad with Ramp indices"
-        regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
-        found_ramp = True
+    if isinstance(indice, tir.Ramp):
+        # Extent is determined by the ramp lanes. If extents are provided,
+        # optionally sanity-check they match.
+        if extents is not None:
+            # Optional: only keep this if you want strict validation.
+            # assert ir.structural_equal(extents[i], indice.lanes), (
+            #     f"Extent for ramp dim {i} must match lanes, "
+            #     f"got {extents[i]} vs {indice.lanes}"
+            # )
+            pass
+        regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
+        found_ramp = True

This avoids spurious assertion failures while still letting you validate consistency if desired.

  1. to_buffer_region(Buffer, extents=...) should enforce rank/length consistency.
if isinstance(obj, tir.Buffer):
    mins = [tir.IntImm("int32", 0) for _ in obj.shape]
    if extents is None:
        extents = obj.shape
    ranges = [ir.Range.from_min_extent(m, e) for m, e in zip(mins, extents)]
    return tir.BufferRegion(obj, ranges)
  • If extents has a different length than obj.shape, zip silently truncates, yielding a region rank that doesn’t match the buffer rank. This can break downstream assumptions (e.g., code that expects len(region) == len(buf.shape)).
  • It mirrors the length check you already added in get_buffer_region_from_load.

Suggest adding an explicit assertion:

if isinstance(obj, tir.Buffer):
    mins = [tir.IntImm("int32", 0) for _ in obj.shape]
    if extents is None:
        extents = obj.shape
+    assert len(extents) == len(obj.shape), (
+        f"extents (len={len(extents)}) must match buffer rank "
+        f"(len={len(obj.shape)})"
+    )
    ranges = [ir.Range.from_min_extent(m, e) for m, e in zip(mins, extents)]
    return tir.BufferRegion(obj, ranges)

Together, these changes should make the new extents-aware path robust for both scalar and vectorized (Ramp) buffer loads, while preserving the previous fallback behavior when extents is omitted.

Also applies to: 196-224

🧹 Nitpick comments (9)
src/op/copy.h (1)

274-289: Constructors match implementation, but Doxygen still mentions vmap.

The new Copy(Array<PrimExpr> args) and Conv2DIm2ColOp(Array<PrimExpr> args) signatures are consistent with src/op/copy.cc and the vmap removal. However, nearby comments still describe a vmap parameter for both Copy and Conv2DIm2Col, which is now misleading.

Recommend updating the Doxygen blocks to remove vmap from the parameter list and briefly document the expected shapes/semantics of args instead.

Also applies to: 341-347

src/op/utils.cc (1)

15-50: NormalizeToBufferRegion logic looks good; avoid bare throw; for unreachable

The new normalization paths for BufferRegion and BufferLoad are reasonable and the explicit rejection of CallNode arguments matches the PR’s goal of banning region/access_ptr-style arguments here.

One nit: after the final LOG(FATAL), the bare throw; is undefined behavior outside a catch, even if it’s intended as “unreachable” and guarded by a fatal log. It’s safer to drop the throw; and rely on LOG(FATAL) (or replace with an explicit ICHECK(false) / compiler-specific unreachable hint) to satisfy compilers.

Example refactor:

-  LOG(FATAL) << "Unsupported argument for BufferRegion: " << arg;
-  throw; // Unreachable
+  LOG(FATAL) << "Unsupported argument for BufferRegion: " << arg;
+  // Unreachable: LOG(FATAL) terminates the process.
src/op/gemm_sp.cc (1)

14-18: GemmSP constructor correctly migrated to NormalizeToBufferRegion, but doc is stale

The constructor now resolves a_, e_, b_, and c_ via NormalizeToBufferRegion(args[i])->buffer, which is consistent with the new utils API and vmap-free design. Including "utils.h" is appropriate for this usage.

However, the doc comment above still talks about:

  • taking a BufferMap vmap parameter, and
  • looking up A/E/B/C “in vmap”.

This no longer reflects the code and could confuse future readers.

Consider updating the comment to describe the new behavior (args[0..3] must be BufferRegion/BufferLoad-compatible PrimExprs normalized via NormalizeToBufferRegion, no vmap parameter).

Also applies to: 66-110

tilelang/language/fill.py (1)

5-7: fill correctly delegates region normalization to to_buffer_region

Normalizing any Var-bound buffer object first, then calling to_buffer_region(buffer) before emitting tl.fill aligns this helper with the new shared region pipeline and removes ad-hoc conversion logic.

clear still works and can later be simplified to use to_buffer_region directly, but there’s no functional issue as-is.

Also applies to: 19-24

tilelang/language/utils.py (1)

7-48: Consider removing the unused access_type parameter.

The refactored functions now construct BufferRegion objects directly without tracking access masks, making the access_type parameter unused in all four functions (region, buffer_to_tile_region, buffer_load_to_tile_region, buffer_region_to_tile_region). If this parameter is not required for API compatibility with existing code, consider removing it to simplify the interface.

Otherwise, if the parameter must be retained for backward compatibility, the current implementation is correct.

Based on learnings, static analysis correctly identified the unused parameter.

src/op/fill.cc (1)

27-62: Docstring is out of sync with the implementation.

The docstring still references:

  1. vmap parameter (lines 55-56) which was removed from the constructor signature
  2. "Access pointer" handling and resolution via vmap[GetVarFromAccessPtr(args[0])] (lines 39-42) which is no longer supported

The docstring should be updated to reflect the new behavior that only accepts BufferRegion or BufferLoad as args[0].

src/op/atomic_add.cc (1)

25-43: Docstring references non-existent vmap parameter.

Line 38 states @param vmap Unused; kept for compatibility. but the vmap parameter has been removed from the constructor signature at line 44. This documentation should be removed.

- * @param vmap Unused; kept for compatibility.
- *
src/op/reduce.cc (1)

496-518: Remove commented-out dead code.

Lines 504-505 contain commented-out code from the old implementation that should be removed to keep the codebase clean.

   CHECK_EQ(args.size(), 4);
   ObjectPtr<CumSumOpNode> node = tvm::ffi::make_object<CumSumOpNode>();
-  // node->src = vmap[GetVarFromAccessPtr(args[0])];
-  // node->dst = vmap[GetVarFromAccessPtr(args[1])];
   node->srcRegion_ = NormalizeToBufferRegion(args[0]);
   node->dstRegion_ = NormalizeToBufferRegion(args[1]);
src/op/copy.cc (1)

110-131: Docstring references non-existent vmap parameter.

Line 130 states @param vmap Unused; kept for compatibility. but the vmap parameter has been removed from the constructor signature. This should be removed.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f34840 and 426ab58.

📒 Files selected for processing (31)
  • src/op/atomic_add.cc (3 hunks)
  • src/op/atomic_add.h (1 hunks)
  • src/op/copy.cc (5 hunks)
  • src/op/copy.h (2 hunks)
  • src/op/fill.cc (3 hunks)
  • src/op/fill.h (1 hunks)
  • src/op/finalize_reducer.cc (2 hunks)
  • src/op/finalize_reducer.h (1 hunks)
  • src/op/gemm.cc (2 hunks)
  • src/op/gemm.h (1 hunks)
  • src/op/gemm_py.cc (2 hunks)
  • src/op/gemm_py.h (1 hunks)
  • src/op/gemm_sp.cc (2 hunks)
  • src/op/gemm_sp.h (1 hunks)
  • src/op/operator.cc (2 hunks)
  • src/op/operator.h (2 hunks)
  • src/op/reduce.cc (3 hunks)
  • src/op/reduce.h (2 hunks)
  • src/op/region.cc (0 hunks)
  • src/op/region.h (0 hunks)
  • src/op/utils.cc (2 hunks)
  • src/op/utils.h (1 hunks)
  • src/transform/layout_inference.cc (2 hunks)
  • src/transform/layout_reducer.cc (2 hunks)
  • src/transform/lower_tile_op.cc (1 hunks)
  • tilelang/ir.py (0 hunks)
  • tilelang/language/atomic.py (2 hunks)
  • tilelang/language/copy.py (2 hunks)
  • tilelang/language/fill.py (2 hunks)
  • tilelang/language/utils.py (1 hunks)
  • tilelang/utils/language.py (3 hunks)
💤 Files with no reviewable changes (3)
  • tilelang/ir.py
  • src/op/region.h
  • src/op/region.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_tile_op.cc
🧬 Code graph analysis (25)
src/transform/layout_reducer.cc (2)
src/transform/layout_inference.cc (16)
  • op (43-49)
  • op (43-43)
  • op (433-494)
  • op (433-433)
  • op (531-556)
  • op (531-531)
  • op (558-609)
  • op (558-558)
  • op (611-620)
  • op (611-611)
  • op (622-650)
  • op (622-622)
  • op (652-680)
  • op (652-652)
  • op (890-902)
  • op (890-890)
src/op/operator.cc (2)
  • GetVarFromAccessPtr (74-81)
  • GetVarFromAccessPtr (74-74)
src/op/gemm.h (2)
src/op/gemm.cc (1)
  • Gemm (55-94)
tilelang/ir.py (1)
  • Gemm (43-44)
src/op/copy.h (2)
src/op/copy.cc (2)
  • Copy (132-156)
  • Conv2DIm2ColOp (1734-1747)
tilelang/ir.py (2)
  • Copy (20-21)
  • Conv2DIm2ColOp (25-26)
src/op/reduce.h (2)
src/op/reduce.cc (2)
  • ReduceOp (30-42)
  • CumSumOp (496-518)
tilelang/ir.py (2)
  • ReduceOp (63-64)
  • CumSumOp (68-69)
src/transform/lower_tile_op.cc (1)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/op/fill.h (2)
src/op/fill.cc (1)
  • Fill (63-121)
tilelang/ir.py (1)
  • Fill (10-11)
src/op/finalize_reducer.cc (1)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
src/op/gemm_sp.h (2)
src/op/gemm_sp.cc (1)
  • GemmSP (87-110)
tilelang/ir.py (1)
  • GemmSP (48-49)
src/op/operator.h (1)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
tilelang/language/utils.py (1)
tilelang/utils/language.py (1)
  • to_buffer_region (196-224)
src/op/gemm.cc (3)
src/op/gemm.h (1)
  • Gemm (144-149)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
src/transform/inject_ptx_async_copy.cc (2)
  • load (54-181)
  • load (54-56)
src/op/reduce.cc (3)
src/op/reduce.h (2)
  • ReduceOp (124-130)
  • CumSumOp (162-168)
tilelang/ir.py (2)
  • ReduceOp (63-64)
  • CumSumOp (68-69)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
src/op/gemm_py.h (2)
src/op/gemm_py.cc (1)
  • GemmPy (55-94)
tilelang/tileop/gemm/__init__.py (1)
  • GemmPy (56-192)
src/op/atomic_add.cc (3)
tilelang/ir.py (1)
  • AtomicAdd (15-16)
tilelang/language/utils.py (1)
  • region (7-17)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
src/op/fill.cc (3)
src/op/fill.h (1)
  • Fill (45-50)
tilelang/ir.py (1)
  • Fill (10-11)
tilelang/language/utils.py (1)
  • region (7-17)
tilelang/utils/language.py (2)
tilelang/language/ast/ir.py (1)
  • Range (1716-1728)
tilelang/language/utils.py (1)
  • region (7-17)
tilelang/language/fill.py (2)
tilelang/utils/language.py (1)
  • to_buffer_region (196-224)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
tilelang/language/atomic.py (1)
tilelang/utils/language.py (2)
  • to_buffer_region (196-224)
  • legalize_pairwise_extents (394-437)
src/op/gemm_sp.cc (2)
tilelang/ir.py (1)
  • GemmSP (48-49)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
src/op/finalize_reducer.h (2)
src/op/finalize_reducer.cc (1)
  • FinalizeReducerOp (36-47)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
src/op/utils.h (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
src/op/operator.cc (1)
src/transform/layout_inference.cc (16)
  • op (43-49)
  • op (43-43)
  • op (433-494)
  • op (433-433)
  • op (531-556)
  • op (531-531)
  • op (558-609)
  • op (558-558)
  • op (611-620)
  • op (611-611)
  • op (622-650)
  • op (622-622)
  • op (652-680)
  • op (652-652)
  • op (890-902)
  • op (890-890)
src/transform/layout_inference.cc (1)
src/op/operator.cc (4)
  • ParseOperator (32-41)
  • ParseOperator (32-32)
  • ParseOperator (55-61)
  • ParseOperator (55-55)
src/op/gemm_py.cc (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-50)
  • NormalizeToBufferRegion (15-15)
tilelang/language/copy.py (1)
tilelang/utils/language.py (1)
  • to_buffer_region (196-224)
🪛 Ruff (0.14.5)
tilelang/language/utils.py

7-7: Unused function argument: access_type

(ARG001)


20-20: Unused function argument: access_type

(ARG001)


25-25: Unused function argument: access_type

(ARG001)


35-35: Unused function argument: access_type

(ARG001)

tilelang/utils/language.py

189-189: Prefer TypeError exception for invalid type

(TRY004)


189-189: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (26)
src/op/gemm.h (1)

144-149: Gemm constructor header now matches implementation (vmap removed).

Gemm(Array<PrimExpr> args) aligns with the implementation in src/op/gemm.cc and the broader vmap removal; no issues at the header level.

src/op/reduce.h (1)

124-130: ReduceOp and CumSumOp constructor signatures are consistent with new region flow.

Both wrappers now take only Array<PrimExpr> args, matching src/op/reduce.cc and the NormalizeToBufferRegion-based construction; header surface is coherent with the new design.

Also applies to: 162-168

src/op/utils.h (1)

18-22: NormalizeToBufferRegion declaration matches implementation and clarified contract.

Header now correctly exposes NormalizeToBufferRegion(const PrimExpr &arg) and documents that only BufferRegion/BufferLoad-style arguments are supported (no tvm_access_ptr), which is consistent with src/op/utils.cc.

src/op/finalize_reducer.h (1)

47-53: FinalizeReducerOp constructor updated correctly to arg-only form.

The wrapper now exposes FinalizeReducerOp(Array<PrimExpr> args), matching src/op/finalize_reducer.cc and the BufferLoad-based reducer resolution; no header-level issues.

src/op/fill.h (1)

44-50: Fill wrapper constructor correctly updated to arg-only form.

Fill(Array<PrimExpr> args) matches the implementation in src/op/fill.cc and the new BufferRegion/BufferLoad-based destination handling; no further header changes needed.

src/op/atomic_add.h (1)

63-70: AtomicAdd constructor header aligns with refactored implementation.

AtomicAdd(Array<PrimExpr> args) is consistent with the updated implementation and NormalizeToBufferRegion-based region extraction; the public API surface looks correct.

src/op/gemm_py.h (1)

82-87: GemmPy constructor signature matches refactor direction

Header now declares GemmPy(Array<PrimExpr> args) in line with the updated implementation and the removal of vmap-based resolution. No issues from the header side.

src/transform/lower_tile_op.cc (1)

603-645: Evaluate visitor correctly updated to new ParseOperator API

Switching to ParseOperator(tvm::ffi::GetRef<Stmt>(op)) aligns this pass with the new vmap-free ParseOperator signature. Existing uses of buffer_data_to_buffer_ (e.g., Var visitor) remain intact, so there’s no functional gap here.

src/op/gemm_sp.h (1)

96-101: GemmSP header updated consistently with new constructor

The GemmSP(Array<PrimExpr> args) declaration matches the implementation and removal of vmap from the constructor. Header side is consistent and ready.

tilelang/language/atomic.py (1)

7-10: atomic_add now uses shared to_buffer_region path for tile-region ops

Routing both value and dst through to_buffer_region(..., *_extent) after legalize_pairwise_extents unifies region handling with the rest of the language utilities and keeps the scalar/extern path (dst_extent is None and src_extent is None) intact.

Broadcast and extent logic reuse legalize_pairwise_extents, matching the copy semantics, and return_prev/memory_order behaviors are preserved. This looks consistent and safe.

Also applies to: 164-218

tilelang/language/copy.py (1)

5-9: copy now relies on to_buffer_region for src/dst regions

After inferring and legalizing src_extent/dst_extent, converting both operands via to_buffer_region(src, src_extent) and to_buffer_region(dst, dst_extent) removes bespoke region logic and reuses the shared utility. The BufferLoad→BufferStore fast path remains intact, and broadcasting semantics are unchanged.

This is a clean alignment with the new region-handling design.

Also applies to: 72-83

src/transform/layout_inference.cc (2)

439-439: LGTM: ParseOperator call updated correctly.

The call to ParseOperator now uses only the Call reference, consistent with the new signature that removes the BufferMap vmap parameter.


497-499: LGTM: Simplified buffer extraction from BufferLoad.

The prioritization of BufferLoadNode for direct buffer extraction is correct and aligns with the PR's goal to remove vmap-based indirect lookups.

src/op/operator.cc (2)

32-40: LGTM: ParseOperator signature and builder invocation updated correctly.

The function now invokes the builder with only call->args, consistent with the updated OpBuilderFunc type that no longer accepts a BufferMap parameter.


55-60: LGTM: Statement-based ParseOperator updated correctly.

The delegation to the Call-based overload is consistent with the removal of the vmap parameter.

src/transform/layout_reducer.cc (2)

279-299: LGTM: Dual-path reducer tracking added.

The addition of BufferLoadNode handling alongside the legacy tvm_access_ptr path correctly supports the transition to direct BufferLoad-based region handling while maintaining backward compatibility.


303-314: LGTM: FinalizeReducerOp handling correctly updated.

The var derivation logic correctly handles both BufferLoadNode and access_ptr cases, ensuring proper pairing with the T.fill tracking logic above.

src/op/gemm_py.cc (1)

55-94: LGTM: GemmPy constructor correctly refactored.

The constructor now uses NormalizeToBufferRegion directly on arguments without requiring a BufferMap, and correctly handles the optional mbar buffer via BufferLoadNode detection.

src/op/finalize_reducer.cc (1)

36-47: LGTM: FinalizeReducerOp constructor correctly updated.

The constructor now derives the reducer buffer directly from a BufferLoadNode, with a clear error message for unsupported argument types. This aligns with the broader shift away from vmap-based resolution.

src/op/gemm.cc (1)

55-94: LGTM: Gemm constructor correctly refactored.

The constructor follows the same correct pattern as GemmPy, using NormalizeToBufferRegion directly without requiring a BufferMap, and properly handling the optional mbar buffer.

src/op/operator.h (1)

75-88: LGTM: Public API signatures correctly updated.

The header declarations and type aliases are consistent with the implementation changes, properly removing the BufferMap parameter from the operator parsing and building interfaces.

src/op/fill.cc (1)

63-92: LGTM!

The constructor correctly handles the three cases (BufferRegion, BufferLoad, and unsupported) and properly extracts buffer/region information using the new simplified interface without vmap.

src/op/atomic_add.cc (1)

44-66: LGTM!

The constructor correctly uses NormalizeToBufferRegion to extract buffer regions from the first two arguments and properly handles the optional parameters for TMA, memory order, and coalesced width.

src/op/reduce.cc (1)

30-42: LGTM!

The ReduceOp constructor correctly uses NormalizeToBufferRegion to extract buffer regions and properly parses the remaining arguments for reduce type, dimension, and clear flag.

src/op/copy.cc (2)

132-156: LGTM!

The Copy constructor correctly uses NormalizeToBufferRegion to extract buffer regions from the first two arguments and properly handles the optional parameters for coalesced width, disable TMA, and eviction policy.


1734-1747: LGTM!

The Conv2DIm2ColOp constructor correctly uses NormalizeToBufferRegion to extract buffers from the first two arguments, aligning with the broader refactor pattern in this PR.

Comment thread src/op/copy.cc Outdated
Comment thread src/op/copy.cc Outdated
@SiriusNEO SiriusNEO self-requested a review November 25, 2025 07:25
Comment thread src/op/atomic_add.cc Outdated
* - args[1] is the destination region call,
* - args[2] (optional) is an IntImm specifying coalesced width.
* @param vmap Mapping from buffers used by RegionOp to concrete Buffer objects.
* @param vmap Unused; kept for compatibility.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't keep it?

Comment thread src/op/copy.cc Outdated
Comment thread src/op/gemm_sp.cc Outdated
Comment thread src/op/utils.h
Comment thread tilelang/language/utils.py Outdated
Comment thread tilelang/language/utils.py Outdated
…ss various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/gemv/example_gemv.py (1)

337-359: Keyword argument mismatch (bench_ref vs do_bench) will crash at runtime

check_correctness_and_bench is defined with parameter bench_ref (line 337), but all six call sites pass do_bench=do_bench (lines 353–358). Python raises TypeError: check_correctness_and_bench() got an unexpected keyword argument 'do_bench', preventing the example from running.

The fix aligns the function signature and logic to match the new call sites:

-def check_correctness_and_bench(kernel, N, K, bench_ref=True):
-    profiler = kernel.get_profiler()
-    profiler.assert_allclose(lambda x, y: x @ y.T, atol=1e-2, rtol=1e-2)
-    if bench_ref:
-        latency = profiler.do_bench(lambda x, y: x @ y.T, warmup=50)
-        print(f"Torch Latency: {latency} ms")
-    latency = profiler.do_bench(kernel, warmup=50)
-    print(f"TileLang Latency: {latency} ms\n")
+def check_correctness_and_bench(kernel, N, K, do_bench: bool = True):
+    profiler = kernel.get_profiler()
+    # Always check numerical correctness
+    profiler.assert_allclose(lambda x, y: x @ y.T, atol=1e-2, rtol=1e-2)
+    if not do_bench:
+        return
+    latency = profiler.do_bench(lambda x, y: x @ y.T, warmup=50)
+    print(f"Torch Latency: {latency} ms")
+    latency = profiler.do_bench(kernel, warmup=50)
+    print(f"TileLang Latency: {latency} ms\n")
🧹 Nitpick comments (10)
examples/analyze/test_example_analyze.py (1)

13-15: Direct main-guard call is fine; consider whether both tests should run on manual execution

Switching from tilelang.testing.main() to calling test_example_gemm_analyze() directly is consistent with the PR’s pattern and keeps pytest behavior unchanged. If the intent when running this file as a script is to exercise both GEMM and CONV analyses, consider also calling test_example_conv_analyze() and/or removing the now-commented tilelang.testing.main() to avoid ambiguity.

examples/sparse_tensorcore/test_example_sparse_tensorcore.py (1)

12-14: Entry-point change is consistent; consider dropping the commented test runner

Directly calling test_tilelang_example_sparse_tensorcore() under __main__ is straightforward and still respects the CUDA-gating decorators. Once you’re confident you won’t revert to the old pattern, consider removing the commented tilelang.testing.main() line to keep the entry point uncluttered.

examples/deepseek_mla/test_example_mla_decode.py (1)

12-14: Direct call under __main__ is fine; optional cleanup of commented runner

The switch to calling test_example_mla_decode() directly under __main__ matches the rest of the PR and preserves pytest behavior. You might later remove the commented tilelang.testing.main() once the new pattern is fully adopted to avoid leaving dead code in the entry point.

tilelang/language/utils.py (1)

7-42: Address past review concern: relationship with to_buffer_region.

The past review comments ask why to_buffer_region isn't used here. These utility functions (buffer_to_tile_region, buffer_load_to_tile_region, buffer_region_to_tile_region) convert buffer-like objects to tl.region calls, which overlaps with to_buffer_region in tilelang/utils/language.py (lines 195-236).

Consider:

  • Consolidating these utilities into a single module to avoid duplication
  • Documenting the distinction if both are needed (e.g., access_type handling differences)
  • Deprecating one set if they're functionally equivalent

This will improve maintainability and reduce confusion about which utility to use.

Based on past review comments indicating this was already flagged.

tilelang/language/copy.py (1)

6-11: Clarify the distinction between region utility modules.

The file imports from two different utility modules:

  • to_buffer_region from tilelang.utils.language (line 6)
  • buffer_to_tile_region from tilelang.language.utils (line 11)

Both appear to handle buffer-to-region conversions. This dual import pattern (also seen in tilelang/language/utils.py review) suggests:

  1. Potential confusion: Developers may not know which utility to use when
  2. Possible duplication: The modules may have overlapping functionality

Consider consolidating these utilities or clearly documenting when to use each one.

src/op/finalize_reducer.cc (1)

22-35: Stale documentation: remove vmap references.

The docstring still describes vmap as a parameter (lines 26, 33-34), but the constructor signature was changed to only take Array<PrimExpr> args. The documentation should be updated to reflect the new signature and describe that the reducer is now extracted via NormalizeToBufferRegion.

 /**
  * @brief Construct a FinalizeReducerOp from TL operator arguments and a buffer
  * map.
  *
- * Extracts the reducer Buffer from `vmap` using the variable referenced by
- * `args[0]` and sets the reduction operation type from the integer code in
+ * Extracts the reducer Buffer from `args[0]` via NormalizeToBufferRegion
+ * and sets the reduction operation type from the integer code in
  * `args[1]`.
  *
  * @param args TL operator arguments: expects at least two elements where
- *             `args[0]` is an access pointer identifying the reducer variable
+ *             `args[0]` is a region-compatible expression (BufferLoad/BufferRegion/tl.region)
  * and `args[1]` is an integer encoding a `ReducerOpType` (e.g., Sum/Max/Min).
- * @param vmap Mapping from variables to Buffers used to look up the reducer
- * Buffer.
  */
src/op/fill.cc (1)

27-62: Stale documentation: update to reflect simplified signature.

The docstring describes behavior for multiple input cases (BufferLoad vs access pointer paths) and references the vmap parameter (lines 40-42, 55-56) which no longer exists. The documentation should be simplified to reflect that all inputs are now normalized via NormalizeToBufferRegion.

src/op/copy.cc (1)

130-131: Remove stale vmap reference from documentation.

Line 130 states "@param vmap Unused; kept for compatibility" but the constructor signature at line 132 shows Copy::Copy(Array<PrimExpr> args) with no vmap parameter. This comment should be removed.

- * @param vmap Unused; kept for compatibility.
  */
 Copy::Copy(Array<PrimExpr> args) {
tilelang/utils/language.py (2)

162-193: Logic issue: inconsistent behavior when extents is not None and Ramp indices are present.

At line 179, an assertion requires extents is None when encountering a Ramp index. However, this assertion occurs inside the loop after some indices may have already been processed with extents. If a mixed case occurs (e.g., first index is PrimExpr, second is Ramp), the function will partially build regions before failing.

Consider either:

  1. Validating all indices upfront before processing, or
  2. Clarifying in the docstring that mixed Ramp + extents is unsupported

Also, the static analysis hint suggests using TypeError at line 189 since the error is about an unsupported type:

         else:
-            raise ValueError(f"Unsupported type: {type(indice)} for index {i}")
+            raise TypeError(f"Unsupported index type: {type(indice)} at position {i}")

196-237: Approve with verification: to_buffer_region logic changes look reasonable.

The changes add:

  1. Let-binding resolution for tir.Var (lines 204-206)
  2. Extents-based region construction using _make_region_call

The logic correctly handles the different input types and the fallback behaviors. However, a few observations:

  1. Line 213: The assertion len(extents) == len(exts) will fail with an unclear message. Consider adding a descriptive error message.

  2. Lines 233-234: When padding extents with leading 1s for mismatched lengths, this assumes the caller intended right-alignment. This is a reasonable convention but could benefit from a brief inline comment.

  3. Line 237: Per static analysis, consider using TypeError instead of ValueError since this is a type mismatch:

-    raise ValueError(f"Unsupported argument type for to_buffer_region: {type(obj)}")
+    raise TypeError(f"Unsupported argument type for to_buffer_region: {type(obj)}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 426ab58 and a0a36d9.

📒 Files selected for processing (23)
  • examples/analyze/test_example_analyze.py (1 hunks)
  • examples/deepseek_mla/test_example_mla_decode.py (1 hunks)
  • examples/deepseek_v32/test_tilelang_example_deepseek_v32.py (1 hunks)
  • examples/gemv/example_gemv.py (1 hunks)
  • examples/gemv/test_example_gemv.py (0 hunks)
  • examples/sparse_tensorcore/test_example_sparse_tensorcore.py (1 hunks)
  • src/op/copy.cc (4 hunks)
  • src/op/fill.cc (2 hunks)
  • src/op/finalize_reducer.cc (2 hunks)
  • src/op/region.cc (3 hunks)
  • src/op/region.h (3 hunks)
  • src/op/utils.cc (2 hunks)
  • src/op/utils.h (1 hunks)
  • src/transform/layout_inference.cc (3 hunks)
  • src/transform/layout_reducer.cc (2 hunks)
  • testing/python/issue/test_tilelang_issue_830.py (1 hunks)
  • tilelang/language/copy.py (3 hunks)
  • tilelang/language/experimental/gemm_sp.py (2 hunks)
  • tilelang/language/fill.py (2 hunks)
  • tilelang/language/gemm.py (3 hunks)
  • tilelang/language/reduce.py (1 hunks)
  • tilelang/language/utils.py (1 hunks)
  • tilelang/utils/language.py (3 hunks)
💤 Files with no reviewable changes (1)
  • examples/gemv/test_example_gemv.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/op/utils.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_830.py
  • examples/sparse_tensorcore/test_example_sparse_tensorcore.py
🧬 Code graph analysis (16)
src/op/copy.cc (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
tilelang/language/experimental/gemm_sp.py (2)
tilelang/language/utils.py (1)
  • buffer_to_tile_region (13-17)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
src/op/fill.cc (6)
tilelang/ir.py (1)
  • Fill (10-11)
src/op/fill.h (1)
  • Fill (45-50)
src/runtime/runtime.cc (4)
  • args (47-77)
  • args (47-47)
  • args (131-166)
  • args (131-131)
tilelang/language/utils.py (1)
  • region (7-10)
src/transform/pipeline_planning.cc (4)
  • region (470-477)
  • region (470-470)
  • region (480-487)
  • region (480-480)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
tilelang/language/utils.py (2)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
tilelang/language/ast/ir.py (2)
  • buffer (93-161)
  • min (1578-1594)
src/op/finalize_reducer.cc (4)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
tilelang/language/utils.py (1)
  • region (7-10)
src/transform/pipeline_planning.cc (4)
  • region (470-477)
  • region (470-470)
  • region (480-487)
  • region (480-480)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
tilelang/utils/language.py (2)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/frame.py (2)
  • has_let_value (188-197)
  • get_let_value (200-209)
tilelang/language/fill.py (2)
tilelang/utils/language.py (2)
  • get_buffer_region_from_load (162-193)
  • to_buffer_region (196-237)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/reduce.py (1)
tilelang/language/utils.py (1)
  • buffer_to_tile_region (13-17)
src/op/utils.cc (2)
src/op/region.cc (2)
  • Get (78-81)
  • Get (78-78)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/copy.py (2)
tilelang/utils/language.py (2)
  • to_buffer_region (196-237)
  • get_buffer_region_from_load (162-193)
tilelang/language/utils.py (1)
  • buffer_to_tile_region (13-17)
src/transform/layout_reducer.cc (2)
src/op/region.cc (2)
  • Get (78-81)
  • Get (78-78)
src/op/operator.cc (2)
  • GetVarFromAccessPtr (74-81)
  • GetVarFromAccessPtr (74-74)
src/op/region.h (1)
src/op/region.cc (5)
  • Lower (70-72)
  • Lower (70-70)
  • InferLayout (74-76)
  • InferLayout (74-74)
  • RegionOp (21-53)
tilelang/language/gemm.py (1)
tilelang/language/utils.py (2)
  • buffer_region_to_tile_region (30-42)
  • region (7-10)
src/transform/layout_inference.cc (1)
src/op/region.cc (2)
  • Get (78-81)
  • Get (78-78)
examples/gemv/example_gemv.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (217-279)
src/op/region.cc (2)
tilelang/ir.py (1)
  • RegionOp (73-74)
tilelang/layout/layout.py (1)
  • index (46-55)
🪛 Ruff (0.14.5)
tilelang/utils/language.py

189-189: Prefer TypeError exception for invalid type

(TRY004)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


237-237: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (25)
examples/deepseek_v32/test_tilelang_example_deepseek_v32.py (1)

43-45: Manual entry now runs only the sparse MLA backward test

Calling test_example_sparse_mla_bwd() directly under __main__ is coherent with the new pattern and keeps pytest-based discovery intact. Just be aware that running this file as a script will no longer exercise the forward/pipelined variants unless explicitly invoked or run via the test runner.

src/op/region.h (1)

1-82: Excellent documentation of the design rationale.

The extensive comments (lines 5-24) clearly explain why tl.region is needed as a transport mechanism instead of passing BufferRegion directly through FFI. The explanation of the Ramp/lanes limitation and the need to encode dynamic extents is valuable for maintainers.

The updated constructor signature (line 80) and metadata exposure via RegisterReflection (lines 62-68) align well with the vmap removal refactor.

tilelang/language/reduce.py (1)

326-326: LGTM: Clean migration to region-based API.

The change from reducer.access_ptr("w") to buffer_to_tile_region(reducer, "w") is consistent with the PR's objective to standardize buffer/region handling. The "w" (write) access type is correctly preserved.

tilelang/language/experimental/gemm_sp.py (1)

66-77: LGTM: Proper region-based argument construction.

The refactor correctly:

  • Creates region descriptors using buffer_to_tile_region with appropriate access types (lines 67-70)
  • Passes these descriptors to the intrinsic instead of raw pointers (lines 74-77)
  • Uses "r" (read) for inputs A_sparse, E, B and "rw" (read-write) for accumulator C

This aligns with the vmap removal objectives and maintains correct access semantics.

tilelang/language/copy.py (2)

74-75: LGTM: Centralized region conversion logic.

Replacing the inline _to_region helper with the centralized to_buffer_region utility improves maintainability. The function correctly accepts both the buffer-like object and the legalized extents, ensuring consistent region construction across source and destination.


115-118: LGTM: Consistent region descriptor construction for im2col.

The refactor correctly uses buffer_to_tile_region to create region descriptors for img (read) and col (write), replacing the previous region handling. The access types are appropriately specified.

tilelang/language/fill.py (2)

23-36: Improved clarity through delegation to centralized utility.

The refactor successfully simplifies the fill implementation by delegating region construction to to_buffer_region. The extents-gathering logic cleanly handles Buffer, BufferRegion, and BufferLoad cases.


23-36: Remove the review comment - the concern about silent failures is factually incorrect.

The code does not silently fail for unsupported buffer types. When an unsupported buffer type is passed:

  1. The else clause sets extents = [] (line 34)
  2. This is passed to to_buffer_region(buffer, [])
  3. to_buffer_region raises an explicit ValueError with message "Unsupported argument type for to_buffer_region: {type(obj)}" (line 237 in tilelang/utils/language.py)

The empty extents fallback only affects truly unsupported types—not the normal code path. All supported buffer types (Buffer, BufferRegion, BufferLoad) set non-empty extents values directly. The error handling is explicit, not silent.

Likely an incorrect or invalid review comment.

src/transform/layout_inference.cc (2)

440-447: LGTM: Proper integration of region-based buffer resolution.

The refactor correctly:

  1. Removes the GetBufferMap() argument from ParseOperator (line 440), aligning with vmap removal
  2. Adds a fallback to getBufferFromRegion (lines 445-446) to handle RegionOp-wrapped buffers
  3. Ensures both traditional access_ptr and new region-based buffers are collected in the use-list

This maintains buffer tracking completeness during the transition to region-based handling.


500-536: Well-structured buffer extraction from multiple representations.

The updated buffer extraction logic:

  • Lines 500-502: Short-circuits to return BufferLoad's buffer directly, avoiding unnecessary lookups
  • Lines 526-536: New getBufferFromRegion method cleanly extracts buffers from tl.region calls by unwrapping the BufferLoad argument

This dual-path approach (access_ptr and region) ensures backward compatibility while supporting the new region-based operators.

src/op/utils.cc (1)

15-53: All breaking changes have been properly updated.

The verification confirms that:

  1. Signature migration is complete: All 14 call sites across the operator implementations (finalize_reducer, fill, reduce, gemm, copy, atomic_add, gemm_py, gemm_sp) invoke NormalizeToBufferRegion with only the single arg parameter—no lingering multi-parameter calls with vmap.

  2. Access pointer handling migration is complete: The old vmap-based tvm_access_ptr extraction path has been removed (evident from commented-out code in reduce.cc using GetVarFromAccessPtr). The only remaining tvm_access_ptr references are in BuildAccessPtrFromRegion(), which is the inverse operation—correctly building access pointers from regions rather than parsing them.

The header documentation explicitly notes that tvm_access_ptr is no longer supported in NormalizeToBufferRegion, confirming the intentional removal of that code path.

src/transform/layout_reducer.cc (2)

316-339: FinalizeReducerOp variable extraction looks correct but has a potential edge case.

The updated logic correctly handles three paths:

  1. Direct BufferLoadNode (lines 319-320)
  2. RegionOp wrapping a BufferLoadNode (lines 321-327)
  3. Legacy GetVarFromAccessPtr fallback (lines 328-332)

However, the else branch at line 331-332 assumes that if arg[0] is a Call but not a RegionOp, it must be a valid tvm_access_ptr call. If a different call type is passed, GetVarFromAccessPtr will fail with an ICHECK. This is acceptable defensive behavior but consider whether a more descriptive error message would help debugging.


307-315: LGTM!

The new direct BufferLoadNode handling correctly mirrors the existing paths for RegionOp and tvm_access_ptr. The logic properly extracts the buffer variable and populates inside_reducer_range_ when the buffer is a registered reducer.

src/op/finalize_reducer.cc (1)

36-45: LGTM!

The constructor correctly uses NormalizeToBufferRegion to extract the buffer from args[0]. The implementation is clean and consistent with the PR's refactoring pattern.

src/op/fill.cc (1)

63-97: LGTM!

The refactored constructor is significantly cleaner. Using NormalizeToBufferRegion consolidates all the previous conditional branches into a single path while maintaining proper validation. The bound-checking logic (lines 76-95) correctly validates region dimensions and static bounds.

src/op/copy.cc (3)

132-156: LGTM!

The Copy constructor refactoring is clean and correct. The loop at lines 136-140 efficiently normalizes both source and destination arguments using NormalizeToBufferRegion. The optional argument handling (coalesced_width, disable_tma, eviction_policy) remains unchanged and correct.


1726-1739: LGTM!

The Conv2DIm2ColOp constructor correctly uses NormalizeToBufferRegion to extract source and destination buffers. The extraction of convolution parameters (nhw_step, c_step, kernel, stride, dilation, padding, eviction_policy) remains unchanged.


250-250: Minor: whitespace-only change.

Line 250 is marked as changed but appears to be whitespace-only (trailing space removal or similar). No functional impact.

tilelang/language/gemm.py (3)

55-62: LGTM!

The normalization of A, B, C to BufferRegion objects via to_buffer_region provides a consistent interface for extracting shape, stride, and offset information. This aligns with the PR's objective to standardize buffer/region handling.


95-102: LGTM!

The conversion workflow is correct:

  1. Extract C coordinates from the region's min values (line 95)
  2. Convert each BufferRegion to a tl.region call with appropriate access types (lines 97-99)
  3. Pass converted arguments to the intrinsic (lines 100-102)

The access types ("r" for A/B, "rw" for C) are appropriate for GEMM semantics.


87-92: LGTM!

The offset extraction correctly uses retrieve_offset on the region objects and validates the expected constraints on matrix dimension offsets.

src/op/region.cc (3)

21-53: LGTM with minor observation on extent validation.

The constructor logic correctly:

  1. Extracts the BufferLoad from args[0]
  2. Validates index count matches ndim
  3. Handles both Ramp and non-Ramp indices appropriately
  4. Rebuilds ranges with proper min/extent pairs

The Ramp handling (lines 33-43) correctly validates stride-1 and checks lanes match the provided extent when both are constants. However, the check at lines 37-42 only validates when both are IntImm. If one is dynamic and one is static, they could mismatch at runtime without warning.

This is acceptable given the PR's intent to standardize region handling, but consider adding a comment noting this is a best-effort static check.


70-76: Verify: Lower returns no-op statement.

Lower returns Evaluate(0) which is essentially a no-op. This is likely intentional since the RegionOp is described as a "bridge to carry BufferRegion via Call args" and doesn't generate actual code. Please confirm this is the expected behavior.


83-88: LGTM: Registration pattern follows established conventions.

The registration correctly:

  1. Sets TScriptPrinterName for script output
  2. Provides the TLOpBuilder lambda for constructing from args
  3. Uses -1 for variable input count
  4. Marks the effect kind as kPure (appropriate since it doesn't modify state)
tilelang/utils/language.py (1)

204-206: Verify: Potential infinite loop or missed resolution for nested let-bindings.

If a tir.Var resolves to another tir.Var via get_let_value, the code only resolves one level. If nested let-bindings exist (var1 → var2 → actual_value), this would fail to fully resolve.

Comment thread testing/python/issue/test_tilelang_issue_830.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/op/region.cc (1)

48-52: Potential crash: unchecked dereference of as_const_int result.

as_const_int(args[1]) returns an optional/nullable result. If args[1] is not a constant integer (e.g., a symbolic expression), dereferencing will cause undefined behavior or a crash.

  ObjectPtr<RegionOpNode> node = tvm::ffi::make_object<RegionOpNode>();
  node->buffer_ = load->buffer;
- node->access_mask_ = static_cast<int>(*as_const_int(args[1]));
+ auto access_mask_opt = as_const_int(args[1]);
+ ICHECK(access_mask_opt) << "RegionOp: args[1] (access_mask) must be a constant integer, got: "
+                         << args[1];
+ node->access_mask_ = static_cast<int>(*access_mask_opt);
  node->ranges_ = ranges;
  data_ = std::move(node);
🧹 Nitpick comments (2)
src/op/region.h (1)

31-32: Include order: local include before library include.

Consider reordering includes to follow the common convention of placing library/system includes before local includes, unless this project has established a different convention.

-#include "./operator.h"
-#include <tvm/tir/buffer.h>
+#include <tvm/tir/buffer.h>
+
+#include "./operator.h"
src/op/region.cc (1)

24-27: Add descriptive error message to ICHECK.

The ICHECK(load) on line 25 will produce a generic assertion failure. Adding context would help debugging when the invariant is violated.

  auto load = args[0].as<BufferLoadNode>();
- ICHECK(load);
+ ICHECK(load) << "RegionOp: args[0] must be a BufferLoad, got: " << args[0];
  ICHECK(load->indices.size() == ndim)
      << "load->indices.size() = " << load->indices << " ndim = " << ndim;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0a36d9 and 0b0cfaa.

📒 Files selected for processing (4)
  • examples/gemv/example_gemv.py (1 hunks)
  • examples/gemv/test_example_gemv.py (1 hunks)
  • src/op/region.cc (3 hunks)
  • src/op/region.h (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/gemv/test_example_gemv.py
  • examples/gemv/example_gemv.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/op/region.h (2)
src/op/region.cc (1)
  • RegionOp (21-53)
tilelang/ir.py (1)
  • RegionOp (73-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (3)
src/op/region.h (1)

1-86: LGTM! Well-documented header with clear API.

The documentation thoroughly explains why tl.region exists as a transport bridge rather than passing BufferRegion directly. The API is clean with:

  • Clear constructor argument mapping documented (args[0]=BufferLoad, args[1]=access mask, args[2+i]=extents)
  • Read-only reflection exposure for buffer, ranges, and access_mask
  • Appropriate accessor methods
src/op/region.cc (2)

33-46: Partial validation of Ramp lanes vs. provided extent.

The check on lines 37-42 only validates when both ramp->lanes and extent are IntImm. If one is constant and the other is symbolic, a mismatch won't be detected, but the code always uses ramp->lanes (line 43), silently ignoring the provided extent.

Consider adding a comment explaining this is intentional (Ramp is authoritative), or extending validation to handle mixed constant/symbolic cases.

      if (const auto *lanes_imm = ramp->lanes.as<IntImmNode>()) {
        if (const auto *ext_imm = extent.as<IntImmNode>()) {
          ICHECK_EQ(lanes_imm->value, ext_imm->value)
              << "Ramp lanes and provided extent must match";
        }
+       // When extent is symbolic but ramp->lanes is constant, we trust ramp->lanes.
+       // This is intentional: Ramp is the authoritative source for extent.
      }

55-58: LGTM! Clean implementation.

  • Clone() correctly creates a copy of the node
  • IsFullRegion() properly checks if ranges cover the entire buffer
  • Lower() and InferLayout() return no-op/empty as expected for a transport-only op
  • Op registration with TVM_REGISTER_OP is properly configured with pure call effect

Also applies to: 70-77, 79-94

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/gemv/example_gemv.py (2)

337-345: Simplify check_correctness_and_bench signature; N and K are unused

N and K are never used inside this helper, which is what Ruff (ARG001) is flagging. Since all call sites are local to this file, you can safely drop them and annotate the flag:

-def check_correctness_and_bench(kernel, N, K, do_bench=True):
+def check_correctness_and_bench(kernel, do_bench: bool = True):

If you prefer to keep the dimensions for future use, alternatively rename them to _N / _K to silence the warning.


353-359: Align check_correctness_and_bench call sites with the simplified signature

If you drop N and K from the helper’s signature, the calls here can also be simplified:

-    check_correctness_and_bench(naive_gemv(N, K, 128, 128), N, K, do_bench=do_bench)
-    check_correctness_and_bench(naive_splitk_gemv(N, K, 32, 32), N, K, do_bench=do_bench)
-    check_correctness_and_bench(splitk_gemv(N, K, 32, 32, 32), N, K, do_bench=do_bench)
-    check_correctness_and_bench(splitk_gemv_vectorized(N, K, 2, 32), N, K, do_bench=do_bench)
-    check_correctness_and_bench(splitk_gemv_vectorized_tvm(N, K, 2, 32), N, K, do_bench=do_bench)
-    check_correctness_and_bench(
-        gemv_alloc_reducer(N, K, block_M=128, block_N=128), N, K, do_bench=do_bench)
+    check_correctness_and_bench(naive_gemv(N, K, 128, 128), do_bench=do_bench)
+    check_correctness_and_bench(naive_splitk_gemv(N, K, 32, 32), do_bench=do_bench)
+    check_correctness_and_bench(splitk_gemv(N, K, 32, 32, 32), do_bench=do_bench)
+    check_correctness_and_bench(splitk_gemv_vectorized(N, K, 2, 32), do_bench=do_bench)
+    check_correctness_and_bench(splitk_gemv_vectorized_tvm(N, K, 2, 32), do_bench=do_bench)
+    check_correctness_and_bench(
+        gemv_alloc_reducer(N, K, block_M=128, block_N=128), do_bench=do_bench)

This keeps the example code a bit cleaner and resolves the lint warning without changing behavior.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0cfaa and c24f46e.

📒 Files selected for processing (1)
  • examples/gemv/example_gemv.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/gemv/example_gemv.py (2)
tilelang/profiler/__init__.py (2)
  • do_bench (217-279)
  • assert_allclose (75-144)
tilelang/jit/kernel.py (1)
  • get_profiler (405-421)
🪛 Ruff (0.14.5)
examples/gemv/example_gemv.py

337-337: Unused function argument: N

(ARG001)


337-337: Unused function argument: K

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)

@LeiWang1999
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/op/gemm.cc Outdated
Comment on lines 85 to 88
node->mbarPtr_ = args[16];
if (node->mbarPtr_.as<CallNode>()) {
node->mbar_ = vmap[GetVarFromAccessPtr(node->mbarPtr_)];
if (const auto *load = node->mbarPtr_.as<BufferLoadNode>()) {
node->mbar_ = load->buffer;
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse mbar barrier handle from access_ptr calls

The new GEMM constructor only populates mbar_ when args[16] is a BufferLoad. The Python APIs still pass the barrier pointer as buffer.access_ptr(...) (a tvm_access_ptr Call from retrieve_ptr), so mbar_ is now left nullopt even when the caller supplies mbar. The TCGEN5MMA lowering later does ICHECK(mbar_.has_value()) and now fails for any GEMM invocation that provides an mbarrier through the documented API. The constructor needs to decode access_ptr calls (or otherwise preserve the buffer) instead of silently dropping them.

Useful? React with 👍 / 👎.

Comment thread src/op/gemm_py.cc Outdated
Comment on lines 85 to 89
node->mbarPtr_ = args[16];
if (node->mbarPtr_.as<CallNode>()) {
node->mbar_ = vmap[GetVarFromAccessPtr(node->mbarPtr_)];
if (const auto *load = node->mbarPtr_.as<BufferLoadNode>()) {
node->mbar_ = load->buffer;
} else {
node->mbar_ = std::nullopt;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve mbar argument in GemmPy when passed as access_ptr

GemmPy now only recognizes args[16] as a barrier buffer when it is a BufferLoad. The Python wrapper still passes buffer.access_ptr(...) for mbar, so node->mbar_ stays empty and TCGEN5 MMA lowering will trip the same ICHECK(mbar_.has_value()) despite the caller providing a barrier buffer. The constructor should continue to handle access_ptr-style calls rather than treating them as absent.

Useful? React with 👍 / 👎.

…ng `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/op/gemm.h (1)

90-103: Clarify the mbarRegion_ and mbar_ invariant with a comment.

Based on the search results, mbarRegion_ is used only internally in src/op/gemm.cc for lowering (MakeAccessPtrFromRegion) and is not exposed to external Python/FFI consumers. No reflection exposure is needed.

However, the coupling between mbar_ and mbarRegion_ is worth documenting. Since line 87 in src/op/gemm.cc shows node->mbar_ = node->mbarRegion_->buffer;, they are always synchronized. Consider adding a brief comment near the declaration (around line 100 in src/op/gemm.h) to clarify the invariant—for example: "mbarRegion_ holds the region; mbar_ is its buffer (null when no mbar operand provided)". This helps future maintainers avoid accidental desynchronization.

src/op/copy.cc (1)

131-155: Add validation for NormalizeToBufferRegion results.

The constructor directly dereferences the result of NormalizeToBufferRegion (lines 136–138) without checking whether it returns a valid BufferRegion. If args[i] is malformed or unsupported, this could lead to a null-pointer dereference or segmentation fault.

Add checks after calling NormalizeToBufferRegion:

 Copy::Copy(Array<PrimExpr> args) {
   ObjectPtr<CopyNode> node = tvm::ffi::make_object<CopyNode>();
   Array<Range> rgs[2];
   Buffer bf[2];
   for (int i = 0; i < 2; i++) {
     auto region = NormalizeToBufferRegion(args[i]);
+    ICHECK(region.defined()) << "Failed to normalize args[" << i << "] to BufferRegion";
     rgs[i] = region->region;
     bf[i] = region->buffer;
   }
src/op/atomic_add.cc (1)

25-64: Update constructor to ensure use_tma is always initialized and fix IntImm syntax

Your review identifies critical issues that are confirmed by code inspection:

  1. Null-dereference risk (Critical): The Lower method accesses use_tma->value unconditionally (if (use_tma->value != 0)), but the constructor only initializes use_tma when args.size() >= 3. Since AtomicAddNode declares IntImm use_tma; with no default initializer, this risks null dereference if a caller provides fewer than 3 arguments.

  2. IntImm syntax error (Critical): All other IntImm usages in the codebase follow the pattern IntImm(DataType::Int(32), value) with explicit dtype. The line node->memory_order = IntImm(0); appears to be incorrect syntax and likely will not compile.

  3. Documentation out of sync (Major): The Doxygen comment still describes args[2] as coalesced width, when the actual implementation interprets args[2] as use_tma, args[3] as memory_order, and args[4] as coalesced_width.

  4. NormalizeToBufferRegion description: The comment should mention that region-compatible expressions include tl.region() calls, not just BufferLoad/BufferRegion.

Recommended fixes:

  • Initialize use_tma with a default value (e.g., IntImm(DataType::Int(32), 0)) unconditionally in the constructor
  • Correct the IntImm syntax for memory_order to include the DataType argument
  • Update the Doxygen comment to match the actual args order and accepted expression types
♻️ Duplicate comments (3)
testing/python/issue/test_tilelang_issue_830.py (1)

28-28: Hardcoded device 0 remains unaddressed.

The hardcoded torch.cuda.set_device(0) can fail if device 0 is unavailable or busy in multi-GPU setups. The previous review suggested using a more flexible approach such as torch.cuda.current_device() or checking device availability before setting the device.

Consider applying the suggested approach from the previous review:

-    torch.cuda.set_device(0)
+    device = torch.cuda.current_device() if torch.cuda.is_available() else 0
+    torch.cuda.set_device(device)
src/op/gemm_py.cc (1)

83-85: Confirm mbar handling matches Python frontend; still only recognizes BufferLoad

mbarPtr_ is now always set to args[16], and mbar_ is populated only when that argument is a BufferLoadNode; otherwise mbar_ is set to std::nullopt. This is the same BufferLoad‑only behavior that an earlier review comment flagged: if the Python wrapper still passes buffer.access_ptr(...) for mbar, this path will again leave mbar_ empty and the downstream ICHECK(mbar_.has_value()) in the tcgen5 MMA lowering will still fire even though the user provided a barrier.

Please verify that the Python/tilelang side has been updated so that mbar is passed as a BufferLoad (or another form that you handle here), or extend this constructor to recognize the access_ptr call pattern and recover the underlying buffer for mbar_ as before.

tilelang/language/utils.py (1)

7-20: Code duplication with tilelang/utils/language.py.

The functions region(), buffer_load_to_tile_region(), and buffer_region_to_tile_region() duplicate functionality from to_buffer_region() in tilelang/utils/language.py. The latter is more comprehensive and already handles Buffer, BufferLoad, BufferRegion, and tir.Var inputs with extent-based paths.

Consider consolidating these helpers into a single location to avoid divergence. Based on past review comments asking "Why don't we use to_buffer_region here?", this duplication appears intentional but may lead to maintenance issues.

🧹 Nitpick comments (9)
src/op/finalize_reducer.cc (1)

22-33: Docstring is stale after vmap removal.

The docstring still references "a buffer map" and extracting from vmap, but the constructor signature no longer accepts a BufferMap parameter. Update to reflect the new normalization-based approach.

 /**
- * @brief Construct a FinalizeReducerOp from TL operator arguments and a buffer
- * map.
+ * @brief Construct a FinalizeReducerOp from TL operator arguments.
  *
- * Extracts the reducer Buffer from `vmap` using the variable referenced by
- * `args[0]` and sets the reduction operation type from the integer code in
- * `args[1]`.
+ * Normalizes `args[0]` (BufferRegion/BufferLoad/tl.region) to a BufferRegion
+ * and extracts the underlying Buffer as the reducer. Sets the reduction
+ * operation type from the integer code in `args[1]`.
  *
  * @param args TL operator arguments: expects at least two elements where
- *             `args[0]` is an access pointer identifying the reducer variable
- * and `args[1]` is an integer encoding a `ReducerOpType` (e.g., Sum/Max/Min).
+ *             `args[0]` is a region expression identifying the reducer buffer
+ *             and `args[1]` is an integer encoding a `ReducerOpType` (e.g.,
+ *             Sum/Max/Min).
  */
src/op/copy.h (1)

299-308: Consider conventional naming for public fields.

The trailing underscore suffix (e.g., src_, stride_, padding_) is unconventional for public data members. Most C++ style guides (including Google's) reserve trailing underscores for private or protected members. Since these fields are public, consider either:

  1. Removing the trailing underscores (e.g., src, stride, padding), or
  2. Making these fields private/protected and providing accessors if needed.

This applies to: srcRegion_, dstRegion_, src_, dst_, stride_, padding_, dilation_, kernel_, eviction_policy_, nhw_step_, c_step_.

src/op/gemm_sp.cc (1)

17-17: GemmSP constructor region normalization looks correct; docstring still mentions vmap/buffer map

The new GemmSP::GemmSP(Array<PrimExpr> args) wiring via NormalizeToBufferRegion(args[0..3]) and then assigning a_/e_/b_/c_ from the corresponding BufferRegion::buffer is consistent with the ordering documented earlier (A, E, B, C) and matches the broader vmap‑free pattern used elsewhere. The added #include "utils.h" is the right dependency for NormalizeToBufferRegion.

The block comment above the constructor (Lines 66–85), however, still describes construction “from TL call arguments and a buffer map” and references vmap lookup semantics, which are no longer part of the signature or implementation. Consider updating that doc to describe the region‑based contract (allowed forms: BufferRegion, BufferLoad, tl.region(...)) and to remove references to BufferMap vmap so future readers don’t assume a second parameter exists.

Also applies to: 86-95

src/op/gemm_py.cc (1)

27-52: Update GemmPy constructor docs to match vmap‑free, region‑based implementation

The implementation now takes only Array<PrimExpr> args and resolves A/B/C via NormalizeToBufferRegion(args[0..2]), which is aligned with the shared utility and the rest of the PR.

The docblock above still talks about “a buffer map”, vmap‑based resolution, and a (args, vmap) constructor that no longer exists. It would be good to refresh that comment to:

  • Drop references to BufferMap vmap.
  • Explicitly document that args[0..2] must be region‑compatible expressions (BufferRegion, BufferLoad, or tl.region(...)) and that buffers are taken from those regions.
  • Keep the index layout description for the scalar parameters as is (since that part is still accurate).

This keeps the public contract in sync with the refactored API.

Also applies to: 53-58

tilelang/language/utils.py (1)

16-19: Consider handling the case where extents exceeds indices.

The current logic handles len(indices) > len(extents) by prepending 1s to extents, but silently passes through when len(extents) > len(indices). The assertion will fail in that case, but an explicit check with a clearer error message would improve debuggability.

     if len(indices) > len(extents):
         extents = [tir.IntImm("int32", 1) for _ in range(len(indices) - len(extents))
                   ] + list(extents)
-    assert len(indices) == len(extents), f"indices = {indices}, extents = {extents}"
+    assert len(indices) == len(extents), (
+        f"extents length ({len(extents)}) exceeds indices length ({len(indices)}): "
+        f"indices = {indices}, extents = {extents}"
+    )
tilelang/language/gemm.py (2)

13-14: Mixed import sources for region utilities.

The code imports to_buffer_region from tilelang.utils.language (line 7) and buffer_region_to_tile_region from tilelang.language.utils (line 14). Both modules provide region handling utilities. Consider consolidating to reduce confusion about which module to use for region operations.


96-98: Simplify list comprehensions.

The list comprehensions [r for r in A_shape] are equivalent to list(A_shape). Consider simplifying for clarity.

-    A_arg = buffer_region_to_tile_region(A_region, "r", [r for r in A_shape])
-    B_arg = buffer_region_to_tile_region(B_region, "r", [r for r in B_shape])
-    C_arg = buffer_region_to_tile_region(C_region, "rw", [r for r in C_shape])
+    A_arg = buffer_region_to_tile_region(A_region, "r", list(A_shape))
+    B_arg = buffer_region_to_tile_region(B_region, "r", list(B_shape))
+    C_arg = buffer_region_to_tile_region(C_region, "rw", list(C_shape))
tilelang/utils/language.py (2)

3-3: Circular-like dependency between utility modules.

tilelang/utils/language.py imports from tilelang/language/utils.py, while gemm.py imports from both. Consider consolidating region utilities into a single canonical location to simplify the dependency graph.


188-189: Use TypeError for invalid type.

Per static analysis, TypeError is more appropriate when the issue is an unsupported type.

         else:
-            raise ValueError(f"Unsupported type: {type(indice)} for index {i}")
+            raise TypeError(f"Unsupported index type: {type(indice)} at position {i}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c24f46e and 86146a7.

📒 Files selected for processing (20)
  • examples/deepseek_mla/test_example_mla_decode.py (0 hunks)
  • src/op/atomic_add.cc (3 hunks)
  • src/op/copy.cc (8 hunks)
  • src/op/copy.h (3 hunks)
  • src/op/fill.cc (2 hunks)
  • src/op/finalize_reducer.cc (2 hunks)
  • src/op/gemm.cc (3 hunks)
  • src/op/gemm.h (2 hunks)
  • src/op/gemm_py.cc (2 hunks)
  • src/op/gemm_sp.cc (2 hunks)
  • src/op/gemm_sp.h (3 hunks)
  • src/op/operator.cc (2 hunks)
  • src/op/utils.h (1 hunks)
  • testing/python/issue/test_tilelang_issue_830.py (1 hunks)
  • tilelang/language/copy.py (3 hunks)
  • tilelang/language/experimental/gemm_sp.py (2 hunks)
  • tilelang/language/gemm.py (3 hunks)
  • tilelang/language/reduce.py (8 hunks)
  • tilelang/language/utils.py (1 hunks)
  • tilelang/utils/language.py (3 hunks)
💤 Files with no reviewable changes (1)
  • examples/deepseek_mla/test_example_mla_decode.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/op/fill.cc
  • tilelang/language/experimental/gemm_sp.py
  • tilelang/language/copy.py
  • src/op/operator.cc
  • src/op/utils.h
  • src/op/gemm.cc
  • tilelang/language/reduce.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_830.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/op/copy.cc
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/op/copy.cc
🧬 Code graph analysis (10)
src/op/gemm_py.cc (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
src/op/finalize_reducer.cc (4)
tilelang/ir.py (1)
  • FinalizeReducerOp (53-54)
tilelang/language/utils.py (1)
  • region (7-10)
src/transform/pipeline_planning.cc (4)
  • region (470-477)
  • region (470-470)
  • region (480-487)
  • region (480-480)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
tilelang/utils/language.py (2)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/frame.py (2)
  • has_let_value (188-197)
  • get_let_value (200-209)
src/op/gemm_sp.h (2)
src/op/gemm_sp.cc (1)
  • GemmSP (86-113)
tilelang/ir.py (1)
  • GemmSP (48-49)
tilelang/language/gemm.py (1)
tilelang/language/utils.py (2)
  • buffer_region_to_tile_region (23-35)
  • region (7-10)
src/op/atomic_add.cc (4)
tilelang/ir.py (1)
  • AtomicAdd (15-16)
src/transform/atomicadd_vectorize.cc (4)
  • node (186-210)
  • node (186-186)
  • node (212-269)
  • node (212-212)
tilelang/language/utils.py (1)
  • region (7-10)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
src/op/copy.h (2)
src/op/copy.cc (2)
  • Copy (131-155)
  • Conv2DIm2ColOp (1724-1739)
tilelang/ir.py (2)
  • Copy (20-21)
  • Conv2DIm2ColOp (25-26)
tilelang/language/utils.py (3)
src/op/region.h (1)
  • tvm (34-89)
tilelang/language/ast/ir.py (2)
  • buffer (93-161)
  • min (1578-1594)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
src/op/gemm_sp.cc (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
src/op/gemm.h (2)
src/op/gemm.cc (1)
  • Gemm (53-95)
tilelang/ir.py (1)
  • Gemm (43-44)
🪛 Ruff (0.14.5)
tilelang/utils/language.py

189-189: Prefer TypeError exception for invalid type

(TRY004)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


238-238: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (16)
testing/python/issue/test_tilelang_issue_830.py (1)

20-20: Good addition of CUDA requirement decorator.

The @tilelang.testing.requires_cuda decorator properly ensures this test only runs in CUDA-enabled environments, preventing unclear failures when CUDA is unavailable.

src/op/finalize_reducer.cc (2)

14-15: LGTM!

The addition of utils.h for NormalizeToBufferRegion aligns with the refactoring to remove vmap-based buffer lookups.


34-43: LGTM! The refactored buffer extraction is clean.

The normalization via NormalizeToBufferRegion correctly handles BufferRegion, BufferLoad, and tl.region expressions, and the error handling is delegated to the utility function which fatals on unsupported cases.

src/op/gemm.h (1)

144-149: Gemm constructor API change looks consistent with implementation.

The Gemm(Array<PrimExpr> args) signature matches the implementation in src/op/gemm.cc and aligns with the PR’s goal of removing BufferMap vmap from operator constructors. No header/impl mismatch or obvious correctness issues here.

src/op/copy.cc (1)

19-19: LGTM—include updated to reflect utility-based normalization.

The shift from region.h to utils.h aligns with the broader refactor to use NormalizeToBufferRegion.

src/op/copy.h (2)

283-283: LGTM—constructor signature updated consistently.

The removal of the vmap parameter aligns with the PR objectives and matches the implementation in copy.cc.


349-349: LGTM—constructor signature updated consistently.

The removal of the vmap parameter matches the implementation in copy.cc.

src/op/gemm_sp.h (2)

56-56: Region fields and reflection on GemmSPNode align with new normalization flow

Adding BufferRegion aRegion_, bRegion_, cRegion_, eRegion_ and exposing them via def_ro("aRegion"/"bRegion"/"cRegion"/"eRegion", ...) matches the new NormalizeToBufferRegion‑based constructor and gives frontends useful, structured access to the exact regions used by GemmSP. This looks consistent with the rest of the node’s exposed state and should make debugging/layout tooling easier.

Also applies to: 79-82


104-104: GemmSP constructor declaration matches implementation and vmap‑free API

The header’s TVM_DLL GemmSP(Array<PrimExpr> args); now matches the implementation in gemm_sp.cc and the PR’s goal of removing the BufferMap vmap parameter. Assuming all callers (e.g., Python/tilelang bindings) have been updated to drop the vmap argument, this API change is coherent.

tilelang/language/utils.py (3)

23-35: LGTM!

The clamping logic correctly ensures extents don't exceed the region's bounds, and the fallback to region_extents[i] for trailing dimensions is appropriate.


38-57: LGTM!

Standard row-major index decomposition algorithm. The implementation correctly iterates from the last dimension and reverses the result.


103-106: The linear index implementation is correct. Verification against the documented examples confirms the formula linear = linear * stride + idx produces the expected results. For instance, tracing linear_index(i, j, k, stride_j, stride_k) yields i*stride_j*stride_k + j*stride_k + k, which matches the docstring example exactly. The nested multiplication is the standard and correct approach for row-major indexing with cumulative strides.

tilelang/language/gemm.py (1)

104-163: LGTM!

The public wrappers maintain a clean API while delegating to the unified _gemm_impl. The environment-based version selection provides flexibility for experimentation.

tilelang/utils/language.py (1)

196-238: LGTM with note on dual return type.

The function correctly handles the conversion paths. The dual return type (PrimExpr | BufferRegion) based on whether extents is provided is well-documented in the docstring.

src/op/atomic_add.cc (2)

7-14: Include switch to utils.h for buffer-region normalization looks correct

Including "utils.h" here matches the new use of NormalizeToBufferRegion and aligns with the shared op-utilities pattern across the PR. No issues from this change alone.


542-550: Trivial namespace-closing change is fine

The closing namespaces remain correct and consistent; nothing to change here.

Comment thread src/op/copy.cc
Comment on lines +1724 to 1738
Conv2DIm2ColOp::Conv2DIm2ColOp(Array<PrimExpr> args) {
ObjectPtr<Conv2DIm2ColOpNode> node =
tvm::ffi::make_object<Conv2DIm2ColOpNode>();
node->src = vmap[GetVarFromAccessPtr(args[0])];
node->dst = vmap[GetVarFromAccessPtr(args[1])];
node->nhw_step = args[2];
node->c_step = args[3];
node->kernel = args[4].as<IntImm>().value()->value;
node->stride = args[5].as<IntImm>().value()->value;
node->dilation = args[6].as<IntImm>().value()->value;
node->padding = args[7].as<IntImm>().value()->value;
node->eviction_policy = args[8].as<IntImm>().value()->value;
node->srcRegion_ = NormalizeToBufferRegion(args[0]);
node->dstRegion_ = NormalizeToBufferRegion(args[1]);
node->src_ = node->srcRegion_->buffer;
node->dst_ = node->dstRegion_->buffer;
node->nhw_step_ = args[2];
node->c_step_ = args[3];
node->kernel_ = args[4].as<IntImm>().value()->value;
node->stride_ = args[5].as<IntImm>().value()->value;
node->dilation_ = args[6].as<IntImm>().value()->value;
node->padding_ = args[7].as<IntImm>().value()->value;
node->eviction_policy_ = args[8].as<IntImm>().value()->value;
data_ = std::move(node);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add validation for region normalization and safe casting.

Two issues:

  1. Lines 1727–1728 directly dereference NormalizeToBufferRegion results without validation (same issue as the Copy constructor).
  2. Lines 1733–1737 use .as<IntImm>().value()->value without null checks. If any argument is not an IntImm, this will cause a null-pointer dereference.

Add validation and use safer casting:

 Conv2DIm2ColOp::Conv2DIm2ColOp(Array<PrimExpr> args) {
   ObjectPtr<Conv2DIm2ColOpNode> node =
       tvm::ffi::make_object<Conv2DIm2ColOpNode>();
   node->srcRegion_ = NormalizeToBufferRegion(args[0]);
+  ICHECK(node->srcRegion_.defined()) << "Failed to normalize source region";
   node->dstRegion_ = NormalizeToBufferRegion(args[1]);
+  ICHECK(node->dstRegion_.defined()) << "Failed to normalize destination region";
   node->src_ = node->srcRegion_->buffer;
   node->dst_ = node->dstRegion_->buffer;
   node->nhw_step_ = args[2];
   node->c_step_ = args[3];
-  node->kernel_ = args[4].as<IntImm>().value()->value;
-  node->stride_ = args[5].as<IntImm>().value()->value;
-  node->dilation_ = args[6].as<IntImm>().value()->value;
-  node->padding_ = args[7].as<IntImm>().value()->value;
-  node->eviction_policy_ = args[8].as<IntImm>().value()->value;
+  node->kernel_ = Downcast<IntImm>(args[4])->value;
+  node->stride_ = Downcast<IntImm>(args[5])->value;
+  node->dilation_ = Downcast<IntImm>(args[6])->value;
+  node->padding_ = Downcast<IntImm>(args[7])->value;
+  node->eviction_policy_ = Downcast<IntImm>(args[8])->value;
   data_ = std::move(node);
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/op/copy.cc around lines 1724–1738, the constructor dereferences the
results of NormalizeToBufferRegion and uses .as<IntImm>().value()->value without
any null checks; validate that NormalizeToBufferRegion(args[0]) and
NormalizeToBufferRegion(args[1]) return non-null regions before accessing
->buffer (return or raise a clear error if null), and for args[4]..args[8] fetch
each with auto int_imm = args[i].as<IntImm>(); check int_imm is non-null and
then use int_imm->value to assign node->kernel_, node->stride_, node->dilation_,
node->padding_, and node->eviction_policy_; ensure you fail fast with
informative error messages if any validation fails.

Comment thread tilelang/language/gemm.py
Comment on lines +93 to +94
mbar = to_buffer_region(mbar, access_type="rw") if mbar is not None else tir.const(0, "uint32")
C_coords = [r.min for r in C_region.region]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify mbar conversion type consistency.

When mbar is not None, to_buffer_region(mbar, access_type="rw") is called without extents, which returns a BufferRegion rather than a PrimExpr (tl.region call). When mbar is None, a tir.const(0, "uint32") is used. Verify that the intrinsic expects both types in the same argument position, or if mbar should be converted to a tile region call like A/B/C.


🏁 Script executed:

#!/bin/bash
# Check the tl.gemm intrinsic signature and mbar handling in C++ side
rg -n "mbar" --type cpp -C 5 src/op/gemm

Length of output: 51


🏁 Script executed:

#!/bin/bash
# Find C++ source files related to gemm
fd "gemm" --type f -e cpp -e h src/

Length of output: 505


🏁 Script executed:

#!/bin/bash
# Look at the full context around line 93-101 in gemm.py
cat tilelang/language/gemm.py | sed -n '85,105p'

Length of output: 1057


🏁 Script executed:

#!/bin/bash
# Find to_buffer_region function definition
rg -n "def to_buffer_region" --type py -A 10

Length of output: 884


🏁 Script executed:

#!/bin/bash
# Find where tl.gemm or the intrinsic is called/defined
rg -n "tl\.gemm|intrinsic.*gemm" --type py -C 3

Length of output: 3304


🏁 Script executed:

#!/bin/bash
# Get full to_buffer_region implementation
cat tilelang/utils/language.py | sed -n '196,260p'

Length of output: 3021


🏁 Script executed:

#!/bin/bash
# Find buffer_region_to_tile_region function
rg -n "def buffer_region_to_tile_region" --type py -A 15

Length of output: 1264


🏁 Script executed:

#!/bin/bash
# Check C++ gemm.h for intrinsic signature and mbar parameter
cat src/op/gemm.h | head -150

Length of output: 4575


🏁 Script executed:

#!/bin/bash
# Check gemm_sp.py for how it handles regions in intrinsic call
cat tilelang/language/experimental/gemm_sp.py | sed -n '60,90p'

Length of output: 819


🏁 Script executed:

#!/bin/bash
# Search for other calls to buffer_region_to_tile_region to understand pattern
rg -n "buffer_region_to_tile_region" --type py -B 2 -A 2

Length of output: 1356


🏁 Script executed:

#!/bin/bash
# Check if mbar should be converted like A/B/C or if BufferRegion is acceptable
rg -n "mbar" --type py -B 3 -A 3 tilelang/language/gemm.py

Length of output: 1852


Fix type inconsistency for mbar at line 93.

The two branches of the conditional return different types: when mbar is not None, to_buffer_region(mbar, access_type="rw") without extents returns a BufferRegion; when mbar is None, tir.const(0, "uint32") returns a PrimExpr. However, A_arg, B_arg, and C_arg (lines 96–98) all use buffer_region_to_tile_region, which returns PrimExpr for the intrinsic call. To maintain consistency, pass mbar's shape as extents to to_buffer_region:

mbar = to_buffer_region(mbar, access_type="rw", extents=[e for e in mbar.shape]) if mbar is not None else tir.const(0, "uint32")
🤖 Prompt for AI Agents
In tilelang/language/gemm.py around lines 93–94, the conditional sets mbar to a
BufferRegion in one branch and a PrimExpr in the other, causing a type mismatch
downstream; update the non-None branch to call to_buffer_region with extents
derived from mbar.shape (e.g., extents=[e for e in mbar.shape]) so both branches
produce a compatible PrimExpr-like representation for subsequent
buffer_region_to_tile_region usage.

Comment on lines 178 to 180
if isinstance(indice, tir.Ramp):
assert extents is None, "extents should be provided for BufferLoad with Ramp indices"
regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assertion message is inverted.

The assertion message states "extents should be provided for BufferLoad with Ramp indices" but the condition assert extents is None enforces the opposite—that extents must NOT be provided when Ramp indices are present.

         if isinstance(indice, tir.Ramp):
-            assert extents is None, "extents should be provided for BufferLoad with Ramp indices"
+            assert extents is None, "extents must not be provided for BufferLoad with Ramp indices"
             regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if isinstance(indice, tir.Ramp):
assert extents is None, "extents should be provided for BufferLoad with Ramp indices"
regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
if isinstance(indice, tir.Ramp):
assert extents is None, "extents must not be provided for BufferLoad with Ramp indices"
regions.append(ir.Range.from_min_extent(indice.base, indice.lanes))
🤖 Prompt for AI Agents
In tilelang/utils/language.py around lines 178 to 180, the assertion is
inverted: it currently asserts extents is None while the message says extents
should be provided for BufferLoad with Ramp indices; change the assertion to
assert extents is not None and keep/update the message to "extents should be
provided for BufferLoad with Ramp indices" so the condition matches the message
(i.e., require extents when indice is a tir.Ramp).

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tilelang/language/copy.py (1)

64-75: Update stale _to_region comment and verify BufferRegion broadcasting behavior

  • Line 65 still mentions _to_region, but the implementation now uses to_buffer_region. The comment should reflect the new helper to avoid confusion for future readers.
  • For BufferRegion inputs, to_buffer_region asserts that len(extents) == len(obj.region). If legalize_pairwise_extents ever changes the per‑operand rank during broadcasting (e.g., padding a shorter rank with leading 1s), this could trigger an assertion when src/dst are BufferRegions.

I recommend:

  • Updating the comment to reference to_buffer_region.
  • Double‑checking that legalize_pairwise_extents never alters the rank of src_extent/dst_extent relative to their underlying buffer/region rank, or otherwise ensuring that only rank‑compatible extents are passed for BufferRegion cases.
-    # for src and dst, so we can pass them unchanged into _to_region.
+    # for src and dst, so we can pass them unchanged into to_buffer_region.
tilelang/language/fill.py (1)

22-36: Extent derivation logic looks correct; consider explicit error on unsupported inputs

The new handling is consistent with to_buffer_region:

  • tir.Buffer → extents from buffer.shape.
  • tir.BufferRegion → extents from each Range.extent.
  • tir.BufferLoad → extents from get_buffer_region_from_load when available, otherwise per‑index extent 1.

This should produce well-formed tl.region descriptors for all supported cases. For the final else, you currently set extents = [] and rely on to_buffer_region to raise on unsupported types (e.g., a tir.Var without a let binding). For clearer error reporting and to avoid passing obviously invalid state downstream, you could raise directly in fill:

-    # Build tl.region as argument
-    if isinstance(buffer, tir.Buffer):
-        extents = list(buffer.shape)
-    elif isinstance(buffer, tir.BufferRegion):
-        extents = [r.extent for r in buffer.region]
-    elif isinstance(buffer, tir.BufferLoad):
-        region = get_buffer_region_from_load(buffer)
-        if region is not None:
-            extents = [r.extent for r in region.region]
-        else:
-            extents = [tir.IntImm("int32", 1) for _ in buffer.indices]
-    else:
-        extents = []
-    return tir.call_intrin("handle", tir.op.Op.get("tl.fill"),
-                           to_buffer_region(buffer, access_type="w", extents=extents), value)
+    # Build tl.region as argument
+    if isinstance(buffer, tir.Buffer):
+        extents = list(buffer.shape)
+    elif isinstance(buffer, tir.BufferRegion):
+        extents = [r.extent for r in buffer.region]
+    elif isinstance(buffer, tir.BufferLoad):
+        region = get_buffer_region_from_load(buffer)
+        if region is not None:
+            extents = [r.extent for r in region.region]
+        else:
+            extents = [tir.IntImm("int32", 1) for _ in buffer.indices]
+    else:
+        raise TypeError(
+            f"fill expects tir.Buffer, tir.BufferRegion, or tir.BufferLoad, "
+            f"but got {type(buffer)}"
+        )
+    return tir.call_intrin(
+        "handle",
+        tir.op.Op.get("tl.fill"),
+        to_buffer_region(buffer, access_type="w", extents=extents),
+        value,
+    )

This keeps the successful paths unchanged while surfacing invalid usages with a more precise message.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86146a7 and 313c2f1.

📒 Files selected for processing (3)
  • tilelang/language/atomic.py (2 hunks)
  • tilelang/language/copy.py (3 hunks)
  • tilelang/language/fill.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/language/atomic.py
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/language/fill.py (2)
tilelang/utils/language.py (2)
  • get_buffer_region_from_load (162-193)
  • to_buffer_region (196-238)
tilelang/language/utils.py (1)
  • region (7-10)
tilelang/language/copy.py (2)
tilelang/utils/language.py (1)
  • to_buffer_region (196-238)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (3)
tilelang/language/copy.py (2)

5-9: Import of to_buffer_region is consistent with new usage

The added to_buffer_region import is correctly wired up for both copy and c2d_im2col, and the remaining imports still match the helpers used in this file. No changes needed here.


110-117: Region conversion in c2d_im2col matches new intrinsic expectations

Using to_buffer_region(img, access_type="r") and to_buffer_region(col, access_type="w") before calling tl.c2d_im2col aligns with the new region‑normalized operator interface and keeps the Python side consistent with other ops. Given the current signature (both args are tir.Buffer), returning full‑buffer BufferRegions here is reasonable.

tilelang/language/fill.py (1)

5-5: Import consolidation aligns with region utilities

Importing both get_buffer_region_from_load and to_buffer_region from tilelang.utils.language matches their usage below and keeps region handling centralized; no issues here.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tilelang/intrinsics/mfma_macro_generator.py (1)

257-301: self._legalize_to_buffer_region is used but never defined on MatrixCoreIntrinEmitter.

ldmatrix_a and ldmatrix_b both call self._legalize_to_buffer_region(...), but this class (and its parents) do not define that method. At runtime this will raise an AttributeError as soon as these paths are executed.

You should either:

  • Add a _legalize_to_buffer_region helper on MatrixCoreIntrinEmitter (e.g. mirroring the implementation used in TensorCoreIntrinEmitter in mma_macro_generator.py), or
  • Switch these calls to a shared module-level helper that you import and call directly, e.g. _legalize_to_buffer_region(A_shared_buf).

A minimal fix in this file following the method approach could look like:

 class MatrixCoreIntrinEmitter:
@@
     def get_ldmatrix_index_map(self, is_b=False):
@@
         return index_map, reverse_index_map
@@
     def ldmatrix_b(self, B_local_buf, B_shared_buf: Buffer | BufferRegion, ki, rk=0):
@@
-        # legalize shared buffer to region
-        B_region = self._legalize_to_buffer_region(B_shared_buf)
+        # legalize shared buffer to region
+        B_region = self._legalize_to_buffer_region(B_shared_buf)
@@
         return _warp_ldmatrix_b(B_local_buf, B_shared_buf, ki, thread_binding, rk)
+
+    def _legalize_to_buffer_region(
+        self, obj: Buffer | BufferRegion
+    ) -> BufferRegion:
+        if isinstance(obj, BufferRegion):
+            return obj
+        if isinstance(obj, Buffer):
+            # Full-region BufferRegion covering entire buffer
+            mins = [0] * len(obj.shape)
+            ranges = [tvm.ir.Range.from_min_extent(m, e) for m, e in zip(mins, obj.shape)]
+            return BufferRegion(obj, ranges)
+        raise TypeError(f"Unsupported argument type for BufferRegion: {type(obj)}")

(You can extend this helper to support BufferLoad as in mma_macro_generator.py if needed.)

Also applies to: 303-352

tilelang/intrinsics/mma_macro_generator.py (1)

231-252: _legalize_to_buffer_region is defined without self but used as an instance method.

ldmatrix_a/ldmatrix_b call self._legalize_to_buffer_region(A_shared_buf/B_shared_buf), but the helper is defined as:

def _legalize_to_buffer_region(obj: Buffer | BufferLoad | BufferRegion) -> BufferRegion:
    ...

inside the class. As currently written, this is treated as a bound instance method, so calling self._legalize_to_buffer_region(x) will pass two positional arguments (implicit self + x) to a function that only expects one, causing a TypeError at runtime.

Make it an instance method by adding self as the first parameter (and keeping the body unchanged), e.g.:

-    def _legalize_to_buffer_region(obj: Buffer | BufferLoad | BufferRegion) -> BufferRegion:
+    def _legalize_to_buffer_region(
+        self, obj: Buffer | BufferLoad | BufferRegion
+    ) -> BufferRegion:

After this change, self._legalize_to_buffer_region(...) in all the call sites will bind correctly and the legalization logic will work as intended.

Also applies to: 274-305, 347-407, 803-827

🧹 Nitpick comments (2)
tilelang/utils/__init__.py (1)

5-19: Drop redundant # noqa: F401 for get_buffer_region_from_load.

Ruff reports this # noqa: F401 as unused on Line 18. Since get_buffer_region_from_load is now part of the public API and actually used elsewhere, the noqa is unnecessary and can be removed:

-    get_buffer_region_from_load,  # noqa: F401
+    get_buffer_region_from_load,

This keeps the file aligned with the linter configuration.

tilelang/intrinsics/mma_macro_generator.py (1)

819-827: Optional: simplify the error raise to satisfy Ruff TRY003.

Ruff flags the long formatted message in the final raise ValueError(...). If you want to appease TRY003 without adding a custom exception class, you can simplify the error to avoid building a long, type-specific string:

-        raise ValueError(f"Unsupported argument type for BufferRegion: {type(obj)}")
+        raise TypeError("Unsupported argument type for BufferRegion")

Given this is a private helper, changing from ValueError to TypeError is also semantically reasonable (the error is about an unexpected argument type).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 313c2f1 and c240486.

📒 Files selected for processing (4)
  • tilelang/intrinsics/mfma_macro_generator.py (2 hunks)
  • tilelang/intrinsics/mma_macro_generator.py (6 hunks)
  • tilelang/intrinsics/mma_sm70_macro_generator.py (3 hunks)
  • tilelang/utils/__init__.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/intrinsics/mma_sm70_macro_generator.py (2)
tilelang/utils/language.py (1)
  • is_fragment (105-116)
tilelang/intrinsics/mma_macro_generator.py (1)
  • _legalize_to_buffer_region (803-827)
tilelang/utils/__init__.py (1)
tilelang/utils/language.py (1)
  • get_buffer_region_from_load (162-193)
tilelang/intrinsics/mfma_macro_generator.py (1)
tilelang/intrinsics/mma_macro_generator.py (1)
  • _legalize_to_buffer_region (803-827)
🪛 Ruff (0.14.5)
tilelang/utils/__init__.py

18-18: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/intrinsics/mma_macro_generator.py

827-827: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
tilelang/intrinsics/mma_macro_generator.py (1)

85-90: FP64 MMA path initialization and tiling look consistent.

The new fp64 path (a_dtype bits 64) coherently sets M_DIM = 8, n_dim = 8, k_dim = 4, and selects mma_prefix = "m8n8k4". _initialize_micro_size’s fp64 branch (micro tile 8×8, warp_cols = warp_col_tiles // 8) and _initialize_local_size then yield local_size_a == local_size_b == 1, local_size_out == 2, which matches the comments in the fp64 load code. No issues from a tiling/layout standpoint.

Also applies to: 143-174, 127-141

from tilelang import tvm as tvm
from tvm.runtime import convert
from tilelang.utils import is_fragment, to_buffer_region
from tilelang.utils import is_fragment
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

TensorCoreIntrinEmitter uses self._legalize_to_buffer_region without defining it.

In ldmatrix_a (Line 210) and ldmatrix_b (Line 251), you call self._legalize_to_buffer_region(...), but TensorCoreIntrinEmitter in this file does not define that method, nor is it inherited. This will raise an AttributeError when these intrinsics are used.

Given you already have a legalization helper in mma_macro_generator.TensorCoreIntrinEmitter, consider one of:

  • Defining the same helper here as an instance method, e.g.:
 class TensorCoreIntrinEmitter:
@@
     def make_mma_store_layout(self, local_buf: Buffer) -> T.Fragment:
@@
         return T.Fragment(
             shape,
             forward_thread_fn=forward_thread,
             forward_index_fn=forward_index,
         )
+
+    def _legalize_to_buffer_region(
+        self, obj: Buffer | BufferRegion
+    ) -> BufferRegion:
+        if isinstance(obj, BufferRegion):
+            return obj
+        if isinstance(obj, Buffer):
+            mins = [0] * len(obj.shape)
+            ranges = [tvm.ir.Range.from_min_extent(m, e) for m, e in zip(mins, obj.shape)]
+            return BufferRegion(obj, ranges)
+        raise TypeError(f"Unsupported argument type for BufferRegion: {type(obj)}")

(Import tvm.ir.Range as Range and use it, or adapt to your preferred style), or

  • Importing a shared module-level _legalize_to_buffer_region and calling it directly instead of via self.

Until this is addressed, the new region-legalization path here is not usable.

Also applies to: 190-233, 234-285

🤖 Prompt for AI Agents
In tilelang/intrinsics/mma_sm70_macro_generator.py around lines 190-285
(specifically ldmatrix_a at ~210 and ldmatrix_b at ~251),
TensorCoreIntrinEmitter calls self._legalize_to_buffer_region which is not
defined and will raise AttributeError; fix by either adding an instance method
_legalize_to_buffer_region to this class (mirroring the helper from
mma_macro_generator.TensorCoreIntrinEmitter, importing tvm.ir.Range as Range if
needed) or import and call a module-level _legalize_to_buffer_region helper
directly (adjusting call sites to use the imported function), and ensure any
required imports are added and unit tests updated to exercise the legalization
path.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tilelang/intrinsics/mma_macro_generator.py (1)

803-828: LGTM! Well-designed helper for buffer region normalization.

The _legalize_to_buffer_region method correctly handles all three input types:

  • BufferRegion: Pass-through (no conversion needed)
  • Buffer: Creates a full-region covering the entire shape
  • BufferLoad: Attempts extraction via get_buffer_region_from_load, with a sensible fallback to 1-sized ranges for scalar loads

The implementation aligns perfectly with the PR's objective to standardize buffer/region extraction.

Optional refinement: Consider shortening the error message on line 828 to address the Ruff TRY003 hint, e.g.:

-        raise ValueError(f"Unsupported argument type for BufferRegion: {type(obj)}")
+        raise ValueError(f"Unsupported type: {type(obj)}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c240486 and 53c6e47.

📒 Files selected for processing (1)
  • tilelang/intrinsics/mma_macro_generator.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/intrinsics/mma_macro_generator.py (1)
tilelang/utils/language.py (2)
  • is_fragment (105-116)
  • get_buffer_region_from_load (162-193)
🪛 Ruff (0.14.5)
tilelang/intrinsics/mma_macro_generator.py

828-828: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (3)
tilelang/intrinsics/mma_macro_generator.py (3)

6-8: LGTM! Necessary imports for buffer region normalization.

The new imports support the _legalize_to_buffer_region helper method and align with the PR's goal of standardizing buffer/region extraction.

Also applies to: 15-15


248-251: LGTM! Consistent buffer region normalization in ldmatrix_a.

Both the fp64 path (line 248) and main path (line 299) correctly use _legalize_to_buffer_region to standardize buffer/region handling. The extraction pattern is consistent:

  1. Legalize to BufferRegion
  2. Extract buffer and base indices
  3. Use normalized representation for offset calculations

This ensures uniform region-based access across all code paths.

Also applies to: 299-302


365-368: LGTM! Consistent buffer region normalization in ldmatrix_b.

The changes mirror the pattern in ldmatrix_a, applying the same normalization approach for the B matrix in both fp64 (line 365) and main (line 402) paths. This maintains symmetry and consistency across the entire MMA intrinsic generation pipeline.

Also applies to: 402-405

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/intrinsics/mfma_macro_generator.py (1)

736-809: Remove debug print statement at line 769 and consider applying buffer legalization to preshuffle paths.

The print(self.a_preshuffle) at line 769 in _warp_ldmatrix_a_shared is a debug artifact that should be removed.

More importantly, the preshuffle implementations bypass buffer normalization entirely. The parent ldmatrix_a (line 261) calls _legalize_to_buffer_region() to normalize inputs, but the preshuffle overrides assume inputs are already in the correct form and index them directly (e.g., A_buf[l, r, row, col] at line 774). However, call sites like test_tilelang_gemm_mfma_preshuffle.py:155 pass raw Buffer objects. Either:

  • Apply _legalize_to_buffer_region() at the start of preshuffle ldmatrix_a/ldmatrix_b, or
  • Document and enforce that preshuffle paths require pre-normalized inputs from callers.

Same applies to ldmatrix_b (lines 811–886).

🧹 Nitpick comments (3)
testing/python/issue/test_tilelang_issue_830.py (2)

20-30: Good addition of decorator, but device 0 hardcoding persists.

The @tilelang.testing.requires_cuda decorator properly addresses part of the previous review feedback. However, the hardcoded torch.cuda.set_device(0) was also flagged in the past review—in multi-GPU setups, device 0 may not be available. Consider using a more flexible approach:

 @tilelang.testing.requires_cuda
 def test_empty_kernel_lowering():
     # Ensure a valid CUDA runtime context is current on this thread for the
     # target device before using driver API calls. Without this, calls like
     # cuModuleLoadData can fail with CUDA_ERROR_INVALID_CONTEXT, especially
     # for kernels that don't touch any device memory or streams beforehand
     # (e.g., "empty" kernels) and therefore haven't triggered context
     # creation implicitly.
-    torch.cuda.set_device(0)
+    device = torch.cuda.current_device() if torch.cuda.device_count() > 0 else 0
+    torch.cuda.set_device(device)
     kernel = _empty_kernel()
     kernel()

70-77: Address device 0 hardcoding and consider adding brief comment.

The decorator is appropriate, but torch.cuda.set_device(0) is hardcoded without checking device availability (same concern as line 28). Additionally, test_empty_kernel_lowering includes a detailed 6-line explanation of why the device is set, while this test has none—consider adding at least a brief comment or referencing the earlier explanation for consistency.

Apply this diff to address both concerns:

 @tilelang.testing.requires_cuda
 def test_empty_kernel_with_binding_variants():
-    torch.cuda.set_device(0)
+    # Ensure valid CUDA context (see test_empty_kernel_lowering for details)
+    device = torch.cuda.current_device() if torch.cuda.device_count() > 0 else 0
+    torch.cuda.set_device(device)
     kernel = _empty_kernel_with_binding_variants()
     kernel()
tilelang/intrinsics/mfma_macro_generator.py (1)

661-686: Well-structured buffer-region normalization with one minor type hint gap.

The implementation correctly handles all three input types (Buffer, BufferRegion, BufferLoad) with appropriate fallback logic for scalar BufferLoad cases.

However, note that the method accepts BufferLoad but the public methods ldmatrix_a (line 261) and ldmatrix_b (line 307) only document Buffer | BufferRegion in their type hints. Consider updating those signatures to include BufferLoad for consistency:

-def ldmatrix_a(self, A_local_buf, A_shared_buf: Buffer | BufferRegion, ki, rk=0):
+def ldmatrix_a(self, A_local_buf, A_shared_buf: Buffer | BufferRegion | BufferLoad, ki, rk=0):

Optional: Consider simplifying the error message.

Per static analysis (TRY003), you could move the message into a custom exception or shorten it slightly, though this is a minor style preference.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53c6e47 and a0d9cbd.

📒 Files selected for processing (2)
  • testing/python/issue/test_tilelang_issue_830.py (2 hunks)
  • tilelang/intrinsics/mfma_macro_generator.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_830.py
🧬 Code graph analysis (1)
tilelang/intrinsics/mfma_macro_generator.py (2)
tilelang/intrinsics/utils.py (1)
  • mfma_store_index_map (90-91)
tilelang/language/utils.py (1)
  • region (7-10)
🪛 Ruff (0.14.5)
tilelang/intrinsics/mfma_macro_generator.py

686-686: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
tilelang/intrinsics/mfma_macro_generator.py (3)

5-7: LGTM! Imports support the new buffer-region normalization.

The new imports (BufferLoad, Range, get_buffer_region_from_load) are correctly added to support the _legalize_to_buffer_region helper method introduced in this PR.

Also applies to: 10-12


274-277: LGTM! Correct application of buffer-region legalization.

The code properly calls _legalize_to_buffer_region to normalize the input, then extracts the buffer and base indices from the resulting region. This is consistent with the PR's objective to standardize buffer/region extraction.


320-323: LGTM! Consistent buffer-region legalization in ldmatrix_b.

The implementation mirrors ldmatrix_a and correctly applies the normalization pattern.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tilelang/intrinsics/mfma_macro_generator.py (1)

659-684: Well-structured helper for buffer region normalization.

The implementation correctly handles all three input types with appropriate fallback logic for scalar BufferLoad cases. The docstring clearly explains the behavior.

Optional: Simplify error message per Ruff TRY003.

The error message on line 684 is inline and somewhat long. Consider using a simpler message or extracting it:

-        raise ValueError(f"Unsupported argument type for BufferRegion: {type(obj)}")
+        raise ValueError(f"Expected Buffer, BufferRegion, or BufferLoad, got {type(obj).__name__}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0d9cbd and 66a926b.

📒 Files selected for processing (1)
  • tilelang/intrinsics/mfma_macro_generator.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/intrinsics/mfma_macro_generator.py (4)
tilelang/language/ast/ir.py (1)
  • Range (1716-1728)
tilelang/intrinsics/utils.py (1)
  • mfma_store_index_map (90-91)
tilelang/utils/language.py (1)
  • get_buffer_region_from_load (162-193)
tilelang/language/utils.py (1)
  • region (7-10)
🪛 Ruff (0.14.5)
tilelang/intrinsics/mfma_macro_generator.py

684-684: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (2)
tilelang/intrinsics/mfma_macro_generator.py (2)

5-13: LGTM!

The import additions support the new buffer region legalization logic: tir for IntImm, Range for region construction, BufferLoad for type handling, and get_buffer_region_from_load for BufferLoad conversion. These changes align well with the PR's objective to standardize region extraction.


272-273: LGTM!

The calls to _legalize_to_buffer_region correctly standardize the shared buffer inputs to BufferRegion before extracting buffer and region information. The pattern is consistent across both ldmatrix_a and ldmatrix_b methods.

Also applies to: 318-319

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tilelang/tileop/gemm/gemm_tcgen05.py (1)

97-101: Use idiomatic None check for optional Buffer.

Line 98 uses if mbar == 0: to check for the absence of the mbar buffer. For Buffer objects, the idiomatic check is if not mbar: or if mbar is None:. The == 0 comparison is non-standard and may not behave as expected if the Buffer class implements custom equality operators.

Apply this diff:

-        mbar = self.mbar
-        if mbar == 0:
-            raise ValueError("TCGEN5MMA requires a valid mbarrier")
+        mbar = self.mbar
+        if not mbar:
+            raise ValueError("TCGEN5MMA requires a valid mbarrier")
src/op/gemm_py.h (1)

32-33: Consider preserving optional semantics for mbar_.

The change from std::optional<tir::Buffer> mbar_ to tir::Buffer mbar_ removes explicit optional semantics. Since mbar_ is only conditionally set in the constructor (src/op/gemm_py.cc:83-91) when args.size() > 16, an unset mbar_ will be a default-constructed (null) Buffer.

This makes it harder to distinguish "mbar not provided" from "mbar is null," reducing type safety. The Python code in gemm_tcgen05.py:98 still treats it as optional (if mbar == 0:), suggesting the optional nature is semantically important.

Consider one of these approaches:

  1. Restore std::optional (preferred for type safety):
-  tir::Buffer mbar_; // mbar is optional, only used for TCGEN5MMA
+  std::optional<tir::Buffer> mbar_; // mbar is optional, only used for TCGEN5MMA
  1. Add validation if keeping non-optional:
    Ensure all access sites check if (mbar_.defined()) before use, and update Python comparison from == 0 to proper null checks.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66a926b and 099aee6.

📒 Files selected for processing (4)
  • src/op/gemm_py.cc (2 hunks)
  • src/op/gemm_py.h (3 hunks)
  • tilelang/tileop/gemm/gemm_base.py (1 hunks)
  • tilelang/tileop/gemm/gemm_tcgen05.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/op/gemm_py.cc (1)
src/op/utils.cc (2)
  • NormalizeToBufferRegion (15-53)
  • NormalizeToBufferRegion (15-15)
src/op/gemm_py.h (1)
src/op/gemm_py.cc (1)
  • GemmPy (53-95)
tilelang/tileop/gemm/gemm_tcgen05.py (2)
tilelang/tileop/gemm/gemm_base.py (3)
  • mbar (127-128)
  • mbarptr (123-124)
  • clear_accum (107-108)
tilelang/tileop/gemm/__init__.py (1)
  • clear_accum (125-126)
🪛 Ruff (0.14.5)
tilelang/tileop/gemm/gemm_tcgen05.py

99-99: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (2)
tilelang/tileop/gemm/gemm_base.py (1)

126-128: LGTM!

The new mbar property follows the established pattern for exposing optional buffer attributes and correctly returns None when the attribute is absent.

src/op/gemm_py.cc (1)

83-91: I'm unable to complete the verification due to technical limitations preventing repository access. The codebase cannot be cloned or queried through any available methods at this time.

Unable to verify mbar argument handling—repository access failure prevents codebase inspection.

The review comment requests verification that all GemmPy call sites have been updated to pass region-compatible BufferLoad expressions rather than raw access_ptr(...) calls for the mbar argument. However, the codebase is currently inaccessible, making it impossible to:

  • Locate and inspect GemmPy instantiation sites
  • Verify how mbar arguments are constructed in Python wrapper code (e.g., gemm_tcgen05.py)
  • Confirm whether the constructor's BufferLoad requirement at line 84 is satisfied by all callers

This verification requires direct code inspection that cannot be performed at this time.

@LeiWang1999
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/op/gemm.cc
Comment thread src/op/gemm_py.cc
@LeiWang1999 LeiWang1999 merged commit f5d9da4 into tile-ai:main Nov 26, 2025
6 checks passed
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Feb 6, 2026
* [Example] Add GQA decoding kernel with varlen page table (#1265)

* [Example] Add page table for gqa decode

* [Example] Page table for varlen decoding

* [Lint]

* [Refactor] Remove redundant code

* [Lint]

* [Lint]

* [Lint]

* [Refactor] add support for numpy dtype conversion (#1255)

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* [EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max numerical stability (#1148)

* Keep the max of all blocks seen in scores_max for stability

* ruff formatting

* [Docs] Improve Installation Guide (#1270)

* [Docs] Improve installation guide

* address comments

* [Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stablity (#1269)

* Implement max score retention across blocks in FlashAttention for improved stability

* fix manual pipeline parameters

* Update examples/flash_attention/example_gqa_fwd_varlen.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix typo

* more

* fix a previous typo

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [Bugfix] Fix multiple cg defination when using T.sync_grid (#1272)

* [Minor] Remove from __future__ import annotations for python 3.8 (#1273)

* [BugFix] Adding extra parameters into autotune hashkey (#1274)

* [BugFix] Adding extra parameters into autotune hashkey

* lint

* None check

* check serializable

* Fix various issues under `int64_t` static and dynamic shape. (#1218)

* Fix various issues under int64_t static and dynamic shape.

* Resolve reviewed issues.

* Add unit test.

* fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Bug fix for Gated Delta Net benchmark script (#1267)

* fix argument order for fla chunk_gated_delta_rule_fwd_h

* explicit import assert_similar from utils

* rename utils module to avoid name clash

* set store_final_state and save_new_value to True

* fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Minor fix for some cases (#1278)

* [Language] Add shape check in `T.view/reshape` (#1277)

* [Language] Add shape check in T.view/reshape

* address comments

* [FFI] Use tvm ffi as the default execution backend (#1259)

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* lint fix

* lint fix

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

* [Bugfix] Supply missing `T.print` for bool type (#1279)

* fix for bool dtype

* lint fix

* fix

* ci fix

* [Fix] Fix memory leak bug (#1281)

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* fix memory leak bug

* fix lint error

* add comments

* fix lint error

* remove duplicated, because tilelang doesn't dependent deprecated

* [Enhancement] Enhance CUDA compilation by integrating pass context configuration (#1283)

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.

* Fix the bug in issue #1266 (#1284)

Co-authored-by: cheeryBloosm <liu_yu_hao@126.com>

* [Language][UX] Nested loop checker in pre-lowering stage (#1288)

* [Language][UX] Nested loop checker in pre-lowering stage

* rename

* comment

* address comments

* [Compatibility] Support CUDA 11.3 (#1290)

* [Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation (#1285)

* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n)

* issue fix

* fix

* fix

* decreate nproc for debugging

---------

Co-authored-by: Lei Wang <leiwang1999@outlook.com>

* [Feat] add support for passing reference in T.Var annotation (#1291)

* [Enhancement] Shared Memory Size Can be Dynamic (#1294)

* bugfix

* lint fix

* test

* lint fix

* increate procs

* recover

* [Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 (#1305)

* [Feat] add missing support of uint32x2

* [Feat] Add `T.Ref` annotation and tests

* fix lint error

* minor update for error message on twice decl

* Remove unused let_bindings_ in CodeGenC to fix #1300

* [Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures (#1306)

* [Fix] Fix frame scope error in T.macro (#1308)

* [Fix] Fix #1307 by adding macro inside function

* fix lint error

* add comments and fix lint error

* Remove debug print from enter_frame method

Removed debug print statement from enter_frame method.

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [WIP] support more dtypes for tcgen05 (#1229)

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

* Improve memory access safety and `T.assume` handling (#1292)

* Improve memory access safety and T.assume handling

* Improve memory access safety and T.assume handling

* bugfix

* lint fix

* bugfix

* bugfix

* refactor legalize safe memory access pass

---------

Co-authored-by: Lei Wang <leiwang1999@outlook.com>

* [Bugfix] Fix autotune cache (#1315)

* [Refactor] Backup Analyzer to get the appropriate arith informations (#1311)

* [Refactor] Update Vectorization Functions to Accept Analyzer Parameter

- Modified `VectorizeLoop` and related functions to accept an `arith::Analyzer` parameter, enhancing their capability to perform analysis during vectorization.
- Updated multiple instances in `copy.cc`, `fill.cc`, `parallel.cc`, and layout inference files to utilize the new analyzer parameter for improved performance and correctness.
- Ensured consistency across vectorization logic by integrating the analyzer into existing workflows, facilitating better optimization opportunities.

* [Fix] Corrected PostOrderVisit call in loop_vectorize.cc

- Updated the PostOrderVisit function to analyze the body of the loop node instead of the node itself, ensuring proper handling of nested loops during vectorization analysis.

* fix

* lint fix

* fix

* Revert "[WIP] support more dtypes for tcgen05 (#1229)" (#1323)

This reverts commit 0d101c110f74ebf2ef8c11a5ece9dfb314b48baa.

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [CI]: Bump actions/checkout from 5 to 6 (#1319)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [CI]: Bump pypa/cibuildwheel from 3.2 to 3.3 (#1318)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Installation] Fix building using customized TVM path (#1326)

* [Release] Allow developer with write permission to trigger wheel release (#1322)

* [Feat] Support warp reduce (#1316)

* [Feat] Support warp reduce

* lint

* add test

* lint

* [Enhancement] Support more dtype in `T.print` (#1329)

* [Enhancement] Support more dtype in `T.print`

* upd

* upd

* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape (#1321)

* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape

* remove debug lines

* remove rubbish

* Fix decorator syntax for atomic_different_memory_orders_program

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Fix] fix wrong uint narrowing bug in tvm in #1310 (#1320)

* [Refactor] Disable strided buffer load inside tvm (#1301) (#1332)

* [Refactor] Moving `NormalizeToBufferRegion` and `MakeAccessPtrFromRegion` to utils (#1333)

* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* [Fix] Fix bug copying from or to local buffer (#1304) (#1324)

* [Fix] fix copy from or to local buffer (#1304)

* fix lint error

* minor fix testing script

* [Language][UX] Semantic check for parallel fragment access (#1338)

* Add unit tests for T.assume (#1341)

* Add test for T.assume

* Add unit test for T.assume

* Add unit test for T.assume

* Add unit tests for T.assume

* Remove debug print for kernel source

Remove print statement for kernel source in tests.

* Update test_tilelang_language_assume.py

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Feat] Extend LegalizeNegativeIndex to support buffer store stmts (#1339)

This commit enhances the LegalizeNegativeIndex transformation pass to handle
both buffer load and store operations with negative indices and adds some
test cases.

* [Refactor] Phaseout vmap for Tile Operators (#1334)

* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.

* fix

* Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.

* fix

* fix

* test fix

* lint fix

* Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.

* fix

* lint fix

* fix

* fix

* test fix

* lint fix

* lint fix

* minor fix

* fix

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 (#1327)

* feat: add fp8 variants; add placeholder for fp6/fp4 in meta

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

* fix: when atom-m!=128, enable_ws

* fix: typo in tcgen05 meta; dispatch in gemm sm100

* [Refactor] Enhance CopyNode's IterVar Creation and Range Handling (#1346)

* [Refactor] Enhance CopyNode's IterVar Creation and Range Handling

This commit refines the `MakeIterVars` method in `CopyNode` to select base ranges based on memory scope levels, ensuring that the chosen ranges are not smaller than the original source ranges. Additionally, it updates the Python `copy` function to clarify range handling, including broadcasting logic and extent alignment. These changes improve the robustness and clarity of the copy operation's implementation.

* test fix

* [Fix] Fix missing `not` rewrite in frontend (#1348)

* [Enhancement] Add support for k_pack in gemm_mfma (#1344)

* add support for k_pack

* support benchmark on ROCm

* fix format

* Add sparse fine-tuning kernel for deepseek sparse attention to example (#1296)

* [EXAMPLE] add example for dsa sparse finetuning

* [Refactor]

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352)

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder

This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.

* [Enhancement] Update matmul kernel and optimize argument binding

This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.

* lint fix

* [Enhancement] Add tensor checks documentation and improve argument binding assertions

This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.

* [Enhancement] Update .gitignore and refine matmul kernel for improved performance

This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.

* lint fix

* lint fix

* [Refactor] Simplify tensor_null_test function and remove ptr_null_test

This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.

* lint fix

* fix

* [Refactor] Simplify index sign state handling in LegalizeNegativeIndex (#1354)

This commit refines the logic for determining the sign state of indices in the LegalizeNegativeIndex transformation. It prioritizes vector patterns, specifically Ramp and Broadcast nodes, to avoid compile-time lane queries. The handling of scalar indices is also streamlined, ensuring clearer diagnostics when non-negativity cannot be proven. These changes enhance the robustness and clarity of index handling in the transformation pass.

* [Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356)

This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.

* [Bugfix] Disable floordiv optimization due to integer overflow risk (#1355)

* disable overflow-prone floordiv optimization in lower_intrin.cc

* disable overflow-prone floordiv optimization in lower_intrin.cc

* [Bugfix] Fix the jit_kernel issue (#1357)

* [Bugfix] Fix the jit_kernel issue

* Update README.md

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Refactor] Update Fragment Indexing in ParallelOpNode's InferLayout Method (#1359)

This commit refines the Fragment creation process in the InferLayout method of ParallelOpNode. It removes the unnecessary forward_index array and utilizes default fragment indexing for consistency with other operations. Additionally, it binds the thread range to enhance comparability across different operations.

* [Analysis] Enhance NestedLoopChecker with tile op cases (#1358)

* [Analysis] Enhance NestedLoopChecker with tile op cases

* fix tileop issue

* [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056)

* [misc] add a cpp side wrapper for gemm_sp_py

* [misc] typing

* [IR] bind GemmSPWarpPolicy

* [chore] add wrapper code

* [IR] fix GemmSPWarpPolicy

* [codegen] apply ptxas instructions

* [intrinsic] add typical (unused) mma layout

* [template] add uint16 debug func

* [intrinsic] add b matrix layout

* [gemm_sp] enable fp16/bf16 on sm8x

* [layout] refactor fp16/bf16 layout

* [gemm_sp] enable int8

* [chore] update test case dtype

* [gemm_sp] enable fp32

* [layout] refactor layouts

* [intrinsic] enable ldmatrix for mat A

* [layout] enable ldsm for matrix b

* [layout] add ldmatrix for fp32 and fp8

* [chore] refine

* [chore] refactor

* [chore] add fp8 efactor

* [chore] refactor

* [chore] add remove negative zero util

* [example] add a custom compress kernel

* [chore] minor update

* [test] refactor gemm_sp test

* [refactor] make metadata layout func

* [example] add option for using cutlass layout

* [doc] add a gemm_sp doc

* [doc] minor polish

* [chore] remove unused

* [bugfix] fix non replicate b case

* [test] refactor

* [chore] add a check

* [bugfix] fix util bug

* [wip] init a new test case for v2

* [chore] minor refactor

* [chore] minor update

* [bugfix] enable 16bit rs

* [language] enable rs

* [language] enable gemm_sp_sr

* [language] enable gemm_sp_rr

* [test] enable more tests

* [tvm] update ffi binding

* [chore] remove print

* [chore] fix benchmark script

* [lint] precommit lint

* [chore] apply feedback

* [test] use arch 8.0

* [chore] rollback ::ordered_metadata for backward compatibility

* [bugfix] fix captialized

* [example] keep gemm_sp on hopper

* [test] fix no fp8 normal kernel

* [test] reduce matmul size to satisfy accum error

* [test] use cal_diff for assertion

* [bugfix] expand float8 type

* [lib] add make_int4 for short type

* [language] add transpose E

* [bugfix] fix wrong var

* [format] format

* [chore] refactor binding

* [chore] fix wrong passing var

* [Bugfix] Update TIR registration for GemmSPPy to use tile operation (#1361)

* [Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360)

* [Enhancement] Implement dynamic unroll factor in CUDA code generation

This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics.

* lint fix

* [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation

* [CI] [pre-commit.ci] autoupdate (#1362)

updates:
- [github.com/pre-commit/mirrors-clang-format: v21.1.2 → v21.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.2...v21.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.14.3 → v0.14.7](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.3...v0.14.7)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Bugfix] Remove debug print in PyStmtFunctionVisitor  (#1363)

* [Debug] Always include line info in NVCC command for improved profiling and mapping (#1364)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py (#1365)

* [Enhancement] Add DISABLE_CACHE environment variables (#1368)

* [Refactor]: Remove useless include in atomicadd_vectorize.h (#1371)

* [Refactor] Generalize fp8 process (#1372)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Layout] Enhance Free Layout Inference (#1375)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Enhancement] Add injective layout detection and exception handling

- Introduced `DetectInjective` method in `FragmentNode` to check for injective layouts.
- Added `LoopLayoutInjectiveException` to handle errors related to non-injective layouts.
- Updated `InferLayout` methods in `ParallelOpNode` to utilize injective checks and log relevant information.
- Refactored layout inference queue management to use `std::deque` for improved performance and added prioritization logic for buffer layouts.

* remove debug print

* remove debug print

* remove debug print

* minor layout fix

* fix for T.view

* [Enhancement] Improve injective layout detection in FragmentNode

- Updated the `DetectInjective` method to handle symbolic dimensions more effectively by introducing a mechanism to collect symbolic shapes and adjust the detection level accordingly.
- Added logging for cases where the layout detection falls back to NoCheck due to symbolic dimensions.
- Minor update to the test file to include the tilelang testing module.

* [Refactor] Simplify layout inference for bulk copy operations

- Removed unnecessary conditions for bulk load/store operations in the layout inference logic.
- Streamlined the handling of layout application for bulk copy instances to enhance clarity and maintainability.

* remove debug print

* [Enhancement] Introduce layout-related exceptions and improve error handling

- Added `LayoutConflictException` and `LoopLayoutInjectiveException` classes for better exception management in layout operations.
- Updated `InferLayout` method in `ParallelOpNode` to throw `LoopLayoutInjectiveException` with detailed error information when injective layout checks fail.
- Removed redundant exception class definitions from `parallel.h` to streamline code organization.

* [Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations (#1376)

* Update submodule TVM to latest commit and add PlanAndUpdateBufferAllocationLocation function to transform module

- Updated the TVM submodule to commit 3a32b763.
- Added a new function `PlanAndUpdateBufferAllocationLocation` in the transform module to facilitate buffer allocation planning within PrimFuncs.

* Refactor buffer allocation code for improved readability and consistency

- Updated formatting and spacing in `plan_update_buffer_allocation_location.cc` for better code clarity.
- Standardized the use of pointer and reference syntax across various class methods.
- Enhanced comments for better understanding of buffer allocation logic.
- Removed unnecessary lines and improved overall code structure.

* Refactor buffer allocation checks for improved clarity

- Replaced size checks with empty checks for `ffi::Array<Buffer>` in `plan_update_buffer_allocation_location.cc` to enhance code readability.
- Updated conditions in multiple methods to use `empty()` instead of comparing size to zero, streamlining the logic.

* [Tool] Provide layout visualization tool (#1353)

* Provide layout visualization tool

Adds a layout visualization tool to TileLang, which helps users understand and debug the layout transformations applied during compilation.

This tool visualizes the memory layout of tensors at different stages of the compilation process, allowing developers to identify potential inefficiencies and optimize their code for better performance.

The visualization can be enabled via a pass config option.

* format

* add layout visual example

* Adds vis extra with matplotlib dependency

* rafactor pass config name

* fix lint

* Enables configurable layout visualization formats

Allows users to specify the output formats (png, pdf, svg) for layout visualization through a pass config option.

This change provides more flexibility in how layout visualizations are generated, allowing users to choose the formats that best suit their needs.

It also fixes a bug where layout visualization was not correctly disabled when the config option was set to "false".

* Adds visual layout inference tool docs

* fix lint

* fix lint

* Rafactor configurable layout visualization formats

* fix lint

* fix typo

* add some comments

* fix lints

* add some warnings for user

* Moves layout visualization

* Refactors layout visualization pass configuration

Updates the layout visualization pass configuration to use boolean flag for enabling and a string for specifying formats.

* Enables multiple layout visualization formats

* Updates layout visualization docs

* Moves layout visualization to analysis

* [Release] Relax constraint of tvm-ffi to compatible version (#1373)

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Language] Tilelang LazyJIT Experimental Version (#1337)

* initial step

* modify builder

* scratch version of new frontend

* write some tests

* add many tests

* add typing stub for tir.ir

* remove idents

* minor update

* minor update

* First version of jitv2 (renamed to LazyJIT)

* fix pre-commit error

* minor fix

* fix lint error

* fix lint error

* Fix conditional check for PrimFunc instance

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Builder] Enhance variable name binding and scope management (#1378)

- Improved handling of TVM Var/Buffer names to prevent out-of-scope errors when reusing Python names across different for-frames.
- Added assertions to ensure variables are defined within the correct control flow frame, enhancing error checking and code reliability.

* [Bugfix] make cuda driver api compat with cuda12/13, along with tests (#1379)

* [Fix] typo in cuda attr (#1380)

* [Bugfix] make cuda driver api compat with cuda12/13, along with tests

* fix typo in cudaDevAttr

* [Language V2] Minor fix for complex annotations (#1381)

* [Release] Bump Version into 0.1.7 (#1377)

* Update VERSION to 0.1.7

* Update Python version in distribution scripts to support CPython 3.9 and log output

* [Typing] Enhance compatibility for advanced typing features in Python (#1382)

- Updated `allocate.py` and `annot.py` to improve compatibility with Python 3.9 and later by conditionally importing advanced typing features such as `TypeVarTuple`, `Unpack`, and `ParamSpec`.
- Added fallback imports from `typing_extensions` for environments using earlier Python versions.
- Improved handling of generic alias detection to ensure consistent behavior across different Python versions.

* [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path (#1385)

* [Build] Update CMake configuration for tilelang_cython_wrapper installation

- Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
- Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
- Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.

* [Build] Standardize output directories for tilelang libraries

- Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
- This change enhances organization and ensures that all build artifacts are located in a unified directory structure.

* [BugFix] Fix split kernel layout bug of GQA decode (#1386)

* [BugFix] Fix split kernel layout bug of GQA decode

* [BugFix] Avoid local with Parallel; use robust fragment instead

* [Enhancement] Add debug output methods for Layout and Fragment classes (#1392)

* [Doc] Update logging docs (#1395)

* [Enhancement] Refactor inflight computing to support dynamic pipeline extents (#1399)

* [Build] Update CMake configuration for tilelang_cython_wrapper installation

- Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
- Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
- Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.

* [Build] Standardize output directories for tilelang libraries

- Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
- This change enhances organization and ensures that all build artifacts are located in a unified directory structure.

* [Refactor] Update TVM subproject and enhance pipeline loop handling

- Updated the TVM subproject to commit 90581fe9e5287bbcf1844ad14255a1e1e8cdf7f0.
- Added new fields to `PipelineAnnotation` and `RewrittenBlockInfo` structures to track original statement indices and improve async state management.
- Refactored `EmitImpl` and `PopulateWaitCounts` methods to enhance clarity and functionality, including better handling of commit groups and wait counts.
- Simplified access index calculations and strengthened analyzer constraints for loop bounds.

* [Cleanup] Remove license block and unused includes from inject_pipeline.cc

- Eliminated the Apache license block from the top of the file to streamline the code.
- Removed unused include directives for memory and stringstream to enhance code clarity and reduce unnecessary dependencies.

* [Refactor] Enhance transformation pipeline and test execution

- Added an additional Simplify transformation in the InjectSoftwarePipeline to improve optimization.
- Updated the test file to call `test_trival_pipeline()` directly, commenting out the previous main execution for better test isolation.

* [AMD] Fix 3 bugs when build docker on amd mi3x gpu (#1401)

* [Typo] Fix tilelang link in README.md (#1402)

* [Dependency] Update apache-tvm-ffi version to >=0.1.2 (#1400)

* [Dependency] Update apache-tvm-ffi version to >=0.1.2 in project files

* [Dependency] Update subproject commit for TVM to latest version afc07935

* [Enhancement] Add support for optional step parameter in loop constructs

- Updated loop creation functions to accept an optional step parameter, enhancing flexibility in loop definitions.
- Modified ForFrame implementations to utilize the new step parameter across various loop types including serial, parallel, and pipelined loops.
- Adjusted related vectorization transformations to accommodate the step parameter, ensuring consistent behavior in loop vectorization processes.

* lint fix

* [AMD] Enable FA2 fwd on AMD MI300X (#1406)

* enable FA2 on AMD MI300X

* make lint happy

* [TypoFix] fix typo for SM120 (#1408)

* [Doc] Minor documentation update (#1410)

* [Dependency] Add torch-c-dlpack-ext to project requirements (#1403)

* [Dependency] Add torch-c-dlpack-ext to project requirements

* Added torch-c-dlpack-ext to both pyproject.toml and requirements.txt to provide prebuilt torch extensions, which may prevent JIT compilation on first import of TVM FFI.

* [Build] Update manylinux images in project configuration

* Changed the manylinux image for x86_64 from "manylinux2014" to "manylinux_2_28" in both pyproject.toml and the Dockerfile to align with updated standards for compatibility and performance.

* [Build] Update CUDA repository configuration in pyproject.toml

* Changed the package manager command from `yum-config-manager` to `dnf config-manager` for adding the CUDA repository, ensuring compatibility with newer systems.

* fix

* [Build] Update CUDA repository to RHEL 8

* Changed the CUDA repository configuration in both pyproject.toml and the manylinux Dockerfile from RHEL 7 to RHEL 8, ensuring compatibility with newer systems.

* test: run out of space

* use cu130 to reduce size

* upd

* upd comment

* upd

---------

Co-authored-by: Your Name <wenji.yyc@alibaba-inc.com>

* [Dependency] Update TVM subproject to latest commit 2b1ead1a (#1412)

* [Enhancement] Introduce `T.__ldg` (#1414)

* [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads

* Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
* Updated the corresponding documentation and added support in both CUDA and HIP code generation.
* Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.

* [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment

* Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
* Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
* Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.

* lint fix

* [Enhancement] Improve vectorization invariant check (#1398)

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Add some vectorize tests and comments

* [Lint] Phaseout Yapf format and embrace ruff format (#1417)

* [Atomic] Use ptr for atomicAdd dst instead of reference (#1425)

* [Enhancement] Update AtomicAdd function signature to accept pointer to destination

* Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
* Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
* Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.

* [Enhancement] Refactor AtomicAddRet function signature to accept pointer

* Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
* Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.

* lint fix

* [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer

* Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
* Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
* Ensured compatibility with the new pointer-based approach for atomic operations.

* lint fix

* test fix

* lint fix

* [CUDA] Add read-only parameter annotation for CUDA codegen (#1416)

* [Enhancement] Add read-only parameter annotation for CUDA codegen

* Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
* Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
* Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.

* lint fix

* [Dependency] Update apache-tvm-ffi version to >=0.1.3

* Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
* Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.

* lint fix

* [Enhancement] Refactor ReadWriteMarker for improved parameter handling

* Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
* Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
* Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
* Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.

* lint fix

* Update tma_load function to use const qualifier for global memory pointer

* Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
* This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.

* Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity

* Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.

* [Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429)

* Phase out primitives

* revert changes

* Refactor GemmWarpPolicy method signature for clarity

Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.

* fix

* [CI]: Bump actions/upload-artifact from 5 to 6 (#1431)

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [CI]: Bump actions/download-artifact from 6 to 7 (#1432)

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 6 to 7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v6...v7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Bugfix] Convey  `compile_flags` to ffi compilation path with pass_configs (#1434)

* [Enhancement] Add device compile flags support in pass configuration

* Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
* Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
* Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
* Documented the new pass configuration key for clarity on usage and expected input formats.

* lint fix

* [Refactor] Simplify compile_flags handling in JIT functions

* Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
* Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.

* lint fix

* fix

* [Enhancement] Improve buffer usage tracking in MakePackedAPI (#1435)

* Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
* Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
* Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.

* [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice (#1405)

* [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice

* address comments

* fix

* fix submodule

* fix

* fix 3rdparty

* [Enhancement] Include PrimFunc name in memory cache logs for better debugging (#1437)

* Added the `get_prim_func_name` utility to extract human-readable function names from TVM PrimFuncs.
* Updated memory cache logging in `AutoTuner` and `KernelCache` classes to include the kernel name, improving clarity during cache hits.
* Enhanced debug logging to provide more informative messages when checking disk cache for kernels.

* [CI] Update lint dependencies and fix lint on trunk (#1433)

* [CI] Update pre-commit hooks

* [Lint] Pass correct `exclude-header-filter` to `clang-tidy`

* [Lint] Download latest `run-clang-tidy` script

* [CI] Show compile commands

* [CI] Add output grouping to GHA

* [Lint] Re-order pre-commit hooks

* [Enhancement] Refactor vectorization checks in loop_vectorize (#1440)

* Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
* Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
* Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.

* Enhance vectorized conversion support (#1438)

* [Feature] Support region as input of T.cumsum (#1426)

* [Feature] Support region as input of T.cumsum

- Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
- This enables operations on buffer slices/regions like:
  T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
- Update cumsum_fragment to handle region inputs properly
- Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes

Fixes #879

* Fix formatting and add docstring for cumsum_fragment

- Add comprehensive docstring for cumsum_fragment function
- Format code according to ruff style guidelines

* Fix CodeRabbit review issues

- Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
- Add src/dst shape compatibility validation for out-of-place cumsum
- Update copy() type annotation to accept BufferRegion as dst parameter
- Fix test in-place mutation issues by using out-of-place cumsum operations
- Add non-divisible size test cases for tail region coverage

* Fix out-of-bounds access in region tests

- Add bounds clamping using T.min() for chunk_end calculations
- Prevents accessing beyond tensor bounds for non-divisible sizes
- Matches reference implementation behavior
- Fixes both 1D and 2D region test cases

* Fix region test: use simple slice expressions instead of T.min()

- Remove T.min() which cannot be used directly in slice indices
- Use chunk_start + chunk_size form instead
- Rely on system's automatic bounds checking for non-divisible sizes
- Update comments to reflect this approach

* Fix cumsum region: use region extents in lowering and update tests for shared memory

* Simplify fragment scope check using is_fragment()

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Fix] Fix analyzer bind conflicting (#1446)

* [Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444)

* [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method

- Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
- Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
- Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.

* [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types

- Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
- Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
- Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
- Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
- Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
- Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.

* lint fix

* Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.

* [Enhancement] Add support for FP4 and FP8 types in CUDA code generation

- Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
- Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
- Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
- Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
- Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Refactor] Use `pytest.mark.parameterize` to speedup parallel testing (#1447)

* Refactor GEMM tests to use parameterized pytest fixtures

- Converted multiple test cases for GEMM operations in `test_tilelang_tilelibrary_gemm_sp.py` to use `pytest.mark.parametrize` for better maintainability and readability.
- Similar refactoring applied to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests.
- This change reduces code duplication and enhances the clarity of test configurations.

* Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [Docs] Improve installation instructions for developers (#1450)

* [Feat] Integrate Z3 in TVM Arith Analyzer (#1367)

* [Bugfix] Improve autotune from elementwise_add function in examples (#1445)

* Remove JIT decorator from elementwise_add function in examples

* fix kernel compilation without autotune

* Refactor main function to accept parameters and update tests for autotune option

* Refactor autotune test function for morden style

* [Language] Introduce `T.annotate_restrict_buffers` (#1428)

* [Enhancement] Introduce non-restrict parameter support in code generation

- Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
- Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
- Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
- Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
- Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.

* [Enhancement] Improve HoistNonRestrictParams transformation

- Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
- Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
- Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
- Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.

* [Refactor] Simplify non-restrict parameter handling in code generation

- Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
- Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
- Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.

* [Dependency] Update TVM subproject to latest commit 68aa8461

- Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
- Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
- Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.

* fix

* [Analyzer] Require loop extent > 0 when entering loop (#1451)

* Updat ROCm CI to Nightly-ROCm-7.1 (#1449)

* [Enhancement] Update examples and tests for improved type handling functionality (#1448)

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.

* lint fix

* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples

* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.

* [Issue Template] Enable blank issues in GitHub issue template(#1453)

* [CI] Moved the clang-tidy step to after pip install (#1456)

* [Bug] Fix tvm build script when patchelf is not found #1459)

* [Analyzer] Fix floordiv & floormod bug in z3 prover (#1458)

* fix floordiv & floormod in z3 prover

* fix lint error

* [Cache] Rename sparse compress cache directory (#1460)

* Enhance cache directory structure by including version information in sparse.py to ensure separate caches for different versions.

* Fix formatting in sparse.py by adding a newline for improved readability and consistency.

* [Language]Adds a random number generation capability through curand_kernel (#1461)

* add curand.{curand_init, curand}

* run format.sh

* add default value for curand_init & add test for curand

* Update testing/python/language/test_rand.py

Remove unused thread binding

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* remove unused library

* enable tilelang cache for testing

* run format.sh

* Revert "run format.sh"

This reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a.

* Revert "enable tilelang cache for testing"

This reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae.

* Revert "remove unused library"

This reverts commit 568ad20611f039380113937fd131151a2bffd801.

* run format.sh

* ensure FreshName for __philox_state

* ensure FreshName for __philox_state
…
chengyupku added a commit to tile-ai/tilescale that referenced this pull request Feb 6, 2026
* Enhance threadblock swizzle templates with default offset parameter and streamline parser.py for better readability

* [Cache] Rename sparse compress cache directory

* Temporarily exclude sink tests from non-distributed example tests in CI to address timeout issues

* [DeepEP] Move deepep benchmark to example and allow compatible with new version DeepEP

* [Feat] Enhance `T.st` to support intra-node store to peer's symm memory

* use strided loop to simplify get_dispatch a bit

* [Feat] Support warp reduce operators

* draft notify dispatch

* rename and refactor `T.barrier/sync_blocks`

* fix prev typo

* [Feat] Add `get_device_tensor` function and related test

* support elect_one_sync() and add test

* draft dispatch

* suupport ld, st, warp_sync, continue and add test

* support warp vote and add test

* support device-side wait_ne

* refactor T.wait_* and refine dispatch test logic

* intra-node dispatch test passed

* draft combine

* support massage-only debug print

* intra-node combine test passed

* unify dispatch, migrate topk_idx to u64, support cached dispatch

* Refactor to pre-alloc buffers and expose interface, add benchmark

* remove redundant test

* update doc

* use int4 vectorization for dispatch

* use comm_stream for comm kernels

* optimze dispatch perf via skipping tensor validation

* add dispatch benchmark result

* make rank as an argument of the kernel

* use cuda postproc for vectorization in combine

* support int4 ld/st ptx in cuda template

* [Feat] Support auto vectorization for ld/st to optimize combine to surpass deepep

* lint

* upd doc

* make ci happy

* fix review issues

* fix import error

* Add DeepEP submodule and installation script for CI

* fix ci bug

* [Sync] Merge mainstream TileLang TVM-FFI features into TileScale (#47)

* [Example] Add GQA decoding kernel with varlen page table (#1265)

* [Example] Add page table for gqa decode

* [Example] Page table for varlen decoding

* [Lint]

* [Refactor] Remove redundant code

* [Lint]

* [Lint]

* [Lint]

* [Refactor] add support for numpy dtype conversion (#1255)

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* [EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max numerical stability (#1148)

* Keep the max of all blocks seen in scores_max for stability

* ruff formatting

* [Docs] Improve Installation Guide (#1270)

* [Docs] Improve installation guide

* address comments

* [Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stablity (#1269)

* Implement max score retention across blocks in FlashAttention for improved stability

* fix manual pipeline parameters

* Update examples/flash_attention/example_gqa_fwd_varlen.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix typo

* more

* fix a previous typo

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [Bugfix] Fix multiple cg defination when using T.sync_grid (#1272)

* [Minor] Remove from __future__ import annotations for python 3.8 (#1273)

* [BugFix] Adding extra parameters into autotune hashkey (#1274)

* [BugFix] Adding extra parameters into autotune hashkey

* lint

* None check

* check serializable

* Fix various issues under `int64_t` static and dynamic shape. (#1218)

* Fix various issues under int64_t static and dynamic shape.

* Resolve reviewed issues.

* Add unit test.

* fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* Bug fix for Gated Delta Net benchmark script (#1267)

* fix argument order for fla chunk_gated_delta_rule_fwd_h

* explicit import assert_similar from utils

* rename utils module to avoid name clash

* set store_final_state and save_new_value to True

* fix

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Bugfix] Minor fix for some cases (#1278)

* [Language] Add shape check in `T.view/reshape` (#1277)

* [Language] Add shape check in T.view/reshape

* address comments

* [FFI] Use tvm ffi as the default execution backend (#1259)

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* lint fix

* lint fix

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

* [Bugfix] Supply missing `T.print` for bool type (#1279)

* fix for bool dtype

* lint fix

* fix

* ci fix

* [Fix] Fix memory leak bug (#1281)

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* fix memory leak bug

* fix lint error

* add comments

* fix lint error

* remove duplicated, because tilelang doesn't dependent deprecated

* [Enhancement] Enhance CUDA compilation by integrating pass context configuration (#1283)

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.

* Fix the bug in issue #1266 (#1284)

Co-authored-by: cheeryBloosm <liu_yu_hao@126.com>

* [Language][UX] Nested loop checker in pre-lowering stage (#1288)

* [Language][UX] Nested loop checker in pre-lowering stage

* rename

* comment

* address comments

* [Compatibility] Support CUDA 11.3 (#1290)

* [Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation (#1285)

* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n)

* issue fix

* fix

* fix

* decreate nproc for debugging

---------

Co-authored-by: Lei Wang <leiwang1999@outlook.com>

* [Feat] add support for passing reference in T.Var annotation (#1291)

* [Enhancement] Shared Memory Size Can be Dynamic (#1294)

* bugfix

* lint fix

* test

* lint fix

* increate procs

* recover

* [Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 (#1305)

* [Feat] add missing support of uint32x2

* [Feat] Add `T.Ref` annotation and tests

* fix lint error

* minor update for error message on twice decl

* Remove unused let_bindings_ in CodeGenC to fix #1300

* [Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures (#1306)

* [Fix] Fix frame scope error in T.macro (#1308)

* [Fix] Fix #1307 by adding macro inside function

* fix lint error

* add comments and fix lint error

* Remove debug print from enter_frame method

Removed debug print statement from enter_frame method.

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [WIP] support more dtypes for tcgen05 (#1229)

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

* Improve memory access safety and `T.assume` handling (#1292)

* Improve memory access safety and T.assume handling

* Improve memory access safety and T.assume handling

* bugfix

* lint fix

* bugfix

* bugfix

* refactor legalize safe memory access pass

---------

Co-authored-by: Lei Wang <leiwang1999@outlook.com>

* [Bugfix] Fix autotune cache (#1315)

* [Refactor] Backup Analyzer to get the appropriate arith informations (#1311)

* [Refactor] Update Vectorization Functions to Accept Analyzer Parameter

- Modified `VectorizeLoop` and related functions to accept an `arith::Analyzer` parameter, enhancing their capability to perform analysis during vectorization.
- Updated multiple instances in `copy.cc`, `fill.cc`, `parallel.cc`, and layout inference files to utilize the new analyzer parameter for improved performance and correctness.
- Ensured consistency across vectorization logic by integrating the analyzer into existing workflows, facilitating better optimization opportunities.

* [Fix] Corrected PostOrderVisit call in loop_vectorize.cc

- Updated the PostOrderVisit function to analyze the body of the loop node instead of the node itself, ensuring proper handling of nested loops during vectorization analysis.

* fix

* lint fix

* fix

* Revert "[WIP] support more dtypes for tcgen05 (#1229)" (#1323)

This reverts commit 0d101c110f74ebf2ef8c11a5ece9dfb314b48baa.

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [CI]: Bump actions/checkout from 5 to 6 (#1319)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [CI]: Bump pypa/cibuildwheel from 3.2 to 3.3 (#1318)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Installation] Fix building using customized TVM path (#1326)

* [Release] Allow developer with write permission to trigger wheel release (#1322)

* [Feat] Support warp reduce (#1316)

* [Feat] Support warp reduce

* lint

* add test

* lint

* [Enhancement] Support more dtype in `T.print` (#1329)

* [Enhancement] Support more dtype in `T.print`

* upd

* upd

* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape (#1321)

* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape

* remove debug lines

* remove rubbish

* Fix decorator syntax for atomic_different_memory_orders_program

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Fix] fix wrong uint narrowing bug in tvm in #1310 (#1320)

* [Refactor] Disable strided buffer load inside tvm (#1301) (#1332)

* [Refactor] Moving `NormalizeToBufferRegion` and `MakeAccessPtrFromRegion` to utils (#1333)

* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* [Fix] Fix bug copying from or to local buffer (#1304) (#1324)

* [Fix] fix copy from or to local buffer (#1304)

* fix lint error

* minor fix testing script

* [Language][UX] Semantic check for parallel fragment access (#1338)

* Add unit tests for T.assume (#1341)

* Add test for T.assume

* Add unit test for T.assume

* Add unit test for T.assume

* Add unit tests for T.assume

* Remove debug print for kernel source

Remove print statement for kernel source in tests.

* Update test_tilelang_language_assume.py

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Feat] Extend LegalizeNegativeIndex to support buffer store stmts (#1339)

This commit enhances the LegalizeNegativeIndex transformation pass to handle
both buffer load and store operations with negative indices and adds some
test cases.

* [Refactor] Phaseout vmap for Tile Operators (#1334)

* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.

* fix

* Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.

* fix

* fix

* test fix

* lint fix

* Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.

* fix

* lint fix

* fix

* fix

* test fix

* lint fix

* lint fix

* minor fix

* fix

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 (#1327)

* feat: add fp8 variants; add placeholder for fp6/fp4 in meta

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

* fix: when atom-m!=128, enable_ws

* fix: typo in tcgen05 meta; dispatch in gemm sm100

* [Refactor] Enhance CopyNode's IterVar Creation and Range Handling (#1346)

* [Refactor] Enhance CopyNode's IterVar Creation and Range Handling

This commit refines the `MakeIterVars` method in `CopyNode` to select base ranges based on memory scope levels, ensuring that the chosen ranges are not smaller than the original source ranges. Additionally, it updates the Python `copy` function to clarify range handling, including broadcasting logic and extent alignment. These changes improve the robustness and clarity of the copy operation's implementation.

* test fix

* [Fix] Fix missing `not` rewrite in frontend (#1348)

* [Enhancement] Add support for k_pack in gemm_mfma (#1344)

* add support for k_pack

* support benchmark on ROCm

* fix format

* Add sparse fine-tuning kernel for deepseek sparse attention to example (#1296)

* [EXAMPLE] add example for dsa sparse finetuning

* [Refactor]

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352)

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder

This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.

* [Enhancement] Update matmul kernel and optimize argument binding

This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.

* lint fix

* [Enhancement] Add tensor checks documentation and improve argument binding assertions

This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.

* [Enhancement] Update .gitignore and refine matmul kernel for improved performance

This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.

* lint fix

* lint fix

* [Refactor] Simplify tensor_null_test function and remove ptr_null_test

This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.

* lint fix

* fix

* [Refactor] Simplify index sign state handling in LegalizeNegativeIndex (#1354)

This commit refines the logic for determining the sign state of indices in the LegalizeNegativeIndex transformation. It prioritizes vector patterns, specifically Ramp and Broadcast nodes, to avoid compile-time lane queries. The handling of scalar indices is also streamlined, ensuring clearer diagnostics when non-negativity cannot be proven. These changes enhance the robustness and clarity of index handling in the transformation pass.

* [Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356)

This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.

* [Bugfix] Disable floordiv optimization due to integer overflow risk (#1355)

* disable overflow-prone floordiv optimization in lower_intrin.cc

* disable overflow-prone floordiv optimization in lower_intrin.cc

* [Bugfix] Fix the jit_kernel issue (#1357)

* [Bugfix] Fix the jit_kernel issue

* Update README.md

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Refactor] Update Fragment Indexing in ParallelOpNode's InferLayout Method (#1359)

This commit refines the Fragment creation process in the InferLayout method of ParallelOpNode. It removes the unnecessary forward_index array and utilizes default fragment indexing for consistency with other operations. Additionally, it binds the thread range to enhance comparability across different operations.

* [Analysis] Enhance NestedLoopChecker with tile op cases (#1358)

* [Analysis] Enhance NestedLoopChecker with tile op cases

* fix tileop issue

* [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056)

* [misc] add a cpp side wrapper for gemm_sp_py

* [misc] typing

* [IR] bind GemmSPWarpPolicy

* [chore] add wrapper code

* [IR] fix GemmSPWarpPolicy

* [codegen] apply ptxas instructions

* [intrinsic] add typical (unused) mma layout

* [template] add uint16 debug func

* [intrinsic] add b matrix layout

* [gemm_sp] enable fp16/bf16 on sm8x

* [layout] refactor fp16/bf16 layout

* [gemm_sp] enable int8

* [chore] update test case dtype

* [gemm_sp] enable fp32

* [layout] refactor layouts

* [intrinsic] enable ldmatrix for mat A

* [layout] enable ldsm for matrix b

* [layout] add ldmatrix for fp32 and fp8

* [chore] refine

* [chore] refactor

* [chore] add fp8 efactor

* [chore] refactor

* [chore] add remove negative zero util

* [example] add a custom compress kernel

* [chore] minor update

* [test] refactor gemm_sp test

* [refactor] make metadata layout func

* [example] add option for using cutlass layout

* [doc] add a gemm_sp doc

* [doc] minor polish

* [chore] remove unused

* [bugfix] fix non replicate b case

* [test] refactor

* [chore] add a check

* [bugfix] fix util bug

* [wip] init a new test case for v2

* [chore] minor refactor

* [chore] minor update

* [bugfix] enable 16bit rs

* [language] enable rs

* [language] enable gemm_sp_sr

* [language] enable gemm_sp_rr

* [test] enable more tests

* [tvm] update ffi binding

* [chore] remove print

* [chore] fix benchmark script

* [lint] precommit lint

* [chore] apply feedback

* [test] use arch 8.0

* [chore] rollback ::ordered_metadata for backward compatibility

* [bugfix] fix captialized

* [example] keep gemm_sp on hopper

* [test] fix no fp8 normal kernel

* [test] reduce matmul size to satisfy accum error

* [test] use cal_diff for assertion

* [bugfix] expand float8 type

* [lib] add make_int4 for short type

* [language] add transpose E

* [bugfix] fix wrong var

* [format] format

* [chore] refactor binding

* [chore] fix wrong passing var

* [Bugfix] Update TIR registration for GemmSPPy to use tile operation (#1361)

* [Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360)

* [Enhancement] Implement dynamic unroll factor in CUDA code generation

This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics.

* lint fix

* [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation

* [CI] [pre-commit.ci] autoupdate (#1362)

updates:
- [github.com/pre-commit/mirrors-clang-format: v21.1.2 → v21.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v21.1.2...v21.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.14.3 → v0.14.7](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.3...v0.14.7)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Bugfix] Remove debug print in PyStmtFunctionVisitor  (#1363)

* [Debug] Always include line info in NVCC command for improved profiling and mapping (#1364)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py (#1365)

* [Enhancement] Add DISABLE_CACHE environment variables (#1368)

* [Refactor]: Remove useless include in atomicadd_vectorize.h (#1371)

* [Refactor] Generalize fp8 process (#1372)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Layout] Enhance Free Layout Inference (#1375)

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Enhancement] Add injective layout detection and exception handling

- Introduced `DetectInjective` method in `FragmentNode` to check for injective layouts.
- Added `LoopLayoutInjectiveException` to handle errors related to non-injective layouts.
- Updated `InferLayout` methods in `ParallelOpNode` to utilize injective checks and log relevant information.
- Refactored layout inference queue management to use `std::deque` for improved performance and added prioritization logic for buffer layouts.

* remove debug print

* remove debug print

* remove debug print

* minor layout fix

* fix for T.view

* [Enhancement] Improve injective layout detection in FragmentNode

- Updated the `DetectInjective` method to handle symbolic dimensions more effectively by introducing a mechanism to collect symbolic shapes and adjust the detection level accordingly.
- Added logging for cases where the layout detection falls back to NoCheck due to symbolic dimensions.
- Minor update to the test file to include the tilelang testing module.

* [Refactor] Simplify layout inference for bulk copy operations

- Removed unnecessary conditions for bulk load/store operations in the layout inference logic.
- Streamlined the handling of layout application for bulk copy instances to enhance clarity and maintainability.

* remove debug print

* [Enhancement] Introduce layout-related exceptions and improve error handling

- Added `LayoutConflictException` and `LoopLayoutInjectiveException` classes for better exception management in layout operations.
- Updated `InferLayout` method in `ParallelOpNode` to throw `LoopLayoutInjectiveException` with detailed error information when injective layout checks fail.
- Removed redundant exception class definitions from `parallel.h` to streamline code organization.

* [Enhancement] Introduce buffer var lca analysis for pass plan buffer allocations (#1376)

* Update submodule TVM to latest commit and add PlanAndUpdateBufferAllocationLocation function to transform module

- Updated the TVM submodule to commit 3a32b763.
- Added a new function `PlanAndUpdateBufferAllocationLocation` in the transform module to facilitate buffer allocation planning within PrimFuncs.

* Refactor buffer allocation code for improved readability and consistency

- Updated formatting and spacing in `plan_update_buffer_allocation_location.cc` for better code clarity.
- Standardized the use of pointer and reference syntax across various class methods.
- Enhanced comments for better understanding of buffer allocation logic.
- Removed unnecessary lines and improved overall code structure.

* Refactor buffer allocation checks for improved clarity

- Replaced size checks with empty checks for `ffi::Array<Buffer>` in `plan_update_buffer_allocation_location.cc` to enhance code readability.
- Updated conditions in multiple methods to use `empty()` instead of comparing size to zero, streamlining the logic.

* [Tool] Provide layout visualization tool (#1353)

* Provide layout visualization tool

Adds a layout visualization tool to TileLang, which helps users understand and debug the layout transformations applied during compilation.

This tool visualizes the memory layout of tensors at different stages of the compilation process, allowing developers to identify potential inefficiencies and optimize their code for better performance.

The visualization can be enabled via a pass config option.

* format

* add layout visual example

* Adds vis extra with matplotlib dependency

* rafactor pass config name

* fix lint

* Enables configurable layout visualization formats

Allows users to specify the output formats (png, pdf, svg) for layout visualization through a pass config option.

This change provides more flexibility in how layout visualizations are generated, allowing users to choose the formats that best suit their needs.

It also fixes a bug where layout visualization was not correctly disabled when the config option was set to "false".

* Adds visual layout inference tool docs

* fix lint

* fix lint

* Rafactor configurable layout visualization formats

* fix lint

* fix typo

* add some comments

* fix lints

* add some warnings for user

* Moves layout visualization

* Refactors layout visualization pass configuration

Updates the layout visualization pass configuration to use boolean flag for enabling and a string for specifying formats.

* Enables multiple layout visualization formats

* Updates layout visualization docs

* Moves layout visualization to analysis

* [Release] Relax constraint of tvm-ffi to compatible version (#1373)

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Language] Tilelang LazyJIT Experimental Version (#1337)

* initial step

* modify builder

* scratch version of new frontend

* write some tests

* add many tests

* add typing stub for tir.ir

* remove idents

* minor update

* minor update

* First version of jitv2 (renamed to LazyJIT)

* fix pre-commit error

* minor fix

* fix lint error

* fix lint error

* Fix conditional check for PrimFunc instance

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Builder] Enhance variable name binding and scope management (#1378)

- Improved handling of TVM Var/Buffer names to prevent out-of-scope errors when reusing Python names across different for-frames.
- Added assertions to ensure variables are defined within the correct control flow frame, enhancing error checking and code reliability.

* [Bugfix] make cuda driver api compat with cuda12/13, along with tests (#1379)

* [Fix] typo in cuda attr (#1380)

* [Bugfix] make cuda driver api compat with cuda12/13, along with tests

* fix typo in cudaDevAttr

* [Language V2] Minor fix for complex annotations (#1381)

* [Release] Bump Version into 0.1.7 (#1377)

* Update VERSION to 0.1.7

* Update Python version in distribution scripts to support CPython 3.9 and log output

* [Typing] Enhance compatibility for advanced typing features in Python (#1382)

- Updated `allocate.py` and `annot.py` to improve compatibility with Python 3.9 and later by conditionally importing advanced typing features such as `TypeVarTuple`, `Unpack`, and `ParamSpec`.
- Added fallback imports from `typing_extensions` for environments using earlier Python versions.
- Improved handling of generic alias detection to ensure consistent behavior across different Python versions.

* [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path (#1385)

* [Build] Update CMake configuration for tilelang_cython_wrapper installation

- Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
- Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
- Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.

* [Build] Standardize output directories for tilelang libraries

- Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
- This change enhances organization and ensures that all build artifacts are located in a unified directory structure.

* [BugFix] Fix split kernel layout bug of GQA decode (#1386)

* [BugFix] Fix split kernel layout bug of GQA decode

* [BugFix] Avoid local with Parallel; use robust fragment instead

* [Enhancement] Add debug output methods for Layout and Fragment classes (#1392)

* [Doc] Update logging docs (#1395)

* [Enhancement] Refactor inflight computing to support dynamic pipeline extents (#1399)

* [Build] Update CMake configuration for tilelang_cython_wrapper installation

- Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
- Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
- Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.

* [Build] Standardize output directories for tilelang libraries

- Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
- This change enhances organization and ensures that all build artifacts are located in a unified directory structure.

* [Refactor] Update TVM subproject and enhance pipeline loop handling

- Updated the TVM subproject to commit 90581fe9e5287bbcf1844ad14255a1e1e8cdf7f0.
- Added new fields to `PipelineAnnotation` and `RewrittenBlockInfo` structures to track original statement indices and improve async state management.
- Refactored `EmitImpl` and `PopulateWaitCounts` methods to enhance clarity and functionality, including better handling of commit groups and wait counts.
- Simplified access index calculations and strengthened analyzer constraints for loop bounds.

* [Cleanup] Remove license block and unused includes from inject_pipeline.cc

- Eliminated the Apache license block from the top of the file to streamline the code.
- Removed unused include directives for memory and stringstream to enhance code clarity and reduce unnecessary dependencies.

* [Refactor] Enhance transformation pipeline and test execution

- Added an additional Simplify transformation in the InjectSoftwarePipeline to improve optimization.
- Updated the test file to call `test_trival_pipeline()` directly, commenting out the previous main execution for better test isolation.

* [AMD] Fix 3 bugs when build docker on amd mi3x gpu (#1401)

* [Typo] Fix tilelang link in README.md (#1402)

* [Dependency] Update apache-tvm-ffi version to >=0.1.2 (#1400)

* [Dependency] Update apache-tvm-ffi version to >=0.1.2 in project files

* [Dependency] Update subproject commit for TVM to latest version afc07935

* [Enhancement] Add support for optional step parameter in loop constructs

- Updated loop creation functions to accept an optional step parameter, enhancing flexibility in loop definitions.
- Modified ForFrame implementations to utilize the new step parameter across various loop types including serial, parallel, and pipelined loops.
- Adjusted related vectorization transformations to accommodate the step parameter, ensuring consistent behavior in loop vectorization processes.

* lint fix

* [AMD] Enable FA2 fwd on AMD MI300X (#1406)

* enable FA2 on AMD MI300X

* make lint happy

* [TypoFix] fix typo for SM120 (#1408)

* [Doc] Minor documentation update (#1410)

* [Dependency] Add torch-c-dlpack-ext to project requirements (#1403)

* [Dependency] Add torch-c-dlpack-ext to project requirements

* Added torch-c-dlpack-ext to both pyproject.toml and requirements.txt to provide prebuilt torch extensions, which may prevent JIT compilation on first import of TVM FFI.

* [Build] Update manylinux images in project configuration

* Changed the manylinux image for x86_64 from "manylinux2014" to "manylinux_2_28" in both pyproject.toml and the Dockerfile to align with updated standards for compatibility and performance.

* [Build] Update CUDA repository configuration in pyproject.toml

* Changed the package manager command from `yum-config-manager` to `dnf config-manager` for adding the CUDA repository, ensuring compatibility with newer systems.

* fix

* [Build] Update CUDA repository to RHEL 8

* Changed the CUDA repository configuration in both pyproject.toml and the manylinux Dockerfile from RHEL 7 to RHEL 8, ensuring compatibility with newer systems.

* test: run out of space

* use cu130 to reduce size

* upd

* upd comment

* upd

---------

Co-authored-by: Your Name <wenji.yyc@alibaba-inc.com>

* [Dependency] Update TVM subproject to latest commit 2b1ead1a (#1412)

* [Enhancement] Introduce `T.__ldg` (#1414)

* [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads

* Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
* Updated the corresponding documentation and added support in both CUDA and HIP code generation.
* Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.

* [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment

* Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
* Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
* Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.

* lint fix

* [Enhancement] Improve vectorization invariant check (#1398)

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Improve loop vectorize

* Add some vectorize tests and comments

* [Lint] Phaseout Yapf format and embrace ruff format (#1417)

* [Atomic] Use ptr for atomicAdd dst instead of reference (#1425)

* [Enhancement] Update AtomicAdd function signature to accept pointer to destination

* Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
* Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
* Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.

* [Enhancement] Refactor AtomicAddRet function signature to accept pointer

* Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
* Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.

* lint fix

* [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer

* Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
* Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
* Ensured compatibility with the new pointer-based approach for atomic operations.

* lint fix

* test fix

* lint fix

* [CUDA] Add read-only parameter annotation for CUDA codegen (#1416)

* [Enhancement] Add read-only parameter annotation for CUDA codegen

* Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
* Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
* Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.

* lint fix

* [Dependency] Update apache-tvm-ffi version to >=0.1.3

* Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
* Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.

* lint fix

* [Enhancement] Refactor ReadWriteMarker for improved parameter handling

* Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
* Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
* Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
* Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.

* lint fix

* Update tma_load function to use const qualifier for global memory pointer

* Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
* This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.

* Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity

* Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.

* [Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429)

* Phase out primitives

* revert changes

* Refactor GemmWarpPolicy method signature for clarity

Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.

* fix

* [CI]: Bump actions/upload-artifact from 5 to 6 (#1431)

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [CI]: Bump actions/download-artifact from 6 to 7 (#1432)

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 6 to 7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v6...v7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Bugfix] Convey  `compile_flags` to ffi compilation path with pass_configs (#1434)

* [Enhancement] Add device compile flags support in pass configuration

* Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
* Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
* Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
* Documented the new pass configuration key for clarity on usage and expected input formats.

* lint fix

* [Refactor] Simplify compile_flags handling in JIT functions

* Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
* Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.

* lint fix

* fix

* [Enhancement] Improve buffer usage tracking in MakePackedAPI (#1435)

* Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
* Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
* Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.

* [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice (#1405)

* [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice

* address comments

* fix

* fix submodule

* fix

* fix 3rdparty

* [Enhancement] Include PrimFunc name in memory cache logs for better debugging (#1437)

* Added the `get_prim_func_name` utility to extract human-readable function names from TVM PrimFuncs.
* Updated memory cache logging in `AutoTuner` and `KernelCache` classes to include the kernel name, improving clarity during cache hits.
* Enhanced debug logging to provide more informative messages when checking disk cache for kernels.

* [CI] Update lint dependencies and fix lint on trunk (#1433)

* [CI] Update pre-commit hooks

* [Lint] Pass correct `exclude-header-filter` to `clang-tidy`

* [Lint] Download latest `run-clang-tidy` script

* [CI] Show compile commands

* [CI] Add output grouping to GHA

* [Lint] Re-order pre-commit hooks

* [Enhancement] Refactor vectorization checks in loop_vectorize (#1440)

* Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
* Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
* Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.

* Enhance vectorized conversion support (#1438)

* [Feature] Support region as input of T.cumsum (#1426)

* [Feature] Support region as input of T.cumsum

- Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
- This enables operations on buffer slices/regions like:
  T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
- Update cumsum_fragment to handle region inputs properly
- Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes

Fixes #879

* Fix formatting and add docstring for cumsum_fragment

- Add comprehensive docstring for cumsum_fragment function
- Format code according to ruff style guidelines

* Fix CodeRabbit review issues

- Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
- Add src/dst shape compatibility validation for out-of-place cumsum
- Update copy() type annotation to accept BufferRegion as dst parameter
- Fix test in-place mutation issues by using out-of-place cumsum operations
- Add non-divisible size test cases for tail region coverage

* Fix out-of-bounds access in region tests

- Add bounds clamping using T.min() for chunk_end calculations
- Prevents accessing beyond tensor bounds for non-divisible sizes
- Matches reference implementation behavior
- Fixes both 1D and 2D region test cases

* Fix region test: use simple slice expressions instead of T.min()

- Remove T.min() which cannot be used directly in slice indices
- Use chunk_start + chunk_size form instead
- Rely on system's automatic bounds checking for non-divisible sizes
- Update comments to reflect this approach

* Fix cumsum region: use region extents in lowering and update tests for shared memory

* Simplify fragment scope check using is_fragment()

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

* [Fix] Fix analyzer bind conflicting (#1446)

* [Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444)

* [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method

- Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
- Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
- Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.

* [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types

- Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
- Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
- Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
- Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
- Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
- Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.

* lint fix

* Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.

* [Enhancement] Add support for FP4 and FP8 types in CUDA code generation

- Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
- Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
- Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
- Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
- Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.

---------

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Refactor] Use `pytest.mark.parameterize` to speedup parallel testing (#1447)

* Refactor GEMM tests to use parameterized pytest fixtures

- Converted multiple test cases for GEMM operations in `test_tilelang_tilelibrary_gemm_sp.py` to use `pytest.mark.parametrize` for better maintainability and readability.
- Similar refactoring applied to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests.
- This change reduces code duplication and enhances the clarity of test configurations.

* Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [Docs] Improve installation instructions for developers (#1450)

* [Feat] Integrate Z3 in TVM Arith Analyzer (#1367)

* [Bugfix] Improve autotune from elementwise_add function in examples (#1445)

* Remove JIT decorator from elementwise_add function in examples

* fix kernel compilation without autotune

* Refactor main function to accept parameters and update tests for autotune option

* Refactor autotune test function for morden style

* [Language] Introduce `T.annotate_restrict_buffers` (#1428)

* [Enhancement] Introduce non-restrict parameter support in code generation

- Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
- Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
- Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
- Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
- Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.

* [Enhancement] Improve HoistNonRestrictParams transformation

- Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
- Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
- Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
- Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.

* [Refactor] Simplify non-restrict parameter handling in code generation

- Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
- Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
- Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.

* [Dependency] Update TVM subproject to latest commit 68aa8461

- Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
- Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
- Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.

* fix

* [Analyzer] Require loop extent > 0 when entering loop (#1451)

* Updat ROCm CI to Nightly-ROCm-7.1 (#1449)

* [Enhancement] Update examples and tests for improved type handling functionality (#1448)

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants