[Backend] Add metal backend #799
Conversation
Walkthrough

Add Apple Metal (MPS) support across build, JIT, adapters, lowering, runtime, tests, CI and packaging; resolve prebuilt TVM library locations at configure time; add device utilities and generalize profiler/autotuner/cache behavior for the CUDA and MPS backends.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant JIT as tilelang.jit
    participant WR as TL Wrapper
    participant LWR as Lowering
    participant TVM as TVM (target.build.metal)
    participant AD as MetalKernelAdapter
    participant MPS as torch.mps runtime
    Dev->>JIT: jit(func, target="metal", execution_backend="torch")
    JIT->>JIT: validate is_metal_target -> require backend=="torch"
    JIT->>WR: wrap(scheduled_ir, source, target=metal)
    WR-->>JIT: TLMetalSourceWrapper
    JIT->>LWR: device_codegen_without_compile(target=metal)
    LWR->>TVM: call target.build.metal(device_mod, target)
    TVM-->>JIT: device_mod + kernel_source
    JIT->>AD: create MetalKernelAdapter(func_or_mod, device_mod, kernel_source)
    Dev->>AD: call kernel(...)
    AD->>AD: on first call: torch.mps.compile_shader(kernel_source)
    AD->>MPS: launch compiled kernel with threads/group_size
    MPS-->>Dev: result
    note over AD: compiled launcher cached for subsequent calls
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
While some tasks remain, the main implementation is ready for review @LeiWang1999
Actionable comments posted: 11
🧹 Nitpick comments (14)
tilelang/carver/arch/cdna.py (1)
35-38: Sort `__all__` (RUF022). Keep exports alphabetized for consistency and to satisfy Ruff.
Apply:

```diff
-__all__ = [
-    'is_cdna_arch',
-    'CDNA',
-]
+__all__ = [
+    'CDNA',
+    'is_cdna_arch',
+]
```

tilelang/carver/arch/cpu.py (1)
23-26: Sort `__all__` (RUF022). Align with other arch modules.

```diff
-__all__ = [
-    'is_cpu_arch',
-    'CPU',
-]
+__all__ = [
+    'CPU',
+    'is_cpu_arch',
+]
```

tilelang/contrib/cc.py (1)
93-95: Add typing/docstring to `is_mac()`. Small polish; helps usage elsewhere.

```diff
-def is_mac():
-    return platform.system() == 'Darwin'
+def is_mac() -> bool:
+    "True if running on macOS (Darwin)."
+    return platform.system() == "Darwin"
```

tilelang/carver/arch/metal.py (1)
12-15: Sort `__all__` (RUF022). Match the pattern used in other arch modules.

```diff
-__all__ = [
-    'is_metal_arch',
-    'METAL',
-]
+__all__ = [
+    'METAL',
+    'is_metal_arch',
+]
```

tilelang/carver/arch/cuda.py (1)
150-159: Sort `__all__` (RUF022). Keep exports alphabetized.

```diff
-__all__ = [
-    'is_cuda_arch',
-    'is_volta_arch',
-    'is_ampere_arch',
-    'is_ada_arch',
-    'is_hopper_arch',
-    'is_tensorcore_supported_precision',
-    'has_mma_support',
-    "CUDA",
-]
+__all__ = [
+    "CUDA",
+    'has_mma_support',
+    'is_ada_arch',
+    'is_ampere_arch',
+    'is_cuda_arch',
+    'is_hopper_arch',
+    'is_tensorcore_supported_precision',
+    'is_volta_arch',
+]
```

tilelang/utils/target.py (3)
1-1: Import `platform.machine()` for fallback. Some Python builds return an empty arch from `mac_ver()`; fall back to `machine()`.

```diff
-from platform import mac_ver
+from platform import mac_ver, machine
```
46-52: Harden Metal detection. Use `machine()` as a fallback when `mac_ver()[2]` is empty.

```diff
-def check_metal_availability() -> bool:
-    mac_release, _, arch = mac_ver()
-    if not mac_release:
-        return False
-    # todo: check torch version?
-    return arch == 'arm64'
+def check_metal_availability() -> bool:
+    mac_release, _, arch = mac_ver()
+    if not mac_release:
+        return False
+    # Fallback for environments where mac_ver() doesn't report machine.
+    arch = arch or machine()
+    return arch == "arm64"
```
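The fallback logic above can be sanity-checked without macOS hardware by factoring the decision into a pure helper. This is a sketch with the platform values injected; `metal_available` is a hypothetical name, not part of the PR:

```python
from platform import mac_ver, machine

def metal_available(release: str, arch: str, fallback_arch: str) -> bool:
    # No macOS release string means we're not on a Mac at all.
    if not release:
        return False
    # Some Python builds leave mac_ver()'s machine field empty; fall back.
    arch = arch or fallback_arch
    return arch == "arm64"

def check_metal_availability() -> bool:
    release, _, arch = mac_ver()
    return metal_available(release, arch, machine())
```

Separating the decision from the platform calls makes both the happy path and the empty-arch fallback unit-testable on any OS.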
84-87: Error message: say "Metal" (and keep it short per TRY003). Also consider updating the `determine_target` docstring to mention Metal.

```diff
-    else:
-        raise ValueError("No CUDA or HIP or MPS available on this system.")
+    else:
+        raise ValueError("No CUDA, HIP, or Metal available on this system.")
```

Docstring tweak (outside hunk):

```python
"""
Determine the appropriate target for compilation (CUDA, HIP, Metal, or manual selection).
"""
```
184-186: Parity: add a Metal branch to `device_codegen` as well (or explicitly gate it). `device_codegen_without_compile` supports "metal", but `device_codegen` lacks a corresponding branch; enabling device compile for Metal will currently raise.
Proposed addition (outside this hunk) in `device_codegen()`:

```python
elif target.kind.name == "metal":
    device_mod = tvm.ffi.get_global_func("target.build.metal")(device_mod, target)
```

install_metal.sh (1)
20-22: Consider more conservative parallelism settings. Using half the logical CPU cores might be too aggressive for some systems. Consider a safer default, or allow configuration via an environment variable.

```diff
-CORES=$(sysctl -n hw.logicalcpu)
-MAKE_JOBS=$(( CORES / 2 ))
-make -j${MAKE_JOBS}
+CORES=$(sysctl -n hw.logicalcpu)
+MAKE_JOBS=${MAKE_JOBS:-$(( CORES / 2 ))}
+echo "Building with ${MAKE_JOBS} parallel jobs"
+make -j${MAKE_JOBS}
```

tilelang/carver/arch/__init__.py (1)
40-55: Sort the `__all__` list for better maintainability. The static analysis correctly identified that `__all__` should be sorted for consistency and easier maintenance.

```diff
 __all__ = [
-    'is_cpu_arch',
-    'is_cuda_arch',
-    'is_volta_arch',
-    'is_ampere_arch',
-    'is_ada_arch',
-    'is_hopper_arch',
-    'is_tensorcore_supported_precision',
-    'has_mma_support',
-    'is_cdna_arch',
-    'is_metal_arch',
+    'CDNA',
+    'CPU',
     'CUDA',
-    'CDNA',
     'METAL',
-    'CPU',
+    'has_mma_support',
+    'is_ada_arch',
+    'is_ampere_arch',
+    'is_cdna_arch',
+    'is_cpu_arch',
+    'is_cuda_arch',
+    'is_hopper_arch',
+    'is_metal_arch',
+    'is_tensorcore_supported_precision',
+    'is_volta_arch',
 ]
```

tilelang/jit/adapter/wrapper.py (1)
1008-1028: Consider adding a docstring and error handling. The new `TLMetalSourceWrapper` class lacks documentation and error handling. Consider adding:

- A class docstring explaining its purpose
- Type hints validation
- Error handling for Metal-specific operations

```diff
 class TLMetalSourceWrapper(object):
+    """
+    A wrapper class for the TileLang Metal backend.
+
+    This wrapper handles Metal shader code generation for Apple Silicon GPUs.
+    Unlike CUDA/HIP wrappers, it performs minimal processing as Metal compilation
+    is handled downstream by PyTorch MPS.
+    """

     def __init__(self,
                  scheduled_ir_module: IRModule,
                  source: str,
                  target: Target,
                  device_mod: Optional[IRModule] = None,
                  host_mod: Optional[IRModule] = None,
                  pass_configs: Optional[Dict[str, Any]] = None):
+        if not is_metal_target(target):
+            raise ValueError(f"Expected Metal target, got {target.kind.name}")
         self.mod = scheduled_ir_module
         self.target = target
         self.source = source
         self.pass_configs = pass_configs
         self.device_mod = device_mod
         self.host_mod = host_mod
         self.lib_code = self.update_lib_code(source)

     def update_lib_code(self, code: str):
+        """Update and return the library code without modification.
+
+        Metal shaders don't require the host function wrapping that
+        CUDA/HIP backends need.
+        """
         self.lib_code = code
         return self.lib_code
```

tilelang/jit/adapter/torch/metal.py (1)
14-26: Clean up commented-out parameters or document why they're excluded. The constructor has many commented-out parameters. Either remove them entirely or add a comment explaining why they're excluded from the Metal implementation.

```diff
     def __init__(
         self,
-        # params: List[KernelParam],
-        # result_idx: List[int],
-        # target: Union[str, Target],
         func_or_mod: Union[tir.PrimFunc, tvm.IRModule],
-        # host_mod: Optional[tvm.IRModule] = None,
         device_mod: Optional[tvm.IRModule] = None,
         kernel_global_source: Optional[str] = None,
         verbose: bool = False,
-        # pass_configs: Optional[Dict[str, Any]] = None,
-        # compile_flags: Optional[List[str]] = None
     ):
+        """Initialize MetalKernelAdapter for PyTorch MPS execution.
+
+        Note: Unlike other adapters, Metal doesn't require params, result_idx,
+        or compilation flags as these are handled by PyTorch MPS.
+
+        Args:
+            func_or_mod: The function or module to compile
+            device_mod: Device module containing kernel metadata
+            kernel_global_source: Metal shader source code
+            verbose: Enable verbose logging
+        """
```

testing/python/metal/test_metal_codegen.py (1)
70-72: Consider a more appropriate tolerance for float16. An absolute tolerance of 1.0 for float16 seems too permissive for GEMM accuracy testing. Consider using a relative tolerance or a smaller absolute tolerance.

```diff
 def test_gemm_float16():
-    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
+    # Use relative tolerance for float16 to better handle different magnitude results
+    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1e-2)
```
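As a rough back-of-envelope for why `atol=1` is loose: if accumulation happens in fp16, rounding error in a K-length dot product with O(1) random inputs grows roughly like sqrt(K)·eps. This is a heuristic, not a bound, and the helper name is ours, not from the PR:

```python
import math

def fp16_gemm_abs_error_estimate(k: int, magnitude: float = 1.0) -> float:
    # fp16 has a 10-bit mantissa, so unit roundoff is about 2**-10 ~= 9.8e-4.
    eps_fp16 = 2.0 ** -10
    # Random rounding errors in a K-term sum accumulate roughly like sqrt(K).
    return math.sqrt(k) * eps_fp16 * magnitude

# For the K=1024 GEMM in the test, the expected error is around 0.03,
# well below the atol=1 currently used.
estimate = fp16_gemm_abs_error_estimate(1024)
```

If the backend accumulates in fp32 (common for GEMM), the expected error is smaller still, which strengthens the case for a tighter tolerance.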
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 409ab83 and 0eba92f6ad86fca30149d937543b01069784eafd.
📒 Files selected for processing (19)
- CMakeLists.txt (3 hunks)
- install_metal.sh (1 hunks)
- testing/python/metal/test_metal_codegen.py (1 hunks)
- tilelang/cache/kernel_cache.py (1 hunks)
- tilelang/carver/arch/__init__.py (3 hunks)
- tilelang/carver/arch/cdna.py (1 hunks)
- tilelang/carver/arch/cpu.py (1 hunks)
- tilelang/carver/arch/cuda.py (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
- tilelang/contrib/cc.py (2 hunks)
- tilelang/engine/lower.py (1 hunks)
- tilelang/jit/__init__.py (3 hunks)
- tilelang/jit/adapter/__init__.py (1 hunks)
- tilelang/jit/adapter/cython/adapter.py (3 hunks)
- tilelang/jit/adapter/torch/metal.py (1 hunks)
- tilelang/jit/adapter/utils.py (1 hunks)
- tilelang/jit/adapter/wrapper.py (3 hunks)
- tilelang/jit/kernel.py (4 hunks)
- tilelang/utils/target.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
tilelang/jit/adapter/utils.py (1)
  tilelang/language/ast/ir.py (1)
    target (1682-1713)
tilelang/carver/arch/metal.py (2)
  tilelang/carver/arch/arch_base.py (1)
    TileDevice (4-37)
  tilelang/carver/template/base.py (1)
    arch (152-159)
tilelang/jit/adapter/cython/adapter.py (3)
  tilelang/jit/adapter/utils.py (4)
    is_cuda_target (51-52)
    is_hip_target (55-56)
    is_cpu_target (59-60)
    is_metal_target (63-64)
  tilelang/utils/target.py (1)
    determine_target (54-96)
  tilelang/contrib/cc.py (2)
    get_cplus_compiler (68-90)
    is_mac (93-94)
tilelang/jit/adapter/__init__.py (1)
  tilelang/jit/adapter/torch/metal.py (1)
    MetalKernelAdapter (12-67)
tilelang/jit/kernel.py (3)
  tilelang/jit/__init__.py (1)
    jit (241-314)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/jit/adapter/torch/metal.py (1)
    MetalKernelAdapter (12-67)
tilelang/carver/arch/__init__.py (1)
  tilelang/carver/arch/metal.py (1)
    METAL (8-9)
tilelang/jit/__init__.py (2)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/language/ast/ir.py (1)
    target (1682-1713)
testing/python/metal/test_metal_codegen.py (6)
  tilelang/jit/__init__.py (1)
    jit (241-314)
  tilelang/language/allocate.py (2)
    alloc_shared (21-36)
    alloc_fragment (53-64)
  tilelang/language/fill.py (1)
    clear (24-48)
  tilelang/language/pipeline.py (1)
    Pipelined (9-46)
  tilelang/language/copy.py (1)
    copy (84-152)
  tilelang/jit/kernel.py (1)
    kernel_source (470-471)
tilelang/jit/adapter/torch/metal.py (2)
  tilelang/jit/adapter/base.py (2)
    BaseKernelAdapter (8-55)
    _post_init (54-55)
  tilelang/jit/adapter/nvrtc/adapter.py (1)
    _convert_torch_func (240-241)
tilelang/jit/adapter/wrapper.py (4)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/jit/adapter/cython/adapter.py (1)
    lib_code (511-513)
  tilelang/jit/adapter/ctypes/adapter.py (1)
    lib_code (281-283)
  tilelang/jit/adapter/libgen.py (1)
    update_lib_code (53-54)
🪛 Ruff (0.12.2)
tilelang/carver/arch/cpu.py
23-26: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/carver/arch/cuda.py
150-159: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/utils/target.py
87-87: Avoid specifying long messages outside the exception class
(TRY003)
tilelang/carver/arch/metal.py
12-15: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/carver/arch/cdna.py
35-38: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/jit/adapter/__init__.py
5-5: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
6-6: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/carver/arch/__init__.py
2-2: from .cuda import * used; unable to detect undefined names
(F403)
3-3: from .cpu import * used; unable to detect undefined names
(F403)
4-4: from .cdna import * used; unable to detect undefined names
(F403)
5-5: from .metal import * used; unable to detect undefined names
(F403)
22-22: METAL may be undefined, or defined from star imports
(F405)
40-55: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
41-41: is_cpu_arch may be undefined, or defined from star imports
(F405)
42-42: is_cuda_arch may be undefined, or defined from star imports
(F405)
43-43: is_volta_arch may be undefined, or defined from star imports
(F405)
44-44: is_ampere_arch may be undefined, or defined from star imports
(F405)
45-45: is_ada_arch may be undefined, or defined from star imports
(F405)
46-46: is_hopper_arch may be undefined, or defined from star imports
(F405)
47-47: is_tensorcore_supported_precision may be undefined, or defined from star imports
(F405)
48-48: has_mma_support may be undefined, or defined from star imports
(F405)
49-49: is_cdna_arch may be undefined, or defined from star imports
(F405)
50-50: is_metal_arch may be undefined, or defined from star imports
(F405)
51-51: CUDA may be undefined, or defined from star imports
(F405)
52-52: CDNA may be undefined, or defined from star imports
(F405)
53-53: METAL may be undefined, or defined from star imports
(F405)
54-54: CPU may be undefined, or defined from star imports
(F405)
tilelang/jit/adapter/torch/metal.py
44-44: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
tilelang/carver/arch/metal.py (1)
8-9: Confirm TileDevice API expectations for METAL. METAL is currently a stub. If any path calls `TileDevice.get_avaliable_tensorintrin_shapes()`, this will raise NotImplementedError. If that call path can occur, add a trivial override returning `[]` to avoid surprises.
Example (outside this hunk):

```python
class METAL(TileDevice):

    def get_avaliable_tensorintrin_shapes(self):
        return []
```

tilelang/utils/target.py (1)
16-16: LGTM: add 'metal' to `AVALIABLE_TARGETS`. This aligns the public set with the new Metal path.
tilelang/jit/adapter/__init__.py (1)
1-6: LGTM! The addition of `MetalKernelAdapter` to the adapter exports is clean and follows the existing pattern.
tilelang/jit/adapter/utils.py (1)
63-65: LGTM! The `is_metal_target` function follows the established pattern and correctly identifies Metal targets.
tilelang/jit/adapter/cython/adapter.py (2)
156-158: LGTM! Platform-specific linker flags handled correctly. The dynamic linker flag selection based on the platform is appropriate for handling undefined symbols across macOS and Linux.
455-456: LGTM! Correct device mapping for Metal target. The Metal target correctly maps to the "mps" device in PyTorch.
tilelang/jit/adapter/wrapper.py (1)
1068-1069: LGTM! Metal target routing is correctly integrated. The Metal target detection and wrapper selection follows the existing pattern for other backends.
tilelang/jit/adapter/torch/metal.py (1)
61-61: Verify thread calculation logic. The threads calculation multiplies block and grid dimensions element-wise. This seems unusual - typically threads would be just the block dimensions, and the total work would be block * grid.
Could you verify whether this is the correct thread configuration for PyTorch MPS? The typical pattern would be:

- `threads` = block dimensions (threads per group)
- `group_size` = grid dimensions (number of groups)

The current implementation passes `threads=[block[i] * grid[i] for i in range(3)]`, which seems to represent the total threads across all groups.
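To make the two conventions concrete, here is a pure-Python sketch; the names are illustrative, and which convention `torch.mps` expects is exactly what this comment asks to verify:

```python
def launch_dims(block, grid):
    """Return (per_group_threads, total_threads) for a 3D launch.

    block: threads per threadgroup, e.g. from threadIdx extents.
    grid:  number of threadgroups, e.g. from blockIdx extents.
    """
    per_group = list(block)
    # Element-wise product per axis: what the adapter currently passes as `threads`.
    total = [b * g for b, g in zip(block, grid)]
    return per_group, total

per_group, total = launch_dims([16, 16, 1], [64, 64, 1])
# per_group == [16, 16, 1], total == [1024, 1024, 1]
```

If the runtime's `threads` argument means the total dispatch size per axis (like Metal's `dispatchThreads`), the element-wise product is correct; if it means threads per group, only `block` should be passed.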
79-80: Good practice: MPS availability check. The check for MPS availability before running tests is appropriate for Metal backend testing.
```diff
 find_library(TVM_LIBRARY_LOCATION
   NAMES tvm
   HINTS "${TVM_PREBUILD_PATH}"
 )
 set_target_properties(tvm PROPERTIES
-  IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm.so"
+  IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```
Add error handling for missing TVM libraries
The find_library commands should check if the libraries were actually found before proceeding.
```diff
 find_library(TVM_LIBRARY_LOCATION
   NAMES tvm
   HINTS "${TVM_PREBUILD_PATH}"
 )
+if(NOT TVM_LIBRARY_LOCATION)
+  message(FATAL_ERROR "Could not find TVM library in ${TVM_PREBUILD_PATH}")
+endif()
 set_target_properties(tvm PROPERTIES
   IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cmake
find_library(TVM_LIBRARY_LOCATION
  NAMES tvm
  HINTS "${TVM_PREBUILD_PATH}"
)
if(NOT TVM_LIBRARY_LOCATION)
  message(FATAL_ERROR "Could not find TVM library in ${TVM_PREBUILD_PATH}")
endif()
set_target_properties(tvm PROPERTIES
  IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
  INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
)
```
🤖 Prompt for AI Agents
In CMakeLists.txt around lines 102 to 109, the find_library call for TVM does
not verify that TVM_LIBRARY_LOCATION was found; add a check after find_library
such as if(NOT TVM_LIBRARY_LOCATION) message(FATAL_ERROR "Could not find tvm
library in ${TVM_PREBUILD_PATH}; please set TVM_PREBUILD_PATH or install tvm")
to stop configuration when missing, and only call set_target_properties for the
imported tvm target if the library path exists (or create the imported target
conditionally); ensure the error message includes the searched path for easier
debugging.
```diff
 find_library(TVM_RUNTIME_LIBRARY_LOCATION
   NAMES tvm_runtime
   HINTS "${TVM_PREBUILD_PATH}"
 )
 set_target_properties(tvm_runtime PROPERTIES
-  IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm_runtime.so"
+  IMPORTED_LOCATION "${TVM_RUNTIME_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```
Add error handling for missing TVM runtime library
Similar to the TVM library, the runtime library location should be validated.
```diff
 find_library(TVM_RUNTIME_LIBRARY_LOCATION
   NAMES tvm_runtime
   HINTS "${TVM_PREBUILD_PATH}"
 )
+if(NOT TVM_RUNTIME_LIBRARY_LOCATION)
+  message(FATAL_ERROR "Could not find TVM runtime library in ${TVM_PREBUILD_PATH}")
+endif()
 set_target_properties(tvm_runtime PROPERTIES
   IMPORTED_LOCATION "${TVM_RUNTIME_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```

🤖 Prompt for AI Agents
In CMakeLists.txt around lines 111 to 118, the find_library result for
TVM_RUNTIME_LIBRARY_LOCATION is not validated; add the same existence check and
failure handling used for the TVM library: after find_library, test if
TVM_RUNTIME_LIBRARY_LOCATION is NOTFOUND (or empty) and call message(FATAL_ERROR
"...unable to locate tvm_runtime library at ${TVM_PREBUILD_PATH}..." ) so
configuration stops with a clear error; only set_target_properties for
tvm_runtime and INTERFACE_INCLUDE_DIRECTORIES when the library was successfully
found (mirror the pattern used for the TVM library block).
```python
if execution_backend in ("dlpack", "torch"):
    self.logger.warning("DLPack or torch backend does not support cache saving to disk.")
else:
```
🛠️ Refactor suggestion
Include 'torch' in type hints and clarify log.
Execution backend Literal types in this file exclude 'torch' (class attr and method params). This will trip type-checkers now that you gate disk-cache on 'torch'.
Apply minimal log tweak here:

```diff
-        if execution_backend in ("dlpack", "torch"):
-            self.logger.warning("DLPack or torch backend does not support cache saving to disk.")
+        if execution_backend in ("dlpack", "torch"):
+            self.logger.warning("Backends 'dlpack' and 'torch' do not support saving cache to disk.")
```

Also update type hints elsewhere in this file (outside the changed hunk):
```python
# line ~42
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython"
# line ~71
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
# line ~117
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
# line ~323
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
```

🤖 Prompt for AI Agents
In tilelang/cache/kernel_cache.py around lines 189-191, the warning message and
the execution_backend type hints omit 'torch', causing type-check failures and
unclear logs; update the warning to mention "DLPack or torch backend does not
support saving cache to disk" (or similar clearer phrasing) and add "torch" to
the Literal union type hints at the listed locations: ~42, ~71, ~117, ~323 so
those class attributes and method params read Literal["dlpack", "ctypes",
"cython", "nvrtc", "torch"] = "cython" (or the equivalent default), ensuring
consistency across the file.
```python
from .cuda import *
from .cpu import *
from .cdna import *
from .metal import *
```
🛠️ Refactor suggestion
Wildcard imports reduce code clarity and maintainability
The wildcard imports make it difficult to track which symbols come from which module and can lead to namespace pollution. Consider explicit imports for better maintainability.
🧰 Tools
🪛 Ruff (0.12.2)
2-2: from .cuda import * used; unable to detect undefined names
(F403)
3-3: from .cpu import * used; unable to detect undefined names
(F403)
4-4: from .cdna import * used; unable to detect undefined names
(F403)
5-5: from .metal import * used; unable to detect undefined names
(F403)
🤖 Prompt for AI Agents
In tilelang/carver/arch/__init__.py around lines 2 to 5, the module currently
uses wildcard imports (from .cuda import * etc.), which pollutes the package
namespace and obscures which symbols are exported; replace these with explicit
imports of the public symbols you need from each submodule (or import the
modules themselves and re-export a curated __all__ list) so callers see a clear
API and accidental name collisions are avoided. Ensure you list each
function/class/constant you want to expose (or import submodules as namespaced
attributes) and update or add __all__ to reflect the exported names.
```python
if is_metal_target(target):
    assert execution_backend == 'torch', 'Currently metal target only support `tl.jit(execution_backend="torch")`'
```
Fix inconsistent backend naming
The assertion expects 'torch' backend but the public API and documentation use 'nvrtc'. This inconsistency is confusing.
The error message should clarify that Metal targets require the "nvrtc" backend (which internally uses torch for Metal):
```diff
 if is_metal_target(target):
-    assert execution_backend == 'torch', 'Currently metal target only support `tl.jit(execution_backend="torch")`'
+    assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'
```
+ assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'📝 Committable suggestion
```python
if is_metal_target(target):
    assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'
```
🤖 Prompt for AI Agents
In tilelang/jit/__init__.py around lines 78-79, the assertion currently checks
for execution_backend == 'torch' but the public API/doc use 'nvrtc'; change the
assertion to require 'nvrtc' instead and update the assertion message to state
that Metal targets require the "nvrtc" backend (which internally uses torch) so
the message aligns with public API and avoids confusion.
```python
self.kernel_name = func_or_mod.__name__ + '_kernel'
self.verbose = verbose
```
Potential issue with kernel name generation.
The kernel name is generated by appending '_kernel' to the function name. This assumes func_or_mod has a __name__ attribute, which may not be true for all IRModule instances.
```diff
         self.kernel_global_source = kernel_global_source
-        self.kernel_name = func_or_mod.__name__ + '_kernel'
+        if hasattr(func_or_mod, '__name__'):
+            self.kernel_name = func_or_mod.__name__ + '_kernel'
+        elif hasattr(func_or_mod, 'get_global_vars'):
+            # Handle IRModule case
+            global_vars = func_or_mod.get_global_vars()
+            if global_vars:
+                self.kernel_name = global_vars[0].name_hint + '_kernel'
+            else:
+                raise ValueError("Cannot determine kernel name from IRModule")
+        else:
+            raise ValueError(f"Cannot determine kernel name from {type(func_or_mod)}")
         self.verbose = verbose
```

📝 Committable suggestion
```python
self.kernel_global_source = kernel_global_source
if hasattr(func_or_mod, '__name__'):
    self.kernel_name = func_or_mod.__name__ + '_kernel'
elif hasattr(func_or_mod, 'get_global_vars'):
    # Handle IRModule case
    global_vars = func_or_mod.get_global_vars()
    if global_vars:
        self.kernel_name = global_vars[0].name_hint + '_kernel'
    else:
        raise ValueError("Cannot determine kernel name from IRModule")
else:
    raise ValueError(f"Cannot determine kernel name from {type(func_or_mod)}")
self.verbose = verbose
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around lines 28-29, kernel_name is set by
blindly using func_or_mod.__name__ which can fail for IRModule or other objects
without a __name__; change to safely derive a base name: try
getattr(func_or_mod, "__name__", None), then getattr(func_or_mod, "name", None),
and if both are missing fall back to a stable fallback such as
f"{type(func_or_mod).__name__}_{id(func_or_mod):x}"; sanitize the base to be a
valid identifier if needed and then set self.kernel_name = f"{base}_kernel".
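One way to implement the fallback described above is a small pure-Python helper. This is a sketch; `derive_kernel_name` is a hypothetical name, not code from the PR:

```python
import re

def derive_kernel_name(func_or_mod) -> str:
    # Prefer __name__, then a .name attribute, then a stable synthetic base.
    base = getattr(func_or_mod, "__name__", None) or getattr(func_or_mod, "name", None)
    if base is None:
        base = f"{type(func_or_mod).__name__}_{id(func_or_mod):x}"
    # Sanitize to a valid Python identifier.
    base = re.sub(r"\W", "_", str(base))
    if base[0].isdigit():
        base = "_" + base
    return base + "_kernel"
```

The synthetic fallback keeps the adapter usable with objects that expose neither attribute, at the cost of a non-deterministic name across runs.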
```python
for var, func in device_mod.functions.items():
    assert var.name_hint == self.kernel_name
    thread_extent = func.attrs['thread_extent']
    for tag, extent in thread_extent.items():
        if "threadIdx" in tag:
            self.block_info["xyz".index(tag[-1])] = extent
        elif "blockIdx" in tag:
            self.grid_info["xyz".index(tag[-1])] = extent
    break
else:
    raise AssertionError(f'no kernel with name {func_or_mod.__name__}')
```
Fix critical issue: iteration over device module functions doesn't handle multiple functions correctly.
The loop breaks after processing the first function regardless of whether it matches kernel_name. This could silently process the wrong kernel if multiple functions exist.
```diff
 for var, func in device_mod.functions.items():
-    assert var.name_hint == self.kernel_name
-    thread_extent = func.attrs['thread_extent']
-    for tag, extent in thread_extent.items():
-        if "threadIdx" in tag:
-            self.block_info["xyz".index(tag[-1])] = extent
-        elif "blockIdx" in tag:
-            self.grid_info["xyz".index(tag[-1])] = extent
-    break
+    if var.name_hint == self.kernel_name:
+        thread_extent = func.attrs.get('thread_extent', {})
+        for tag, extent in thread_extent.items():
+            if "threadIdx" in tag:
+                self.block_info["xyz".index(tag[-1])] = extent
+            elif "blockIdx" in tag:
+                self.grid_info["xyz".index(tag[-1])] = extent
+        break
 else:
-    raise AssertionError(f'no kernel with name {func_or_mod.__name__}')
+    raise AssertionError(f'No kernel with name {self.kernel_name} found in device module')
```

📝 Committable suggestion
```python
for var, func in device_mod.functions.items():
    if var.name_hint == self.kernel_name:
        thread_extent = func.attrs.get('thread_extent', {})
        for tag, extent in thread_extent.items():
            if "threadIdx" in tag:
                self.block_info["xyz".index(tag[-1])] = extent
            elif "blockIdx" in tag:
                self.grid_info["xyz".index(tag[-1])] = extent
        break
else:
    raise AssertionError(f'No kernel with name {self.kernel_name} found in device module')
```
🧰 Tools
🪛 Ruff (0.12.2)
44-44: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around lines 34 to 45, the loop over
device_mod.functions currently asserts the first function has the desired
kernel_name and breaks unconditionally, which can process the wrong function
when multiple functions exist; change the logic to search for a function whose
var.name_hint equals self.kernel_name (don't assert immediately), extract its
thread_extent and populate block_info/grid_info, break when found, and after the
loop raise an AssertionError if no matching kernel was found (use
self.kernel_name in the error message) so only the correct kernel is processed.
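The fix above relies on Python's `for ... else` clause: the `else` body runs only when the loop completes without hitting `break`. A self-contained sketch of the search, with plain dicts standing in for the TVM device module and function attributes:

```python
def find_thread_extents(functions, kernel_name):
    """Locate the named kernel and split its thread extents into block/grid."""
    block, grid = [1, 1, 1], [1, 1, 1]
    for name, attrs in functions.items():
        if name == kernel_name:
            for tag, extent in attrs.get("thread_extent", {}).items():
                axis = "xyz".index(tag[-1])  # tags end in .x/.y/.z
                if "threadIdx" in tag:
                    block[axis] = extent
                elif "blockIdx" in tag:
                    grid[axis] = extent
            break
    else:
        # Runs only if the loop completed without `break`, i.e. no match.
        raise AssertionError(f"no kernel named {kernel_name!r} in device module")
    return block, grid
```

Unlike the original, this skips non-matching entries instead of asserting on the first one, so extra functions in the module are harmless.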
```python
if self._kernel is None:
    _kernel = getattr(torch.mps.compile_shader(self.kernel_global_source), self.kernel_name)
```
🛠️ Refactor suggestion
Add error handling for shader compilation.
The shader compilation could fail but there's no error handling. Also, the compiled shader might not contain the expected kernel.
```diff
-_kernel = getattr(torch.mps.compile_shader(self.kernel_global_source), self.kernel_name)
+try:
+    compiled_shader = torch.mps.compile_shader(self.kernel_global_source)
+    if not hasattr(compiled_shader, self.kernel_name):
+        raise AttributeError(f"Compiled shader does not contain kernel '{self.kernel_name}'")
+    _kernel = getattr(compiled_shader, self.kernel_name)
+except Exception as e:
+    raise RuntimeError(f"Failed to compile Metal shader: {e}")
```
+ raise RuntimeError(f"Failed to compile Metal shader: {e}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        try:
            compiled_shader = torch.mps.compile_shader(self.kernel_global_source)
            if not hasattr(compiled_shader, self.kernel_name):
                raise AttributeError(f"Compiled shader does not contain kernel '{self.kernel_name}'")
            _kernel = getattr(compiled_shader, self.kernel_name)
        except Exception as e:
            raise RuntimeError(f"Failed to compile Metal shader: {e}")
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around line 54, wrap the call to
torch.mps.compile_shader and subsequent attribute access in a try/except and
validate the returned object contains the expected kernel; specifically, catch
exceptions from compile_shader, raise or log a clear error that includes the
original exception and the kernel name/source context, and after compilation
check with hasattr or explicit lookup that the compiled object exposes
self.kernel_name, raising a descriptive error if missing so callers know
compilation succeeded but the kernel was not found.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tilelang/jit/adapter/__init__.py (1)
5-6: Remove the unused `# noqa: F401` (RUF100) on the new exports. Ruff flags these as unused because F401 isn’t enabled; just drop the directives.

```diff
-from .nvrtc import NVRTCKernelAdapter  # noqa: F401
-from .torch import MetalKernelAdapter  # noqa: F401
+from .nvrtc import NVRTCKernelAdapter
+from .torch import MetalKernelAdapter
```

Optional: make exports explicit to avoid needing noqa anywhere and document the public API.

```python
# add near the top of this module
__all__ = [
    "BaseKernelAdapter",
    "TorchDLPackKernelAdapter",
    "CtypesKernelAdapter",
    "CythonKernelAdapter",
    "NVRTCKernelAdapter",
    "MetalKernelAdapter",
]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 0eba92f6ad86fca30149d937543b01069784eafd and 7514cf545a9b9ec3d999fc1b1f6bd098397d19e1.
📒 Files selected for processing (2)
- tilelang/jit/adapter/__init__.py (1 hunks)
- tilelang/jit/adapter/torch/__init__.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tilelang/jit/adapter/torch/__init__.py
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/jit/adapter/__init__.py (2)
tilelang/jit/adapter/nvrtc/adapter.py (1)
- `NVRTCKernelAdapter` (30-246)

tilelang/jit/adapter/torch/metal.py (1)
MetalKernelAdapter(12-67)
🪛 Ruff (0.12.2)
tilelang/jit/adapter/__init__.py
5-5: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
6-6: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
Removed external library DLL names for macOS; found during tile-ai/tilelang#799. cc @LeiWang1999
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
setup.py (3)
345-347: Wheel may miss runtime libs: include `tilelang/lib/*.so` and `tilelang/lib/*.dylib` in package_data. Without this, copied libs might not land in wheels/sdists.

```diff
 package_data = {
-    "tilelang": ["py.typed", "*pyx"],
+    "tilelang": ["py.typed", "*pyx", "lib/*.so", "lib/*.dylib"],
 }
```

805-807: Do not pass CMAKE_CUDA_COMPILER unless CUDA is enabled. On Metal-only/mac builds, this inserts a bogus nvcc path and breaks configure.

```diff
-    if not USE_ROCM:
-        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
+    if USE_CUDA and CUDA_HOME:
+        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
```

826-839: Respect the USE_CUDA toggle in config.cmake generation. Current logic enables CUDA whenever CUDA_HOME is set, even if USE_CUDA=False.

```diff
-    elif CUDA_HOME:
+    elif USE_CUDA and CUDA_HOME:
         content_lines += [
             f"set(USE_CUDA {CUDA_HOME})",
             "set(USE_ROCM OFF)",
         ]
+    else:
+        content_lines += [
+            "set(USE_CUDA OFF)",
+            "set(USE_ROCM OFF)",
+        ]
```
🧹 Nitpick comments (9)
setup.py (9)
48-50: Harden Metal detection (Rosetta/process-arch pitfalls). Use Darwin + machine for reliability.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+IS_DARWIN = platform.system() == 'Darwin'
+MAYBE_METAL = IS_DARWIN and platform.machine() == 'arm64'
```

52-61: Defaulting USE_CUDA=True on Linux is user-hostile (it raises when CUDA_HOME is missing). Consider defaulting to False, or detecting from CUDA_HOME post-import, to avoid hard failures on non-GPU Linux dev boxes.

If you want a minimal tweak now, set the default to False:

```diff
-USE_CUDA = _read_bool_env("USE_CUDA", IS_LINUX and not USE_ROCM)
+USE_CUDA = _read_bool_env("USE_CUDA", False)
```

99-103: Tighten exception text (Ruff TRY003) and clarity. Shorten the message; keep the actionable content.

```diff
-    raise ValueError(
-        "CUDA support is enabled by default on linux if `USE_ROCM=False`,"
-        " but CUDA_HOME is not set or detected.")
+    raise ValueError("USE_CUDA=True but CUDA_HOME is not set or detected.")
```

105-108: Same as above: shorten/standardize the error message.

```diff
-    raise ValueError(
-        "Failed to automatically detect CUDA or ROCM installation. Please set the CUDA_HOME or ROCM_HOME environment variable manually (e.g., export CUDA_HOME=/usr/local/cuda or export ROCM_HOME=/opt/rocm)."
-    )
+    raise ValueError("Neither CUDA_HOME nor ROCM_HOME detected on Linux; set one explicitly.")
```

482-489: Guard dSYM deletion. Use an isdir check to avoid an accidental rmtree on files with matching names.

```diff
-    for dir in potential_dirs:
-        candidate = os.path.join(dir, item)
-        if os.path.exists(candidate):
-            shutil.rmtree(candidate)
-            break
+    for d in potential_dirs:
+        candidate = os.path.join(d, item)
+        if os.path.isdir(candidate):
+            shutil.rmtree(candidate)
+            break
```

664-664: Copy-from-glob likely no-ops in non-inplace builds. CMake outputs to extdir/build_temp, not CWD. Either search extdir/build_temp or drop this block (TileLangBuilPydCommand already handles copying).

270-273: Handle missing CMake gracefully (None from shutil.which). `os.path.exists(None)` raises a TypeError; raise a clear FileNotFoundError instead.

```diff
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
-        raise Exception("CMake is not installed, please install it first.")
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
+        raise FileNotFoundError("CMake is not installed or not on PATH.")
```

149-168: Make ROCm version detection robust if rocminfo is missing. Without try/except, subprocess errors abort the build even when the cmake file can be parsed.

```diff
 def get_rocm_version():
     """Get the ROCM version from rocminfo."""
-    rocm_output = subprocess.check_output(["rocminfo"], universal_newlines=True)
-    # Parse ROCM version from output
-    # Example output: ROCM version: x.y.z-...
-    match = re.search(r'ROCm Version: (\d+\.\d+\.\d+)', rocm_output)
-    if match:
-        return Version(match.group(1))
-    else:
+    try:
+        rocm_output = subprocess.check_output(["rocminfo"], universal_newlines=True)
+        match = re.search(r'ROCm Version: (\d+\.\d+\.\d+)', rocm_output)
+        if match:
+            return Version(match.group(1))
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        pass
     rocm_path = os.environ.get("ROCM_PATH", "/opt/rocm")
     rocm_version_file = os.path.join(rocm_path, "lib", "cmake", "rocm", "rocm-config-version.cmake")
     if os.path.exists(rocm_version_file):
         with open(rocm_version_file, "r") as f:
             content = f.read()
         match = re.search(r'set\(PACKAGE_VERSION "(\d+\.\d+\.\d+)"', content)
         if match:
             return Version(match.group(1))
     # return a default
     return Version("5.0.0")
```

882-895: Update PyPI metadata for Metal/macOS builds. Expose macOS in platforms/classifiers when USE_METAL is on; optionally add/remove the CUDA/ROCm environment classifiers based on the toggles.
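The toggle-driven metadata selection suggested above can be sketched as follows (an illustration only; `use_metal`/`use_cuda`/`use_rocm` mirror the setup.py toggles, and the helper name is hypothetical, not existing code):

```python
def select_platform_metadata(use_metal: bool, use_cuda: bool, use_rocm: bool):
    """Derive PyPI platforms/classifier entries from build toggles (hypothetical helper)."""
    platforms = []
    classifiers = ["Operating System :: OS Independent"]
    if use_cuda:
        platforms.append("Environment :: GPU :: NVIDIA CUDA")
    if use_rocm:
        platforms.append("Environment :: GPU :: AMD ROCm")
    if use_metal:
        # Advertise macOS when the Metal backend is built
        platforms.append("Operating System :: MacOS")
        classifiers.append("Operating System :: MacOS :: MacOS X")
    else:
        platforms.append("Operating System :: POSIX :: Linux")
    return platforms, classifiers
```

The same `setup(...)` call could then consume these lists instead of hard-coded values.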
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 7514cf545a9b9ec3d999fc1b1f6bd098397d19e1 and e1e18873a51c22fce0d9495e989c7253e871f99f.
📒 Files selected for processing (1)
setup.py(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.12.2)
setup.py
100-102: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (1)
setup.py (1)
390-392: Linux-only rpath patch exit is correct. The early return avoids patchelf on macOS. LGTM.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
setup.py (3)
808-811: Fix unconditional CUDA compiler flag (breaks Metal/macOS and CUDA-less builds).
`-DCMAKE_CUDA_COMPILER` is appended whenever ROCm is disabled, even if CUDA is not requested or `CUDA_HOME` is unset. On macOS/Metal this can crash with a TypeError (joining a path with `None`), and on Linux CPU/ROCm builds it forces CUDA. Apply:

```diff
-    if not USE_ROCM:
-        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
+    # Only set CUDA compiler when CUDA is explicitly enabled and detected
+    if USE_CUDA and CUDA_HOME:
+        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
```

667-676: Editable install: also copy .so on macOS. Python C-extensions on macOS commonly end with .so; the current logic only grabs .dylib on macOS, causing missing libs in editable mode.

```diff
-    files = glob.glob("*.so" if IS_LINUX else "*.dylib")
+    files = sorted(set(glob.glob("*.so") + glob.glob("*.dylib")))
```

269-277: Handle a missing CMake path; avoid a TypeError on `os.path.exists(None)`.

`shutil.which("cmake")` may return None; calling `os.path.exists(None)` raises a TypeError.

```diff
 def get_cmake_path() -> str:
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```
♻️ Duplicate comments (1)
setup.py (1)
36-44: Boolean env parsing still brittle (whitespace and yes/no not handled; empty string coerced to False). You addressed the earlier issue but missed `.strip()` and common synonyms, and you now treat an empty string as False rather than “unset → default”. Apply:

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ("1", "true", "on", "yes", "y"):
+        return True
+    if s in ("0", "false", "off", "no", "n"):
+        return False
+    return default
```
🧹 Nitpick comments (4)
setup.py (4)
51-53: Make Metal default detection more robust. Use an explicit Darwin + machine check; `platform.mac_ver()[2]` is brittle and can vary.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = (platform.system() == 'Darwin' and platform.machine() == 'arm64')
```

829-843: When enabling Metal, explicitly disable CUDA/ROCm in config.cmake for clarity. This prevents an accidental dual-backend config if upstream defaults change.

```diff
-    if USE_METAL:
-        content_lines += [
-            "set(USE_METAL ON)",
-        ]
+    if USE_METAL:
+        content_lines += [
+            "set(USE_METAL ON)",
+            "set(USE_CUDA OFF)",
+            "set(USE_ROCM OFF)",
+        ]
```

893-898: Reflect macOS support in the Trove classifiers. Now that Metal/macOS is supported, advertise it.

```diff
     "Programming Language :: Python :: 3.8",
     "License :: OSI Approved :: MIT License",
-    "Operating System :: OS Independent",
+    "Operating System :: OS Independent",
+    "Operating System :: MacOS :: MacOS X",
     "Intended Audience :: Developers",
     "Intended Audience :: Science/Research",
```

102-111: Minor: tighten error messages and conditions (style). Long exception messages trigger Ruff TRY003; consider shorter texts or helper constants. Also consider allowing Linux CPU-only builds in the future by gating the “require CUDA/ROCM” check behind an explicit flag.
If you want, I can open a follow-up to align linter expectations and make the Linux GPU requirement opt-in via an env (e.g., REQUIRE_GPU=1).
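The suggested opt-in gate could look roughly like this (the `REQUIRE_GPU` variable is the reviewer's proposal, not an existing setup.py flag; the helper is a sketch, not the implementation):

```python
import os

def gpu_toolchain_required(default: bool = False) -> bool:
    """Return True only when the user explicitly opts in via REQUIRE_GPU.

    Hypothetical helper sketching the proposed follow-up; not in setup.py today.
    """
    val = os.environ.get("REQUIRE_GPU")
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "on", "yes")
```

setup.py would then raise on a missing CUDA/ROCm install only when this returns True, letting CPU-only Linux builds proceed.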
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e1e18873a51c22fce0d9495e989c7253e871f99f and cc91c3e22afffbb80f1dbec199fd55d91748adc5.
📒 Files selected for processing (1)
setup.py(6 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
tilelang/carver/arch/metal.py (1)
17-20: Sort `__all__` to satisfy Ruff RUF022.

```diff
 __all__ = [
-    'is_metal_arch',
-    'METAL',
+    'METAL',
+    'is_metal_arch',
 ]
```

.github/workflows/metal_ci.yml (2)
12-14: Tighten permissions and fix the trailing space.

`contents: write` isn’t needed; this also removes the trailing space flagged by yamllint.

```diff
 permissions:
-  contents: write
+  contents: read
```
96-97: Avoid double installation of the project. You install in “Ensure venv” and again in “Install project.” Keep one to reduce time and flakiness.

```diff
-          pip install .
-          touch "$MARKER"
+          touch "$MARKER"
```

(Keep the dedicated “Install project” step.)
Also applies to: 100-104
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between cc91c3e22afffbb80f1dbec199fd55d91748adc5 and 12ca66077563253c09d1530c948b9fcf28cbbdd1.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/carver/arch/metal.py (2)
tilelang/language/ast/ir.py (1)
- `target` (1682-1713)

tilelang/carver/arch/arch_base.py (1)

- `TileDevice` (4-37)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
tilelang/carver/arch/metal.py
17-20: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (2)
tilelang/carver/arch/metal.py (1)
6-7: Helper reads well.
`is_metal_arch` is straightforward and correct.

.github/workflows/metal_ci.yml (1)
1-1: Rename workflow and use Metal CMake flag.
CMakeLists.txt contains a USE_METAL guard — switch `-DUSE_ROCM=ON` to `-DUSE_METAL=ON` and rename the workflow.

```diff
-name: CI Test on AMD
+name: Metal CI
```

```diff
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
+        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
```python
    def __init__(self, target: Target | str):
        if isinstance(target, str):
            target = Target(target)
        self.target = target
```
Call the base initializer to populate TileDevice fields.
Without super().__init__(), METAL instances won't have the base attributes (reg_cap, smem_cap, etc.), which can break consumers expecting them.
Apply:
```diff
 class METAL(TileDevice):
     def __init__(self, target: Target | str):
+        super().__init__()
         if isinstance(target, str):
             target = Target(target)
         self.target = target
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def __init__(self, target: Target | str):
        super().__init__()
        if isinstance(target, str):
            target = Target(target)
        self.target = target
```
🤖 Prompt for AI Agents
In tilelang/carver/arch/metal.py around lines 11 to 15, the METAL class __init__
does not call the base class initializer so base TileDevice fields (e.g.,
reg_cap, smem_cap) are not populated; update the constructor to call
super().__init__(target) (after converting a string target to Target if needed)
so the parent initializer runs and base attributes are set, ensuring the
instance has the expected fields.
Removed external library DLL names for macOS; found during tile-ai/tilelang#799. cc @LeiWang1999
Actionable comments posted: 0
♻️ Duplicate comments (8)
tilelang/carver/arch/metal.py (1)
11-14: Call `TileDevice.__init__` to populate base fields (duplicate of prior feedback). Without `super().__init__()`, instances miss required attributes (reg_cap, smem_cap, etc.), leading to runtime errors when accessed. Apply:

```diff
 class METAL(TileDevice):
     def __init__(self, target: Target | str):
+        super().__init__()
         if isinstance(target, str):
             target = Target(target)
         self.target = target
```

.github/workflows/metal_ci.yml (7)
21-25: Update setup-python to v5 (v2 is deprecated on GH runners). actionlint flags v2 as too old; move to v5.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

29-31: Fix hashing on macOS: use `shasum -a 256`. `sha256sum` isn’t available on macOS.

```diff
-          REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
           MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"
```

40-43: Remove the invalid pip flag `--no-user`. pip doesn’t support `--no-user` here (even in venvs).

```diff
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
-          [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+          [[ -f requirements-test.txt ]] && \
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
```

75-79: Update setup-python to v5 here as well.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

83-85: Use a stable requirements hash (shasum) instead of `cat`. `cat` doesn’t produce a fixed-length digest and weakens cache keys.

```diff
-          REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
           MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"
```

93-97: Remove the invalid `--no-user` and avoid double install. Fix the flags; also consider skipping the early install since you reinstall later.

```diff
           source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
           [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
-          pip install . --no-user
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
+          # Defer project install to the dedicated step below
```

100-104: Remove `--no-user` in the wheel install step.

```diff
-          pip install . --no-user -v
+          pip install . -v
```
🧹 Nitpick comments (6)
tilelang/carver/arch/metal.py (3)
17-20: Sort `__all__` to satisfy Ruff RUF022.

```diff
 __all__ = [
-    'is_metal_arch',
-    'METAL',
+    'METAL',
+    'is_metal_arch',
 ]
```

11-14: Optionally validate that the Target kind is metal. This prevents accidental misuse with a non-Metal TVM target.

```diff
     def __init__(self, target: Target | str):
         super().__init__()
         if isinstance(target, str):
             target = Target(target)
+        # Ensure we were given a Metal target
+        if getattr(getattr(target, "kind", None), "name", None) != "metal":
+            raise ValueError(f"METAL requires a TVM Target(kind='metal'), got: {getattr(getattr(target, 'kind', None), 'name', None)}")
         self.target = target
```

Please confirm the TVM version exposes `target.kind.name`; if not, we can adapt the check.

9-15: Confirm whether `get_avaliable_tensorintrin_shapes` is required for Metal paths. If it is invoked downstream, not overriding it will raise NotImplementedError from the base class.

Proposed minimal safe stub:

```python
class METAL(TileDevice):
    def __init__(self, target: Target | str):
        super().__init__()
        if isinstance(target, str):
            target = Target(target)
        self.target = target

    def get_avaliable_tensorintrin_shapes(self):
        # Metal/MPS: no tensor intrinsics exposed yet
        return []
```

.github/workflows/metal_ci.yml (3)

12-14: Trim the trailing space and drop the unnecessary write permission. Set minimal perms and remove the trailing space flagged by yamllint.

```diff
 permissions:
-  contents: write
+  contents: read
```

10-10: Pin the runner to Apple Silicon if you require MPS/Metal. macos-latest can drift; pin to macos-14 (Apple Silicon) to ensure MPS availability.

```diff
-    runs-on: [macos-latest]
+    runs-on: macos-14
```

(Apply to both jobs.)
Also applies to: 63-63
26-45: Rethink venv “persistence”; use actions/cache instead. runner.tool_cache doesn’t persist across jobs/runs on hosted macOS. Use actions/cache keyed by the Python version and the requirements hash.

Example step to add before “Ensure venv”:

```yaml
- name: Cache venv
  uses: actions/cache@v4
  with:
    path: ${{ runner.tool_cache }}/${{ env.VENV_DIR }}
    key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements*.txt', 'pyproject.toml', 'setup.cfg', 'setup.py') }}
```

Then create the venv if missing and install deps; drop the custom MARKER logic.
Also applies to: 80-98
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 12ca66077563253c09d1530c948b9fcf28cbbdd1 and d546487de7e3a2031c9106e89210669f666bbccd.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/carver/arch/metal.py (2)
tilelang/carver/arch/arch_base.py (1)
- `TileDevice` (4-37)

tilelang/carver/template/base.py (1)

- `arch` (152-159)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
tilelang/carver/arch/metal.py
17-20: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (3)
tilelang/carver/arch/metal.py (1)
5-6: Helper looks good.

.github/workflows/metal_ci.yml (2)
51-51: Drop the ROCm flag in the Metal CI configure step. This job targets Metal; forcing `-DUSE_ROCM=ON` on macOS is likely wrong and may fail. Prefer no ROCm flag (or a Metal-specific one if your CMake supports it).

```diff
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
+        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON; cd ..
```

If your build supports it, consider adding `-DUSE_METAL=ON`.
105-111: pytest-timeout declared — no change required. pytest-timeout is listed in requirements-test.txt (line 16); `--timeout` is supported, so keep the workflow as-is.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
setup.py (2)
269-276: Fix a possible TypeError when CMake isn’t installed. shutil.which can return None; os.path.exists(None) crashes.

```diff
 def get_cmake_path() -> str:
     """Return the path to the CMake compiler. """
-    # found which cmake is used
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```

829-844: Ensure CUDA is disabled when building for Metal. Without explicitly setting USE_CUDA OFF here, CMake template defaults could enable CUDA inadvertently.

```diff
     if USE_METAL:
         content_lines += [
             "set(USE_METAL ON)",
             "set(USE_ROCM OFF)",
+            "set(USE_CUDA OFF)",
         ]
```
♻️ Duplicate comments (8)
setup.py (1)
36-43: Boolean env parsing still wrong; empty/unknown values fall back to the default, and the `''` branch is unreachable. Normalize and handle both true and false tokens (including the empty string). Current logic can flip GPU toggles unexpectedly.

Apply:

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
```

.github/workflows/metal_ci.yml (7)

93-97: Remove `--no-user` and avoid duplicate installs. Install requirements here; leave the project install to the dedicated step below.

```diff
           source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
           [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
-          pip install . --no-user
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
+          # Defer project install to the dedicated step below
```
21-25: Upgrade setup-python to v5. v2 is deprecated and fails on current runners.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

76-79: Upgrade setup-python to v5 (second job).

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

29-31: `sha256sum` isn’t available on macOS runners; use `shasum -a 256`.

```diff
-          REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

83-85: Consistently hash requirements; don’t use `cat`.

```diff
-          REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

40-43: Drop the invalid pip flag `--no-user`. This option doesn’t exist and can fail even inside venvs.

```diff
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
-          [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+          [[ -f requirements-test.txt ]] && \
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
```

101-104: Remove `--no-user` (wheel install step).

```diff
-          pip install . --no-user -v
+          pip install . -v
```
🧹 Nitpick comments (5)
setup.py (4)
51-53: More robust Metal default detection. Prefer platform checks that don’t rely on mac_ver quirks.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = (platform.system() == 'Darwin' and platform.machine() == 'arm64')
```

102-106: Shorten the exception message (Ruff TRY003) and keep details in logs. Long inline messages trigger TRY003 and add noise.

```diff
-if USE_CUDA and not CUDA_HOME:
-    raise ValueError(
-        "CUDA support is enabled by default on linux if `USE_ROCM=False`, but CUDA_HOME is not set or detected.")
+if USE_CUDA and not CUDA_HOME:
+    raise ValueError("CUDA enabled but CUDA_HOME not set/detected.")
```

463-484: Use logger.warning instead of “WARNING:” via info. This preserves severity and integrates with log filters.

```diff
-    logger.info(f"WARNING: {item} not found in any expected directories!")
+    logger.warning(f"{item} not found in any expected directories!")
```

886-889: Update metadata to reflect macOS/Metal support. Expose the correct OS in metadata; remove the Linux-only bias when targeting Metal.

```diff
-    platforms=[
-        "Environment :: GPU :: NVIDIA CUDA" if not USE_ROCM else "Environment :: GPU :: AMD ROCm",
-        "Operating System :: POSIX :: Linux",
-    ],
+    platforms=[
+        ("Environment :: GPU :: NVIDIA CUDA" if USE_CUDA else
+         "Environment :: GPU :: AMD ROCm" if USE_ROCM else
+         "Environment :: GPU"),
+        "Operating System :: MacOS :: MacOS X" if USE_METAL else "Operating System :: POSIX :: Linux",
+    ],
```

```diff
-    "Operating System :: OS Independent",
+    ("Operating System :: MacOS :: MacOS X" if USE_METAL else "Operating System :: POSIX :: Linux"),
```

Also applies to: 893-900
.github/workflows/metal_ci.yml (1)
13-13: Trailing space. Minor YAML lint fix (removes the trailing space after `contents: write`).

```diff
-  contents: write 
+  contents: write
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between d546487de7e3a2031c9106e89210669f666bbccd and 5b306179cb8b24f66118ff9847bad7a1847e82ff.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
🔇 Additional comments (1)
setup.py (1)
108-111: Linux-only accelerator requirement is fine. The guard correctly avoids enforcing CUDA/ROCm on macOS/Metal builds.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.github/workflows/ci.yml (2)
22-25: Upgrade to setup-python v5.
`actions/setup-python@v2` is deprecated and flagged by actionlint.

```diff
- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
```

Also applies to: 78-81
40-45: Remove invalid `pip install --no-user`.
`--no-user` is not a valid pip flag (inside venvs it's unnecessary anyway).

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ pip install flash_attn==2.5.8 --no-build-isolation

- pip install . --no-user
+ pip install .

- pip install . --no-user -v
+ pip install . -v
```

Also applies to: 96-101, 107-107
.github/workflows/amd_ci.yml (1)
23-26: Upgrade to setup-python v5. v2 is deprecated; actionlint will flag this.
```diff
- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
```

Also applies to: 77-80
♻️ Duplicate comments (3)
.github/workflows/metal_ci.yml (3)
21-25: Update setup-python to v5. Same as the earlier bot comment; apply in both jobs.
```diff
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
    with:
      python-version: ${{ env.PYTHON_VERSION }}
```

Also applies to: 76-79
29-31: Fix hashing on macOS (no `sha256sum`). Use `shasum -a 256` (portable on macOS). Apply to both blocks.

```diff
- REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+ REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

```diff
- REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+ REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

Also applies to: 83-85
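A cross-runner wrapper covering both cases could look like the sketch below (`hash_file` is a hypothetical helper name, not something in the workflow; it prefers GNU coreutils' `sha256sum` on Linux and falls back to `shasum`, which ships with macOS):

```shell
#!/bin/sh
# Portable SHA-256 of a file across Linux and macOS runners.
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    # GNU coreutils (typical Linux runner)
    sha256sum "$1" | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    # Perl shasum (ships with macOS)
    shasum -a 256 "$1" | awk '{print $1}'
  else
    echo "no_hash_tool"
    return 1
  fi
}

printf 'hello\n' > /tmp/reqs.txt
hash_file /tmp/reqs.txt
```

Either tool produces the same digest, so the cache key stays stable when a job migrates between Linux and macOS runners.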
40-44: Remove invalid `pip install --no-user`. Not a valid flag and unnecessary in venvs. Apply in all occurrences.
```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install . --no-user -v
+ pip install . -v
```

Also applies to: 93-96, 102-102
🧹 Nitpick comments (4)
.github/workflows/ci.yml (1)
12-14: Tighten permissions. Format check doesn't push commits; prefer read-only.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```

.github/workflows/metal_ci.yml (3)
13-13: Trim trailing space. YAMLlint flags trailing spaces.
```diff
- contents: write 
+ contents: write
```
10-10: Target Apple Silicon explicitly for MPS. To avoid Intel mac runners without MPS, prefer macOS 14 (Apple Silicon).
```diff
- runs-on: [macos-latest]
+ runs-on: [macos-14]
```

Please confirm your org has access to macOS-14 ARM runners; otherwise keep latest and skip MPS tests when unavailable.
Also applies to: 63-63
104-109: Add a quick MPS sanity check before tests. Fail fast if MPS isn't available.
```diff
  - name: Run metal test
    run: |
      source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
+     python - <<'PY'
+     import sys
+     try:
+         import torch
+         print("torch:", torch.__version__, "MPS available:", torch.backends.mps.is_available())
+         assert torch.backends.mps.is_available(), "MPS not available on this runner"
+     except Exception as e:
+         print("Precheck failed:", e, file=sys.stderr)
+         sys.exit(1)
+     PY
      cd testing/python
      unset PYTHONPATH
      python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 5b306179cb8b24f66118ff9847bad7a1847e82ff and 65ffe111a4df5414ce44a0a14c5113c6491e35c8.
📒 Files selected for processing (3)
- `.github/workflows/amd_ci.yml` (1 hunks)
- `.github/workflows/ci.yml` (1 hunks)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-test-metal
```shell
git submodule update --init --recursive --recommend-shallow
mkdir -p build
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
🛠️ Refactor suggestion
Guard --recommend-shallow for older Git on macOS.
Add a fallback to keep runners with older Git working.
```diff
- git submodule update --init --recursive --recommend-shallow
+ if git submodule update -h 2>&1 | grep -q -- '--recommend-shallow'; then
+   git submodule update --init --recursive --recommend-shallow
+ else
+   git submodule update --init --recursive
+ fi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
if git submodule update -h 2>&1 | grep -q -- '--recommend-shallow'; then
  git submodule update --init --recursive --recommend-shallow
else
  git submodule update --init --recursive
fi
mkdir -p build
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
Actionable comments posted: 1
🧹 Nitpick comments (5)
.github/workflows/metal_ci.yml (5)
16-16: Remove trailing whitespace (yamllint violation). Line 16 has a trailing space after "write".
```diff
- contents: write 
+ contents: write
```
40-50: Make format-check fail deterministically on formatting drift. Currently you fail only when format.sh exits non-zero; if the script exits 0 but emits the sentinel text, the job passes. Harden the logic.
```diff
- if ! output=$(./format.sh 2>&1); then
-   echo "------------------------------------"
-   echo "message:"
-   echo "$output"
-   printf '%s\n' "$output" | grep "Please review and stage the changes."
-   echo "------------------------------------"
-   exit 1
- fi
+ set -euo pipefail
+ status=0
+ output="$(./format.sh 2>&1)" || status=$?
+ if [ "$status" -ne 0 ] || printf '%s\n' "$output" | grep -q "Please review and stage the changes."; then
+   echo "------------------------------------"
+   echo "message:"
+   printf '%s\n' "$output"
+   echo "------------------------------------"
+   exit 1
+ fi
```
13-13: Pin to Apple Silicon runners for MPS/Metal determinism. To ensure MPS availability and consistent behavior, prefer macOS 14 (Apple Silicon) explicitly rather than the moving alias macos-latest.
```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Apply to both jobs after confirming that your Metal/MPS tests require Apple Silicon on GitHub-hosted runners.
Also applies to: 54-54
15-17: Use least-privilege permissions (write not needed). This job doesn't push changes; read is sufficient.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
82-86: Use working-directory instead of cd; keep PYTHONPATH unset. Cleaner, avoids subshell state and makes logs clearer.
```diff
- - name: Run metal test
-   run: |
-     cd testing/python
-     unset PYTHONPATH
-     python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
+ - name: Run metal test
+   working-directory: testing/python
+   run: |
+     unset PYTHONPATH
+     python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 65ffe111a4df5414ce44a0a14c5113c6491e35c8 and ca6546ac34e89cde32673641f20a0aca904533c8.
📒 Files selected for processing (1)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
28-28: property "os" is not defined in object type {}
(expression)
68-68: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 16-16: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
Actionable comments posted: 0
♻️ Duplicate comments (1)
.github/workflows/metal_ci.yml (1)
63-63: Fix invalid cache key reference to matrix.os (actionlint failure). Replace undefined `matrix.os` with `runner.os` and include `runner.arch` for uniqueness.
```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (5)
.github/workflows/metal_ci.yml (5)
16-16: Remove trailing space (yamllint).

```diff
- contents: write 
+ contents: write
```
15-17: Scope down permissions to least privilege. This job doesn't appear to need write. Prefer read.
```diff
  permissions:
-   contents: write
+   contents: read
```
33-46: Make format check fail on detected changes even if the script exits 0. Currently failure only triggers when format.sh returns non-zero. Also fail on the marker text and on any git diff.
```diff
  - name: Run format check
    run: |
-     source .venv/bin/activate
-     mkdir -p build
-     cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
-     if ! output=$(./format.sh 2>&1); then
-       echo "------------------------------------"
-       echo "message:"
-       echo "$output"
-       printf '%s\n' "$output" | grep "Please review and stage the changes."
-       echo "------------------------------------"
-       exit 1
-     fi
-     rm -rf build
+     set -euo pipefail
+     source .venv/bin/activate
+     mkdir -p build
+     (cd build && cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON)
+     output=$(./format.sh 2>&1) || true
+     echo "------------------------------------"
+     echo "$output"
+     echo "------------------------------------"
+     if printf '%s\n' "$output" | grep -q "Please review and stage the changes."; then
+       exit 1
+     fi
+     if ! git diff --quiet; then
+       git --no-pager diff
+       exit 1
+     fi
+     rm -rf build
```
13-13: Consider pinning to macos-14 for stable Apple Silicon/MPS runners. Reduces surprises from future macos-latest changes.

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Also applies to: 49-49
- runs-on: [macos-latest] + runs-on: macos-14Also applies to: 49-49
77-81: Optional: print ccache stats after tests.Quick visibility into cache effectiveness.
- name: Run metal test run: | cd testing/python unset PYTHONPATH python -m pytest -k metal -v -r fE --durations=0 --timeout=3600 + + - name: ccache stats + run: ccache -s || true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between ca6546ac34e89cde32673641f20a0aca904533c8 and e9cd15f744d1cb61534c907cad248305982953e0.
📒 Files selected for processing (1)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
63-63: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 16-16: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
Force-pushed: 9f7474c to 98336a3 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
.github/workflows/metal_ci.yml (1)
64-64: Fix cache key: matrix.os is undefined in this workflow
Using `matrix.os` without a matrix breaks actionlint and can affect cache keys. Use the runner context.

```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (3)
.github/workflows/metal_ci.yml (3)
14-14: Trim trailing whitespace (yamllint error). There's a trailing space after `write` flagged by YAMLlint.

```diff
- contents: write 
+ contents: write
```
11-11: Pin macOS runner to Apple Silicon image for stable MPS/Metal
`macos-latest` can shift. Pin to `macos-14` (or your validated version) to ensure consistent MPS/Metal availability.

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

(Apply to both jobs.)
Also applies to: 47-47
13-15: Restrict permissions to least privilege
This job doesn't push changes; `contents: read` suffices.

```diff
  permissions:
-   contents: write
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e9cd15f744d1cb61534c907cad248305982953e0 and 98336a39372308bb8964379727b6155dc665cdb3.
📒 Files selected for processing (2)
- `.github/workflows/metal_ci.yml` (1 hunks)
- `format.sh` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
64-64: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 14-14: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
Actionable comments posted: 2
🧹 Nitpick comments (3)
.github/workflows/ci.yml (3)
124-125: Stabilize Metal job: pin Apple Silicon runner and preflight MPS availability. To avoid drift if `macos-latest` changes, pin to an Apple Silicon image (e.g., macos-14). Add a quick MPS check to fail fast with a clear message if MPS isn't available.
Apply:

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Insert before tests:
```diff
+ - name: Verify MPS availability
+   run: |
+     python - <<'PY'
+     import sys
+     try:
+         import torch
+         ok = getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
+     except Exception:
+         ok = False
+     print(f"MPS available: {ok}")
+     sys.exit(0 if ok else 1)
+     PY
```

Also applies to: 156-160
133-136: Improve checkout completeness for submodules on macOS. Ensure the whole history is available when submodules/versions require it.
Apply:
```diff
  with:
    submodules: recursive
+   fetch-depth: 0
```
12-14: Least-privilege: format-check likely only needs read permission. Unless you auto-commit formatting, drop to `contents: read`.

Apply:

```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 98336a39372308bb8964379727b6155dc665cdb3 and fd8a47bf287afd3a2435815e3b9370d4e2ec4d49.
📒 Files selected for processing (1)
- `.github/workflows/ci.yml` (2 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/ci.yml
141-141: property "os" is not defined in object type {}
(expression)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
```diff
@@ -7,7 +7,7 @@ env:

 jobs:
   format-check:
-    runs-on: [self-hosted, nvidia]
+    runs-on: [ubuntu-latest]
```
Make format-check runner GPU-agnostic: drop flash-attn build and don’t force CUDA in CMake.
On ubuntu-latest, building flash_attn commonly fails (no CUDA toolchain). Also forcing -DUSE_CUDA=ON can make CMake error out on non-GPU runners. These can break formatting CI unrelated to GPU.
Apply:
```diff
- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ # GPU-specific deps are not needed for format check; avoid fragile builds
+ true
```

```diff
- cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+ # Generate compile_commands.json without assuming CUDA
+ cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=OFF; cd ..
```

Also applies to: 41-45, 53-54
🤖 Prompt for AI Agents
.github/workflows/ci.yml around line 10 (and also update the same changes at
lines 41-45 and 53-54): the format-check runner is currently tied to GPU tooling
— remove the step that builds flash-attn and stop forcing CUDA in CMake for this
runner; update the job matrix/steps so the format-check job does not invoke
building flash-attn and remove or conditionally set any -DUSE_CUDA=ON CMake
flags (only enable CUDA on GPU-capable jobs), ensuring the workflow uses
CPU-only defaults so formatting CI will run on ubuntu-latest without requiring a
CUDA toolchain.
```yaml
uses: hendrikmuhs/ccache-action@v1.2
with:
  create-symlink: true
  key: ${{ github.job }}-${{ matrix.os }}
```
Fix ccache key: matrix.os is undefined (will fail at runtime).
matrix.os isn’t defined in this job (no matrix), which actionlint already flags. Use runner.os and include Python/version and requirements hash to improve cache hits.
Apply:
```diff
  with:
    create-symlink: true
-   key: ${{ github.job }}-${{ matrix.os }}
+   key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-py${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements-*.txt') }}
+   restore-keys: |
+     ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
uses: hendrikmuhs/ccache-action@v1.2
with:
  create-symlink: true
  key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-py${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements-*.txt') }}
  restore-keys: |
    ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```
🧰 Tools
🪛 actionlint (1.7.7)
141-141: property "os" is not defined in object type {}
(expression)
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around lines 138 to 142 the ccache key uses
matrix.os which is undefined and will fail; replace matrix.os with runner.os and
expand the key to include stable identifiers like the Python version and a hash
of requirements (or lockfile) to improve cache correctness and hit rate — e.g.
construct the key as a combination of github.job, runner.os, the python-version
variable, and a requirements.txt (or poetry/poetry.lock) hash so the cache is
unique per OS/Python/deps and will not error at runtime.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/amd_ci.yml (1)
64-120: Remove invalid pip flag and modernize Python setup
- pip --no-user is invalid and will fail; remove it (both install locations).
- Update actions/setup-python to v5 in this job too.
```diff
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
```

.github/workflows/metal_ci.yml (1)
88-94: Install the built wheel explicitly. Avoid picking up sdists (which could rebuild without USE_METAL). Install only the wheel.
```diff
- uv pip install dist/*
+ uv pip install dist/*.whl
```
♻️ Duplicate comments (2)
.github/workflows/metal_ci.yml (2)
82-87: Don't source `.venv` with setup-uv; ensure the Metal build flag is set. setup-uv already activates the env; `.venv/bin/activate` may not exist and will fail. Also ensure Metal is enabled during build.

```diff
- source .venv/bin/activate
- uv pip install build wheel ninja cmake
- python -mbuild -w --no-isolation
+ uv pip install build wheel ninja cmake
+ USE_METAL=1 python -m build -w --no-isolation
```
65-70: Fix undefined `matrix.os` in ccache key (actionlint). `matrix.os` isn't defined; use the runner context to keep keys valid and unique.
```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (4)
setup.py (1)
102-106: Improve error message clarity. The error message for missing CUDA_HOME uses a backslash for line continuation which could be confusing. Consider using parentheses for cleaner multi-line strings.
Apply this diff to improve readability:
```diff
- "CUDA support is enabled by default on linux if `USE_ROCM=False`," \
- " but CUDA_HOME is not set or detected.")
+ ("CUDA support is enabled by default on linux if `USE_ROCM=False`, "
+  "but CUDA_HOME is not set or detected."))
```

.github/workflows/amd_ci.yml (1)
9-62: Gating with if: false — consider a toggle and update setup-python
- Replace hardcoded if: false with a single env/input (e.g., ENABLE_AMD_CI) so both jobs can be enabled together without editing YAML.
- Update actions/setup-python from v2 to v5 to avoid deprecation issues.
```diff
  format-check:
-   if: false
+   if: ${{ env.ENABLE_AMD_CI == 'true' }}
@@
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
```

.github/workflows/metal_ci.yml (2)
10-11: Pin runner to Apple Silicon. Metal/MPS tests may require Apple Silicon. Prefer macos-14 (arm64) over macos-latest to avoid Intel runners.
```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```
12-14: Tighten permissions and fix trailing space. Use least-privilege and remove the trailing space flagged by yamllint.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between fd8a47bf287afd3a2435815e3b9370d4e2ec4d49 and 32edb2446da0b9e707397767677eb68b032d9962.
📒 Files selected for processing (4)
- `.github/workflows/amd_ci.yml` (2 hunks)
- `.github/workflows/ci.yml` (2 hunks)
- `.github/workflows/metal_ci.yml` (1 hunks)
- `setup.py` (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/ci.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-metal
- GitHub Check: bot-task
🔇 Additional comments (9)
setup.py (8)
36-43: LGTM! Boolean environment parsing is now more robust. The updated `_read_bool_env` function correctly handles both truthy and falsy values explicitly, including empty strings as false. This prevents the previous issue where explicit false values like "0" or "false" would incorrectly return the default.
65-81: LGTM! Platform-specific library naming is correctly handled. The prebuild items lists are properly differentiated between Linux (`.so`) and macOS (`.dylib`) shared libraries, with appropriate deletion items for macOS debug symbols.
614-622: LGTM! The CMakeExtension constructor is properly updated. The addition of the `**kwargs` parameter allows for flexible extension configuration while maintaining backward compatibility.
669-669: LGTM! Platform-specific file globbing is correctly implemented. The dynamic selection between `.so` and `.dylib` files based on the platform is appropriate for cross-platform compatibility.
700-710: Document the Metal-specific Cython compilation issue. The comment explains the issue with Cython compilation on Metal/macOS requiring libpython linkage. This is a reasonable workaround until the issue is investigated further.
761-763: Document the macOS Cython linking issue. The FIXME comment correctly identifies that aarch64 Darwin needs additional Python library linking. This is consistent with the earlier Metal-specific handling.
822-857: LGTM! CMake configuration logic properly handles the Metal backend. The Metal-first configuration approach is well-implemented:

- Metal takes precedence when `USE_METAL` is enabled
- Proper fallback to ROCM/CUDA backends
- Correct mutual exclusion of backends
487-494: LGTM! Cleanup logic for platform-specific artifacts is properly implemented. The loop handles deletion of macOS-specific build artifacts (like `.dSYM` files) that need to be removed during the build process.

.github/workflows/metal_ci.yml (1)
35-49: Format step looks good. CMake prep with USE_METAL=ON and clean failure output are clear.
```diff
 IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = platform.mac_ver()[2] == 'arm64'

 # Add LLVM control environment variable
-USE_LLVM = os.environ.get("USE_LLVM", "False").lower() == "true"
+USE_LLVM = _read_bool_env('USE_LLVM')
 # Add ROCM control environment variable
-USE_ROCM = os.environ.get("USE_ROCM", "False").lower() == "true"
+USE_ROCM = _read_bool_env("USE_ROCM")
+# Add ROCM control environment variable
+USE_METAL = _read_bool_env("USE_METAL", MAYBE_METAL)
+USE_CUDA = _read_bool_env("USE_CUDA", IS_LINUX and not USE_ROCM)
 # Build with Debug mode
```
💡 Verification agent
🧩 Analysis chain
Verify Metal availability detection logic.
The MAYBE_METAL variable checks for arm64 architecture on macOS, but this may not be sufficient for determining Metal support availability. Consider if additional checks are needed for Metal framework availability.
🏁 Script executed:
```shell
#!/bin/bash
# Description: Check if Metal detection logic in other files is consistent with setup.py
# Search for Metal availability checks in the codebase
rg -n "check_metal_availability|metal.*availability" --type=py -A3 -B3
# Look for platform-specific Metal checks
rg -n "mac_ver|arm64|MAYBE_METAL" --type=py -A2 -B2
```

Length of output: 2480
Replace arch-only Metal detection with feature-based detection
Both setup.py (MAYBE_METAL = platform.mac_ver()[2] == 'arm64') and tilelang/utils/target.py (check_metal_availability returning arch == 'arm64') assume Metal only on arm64 and thus exclude Intel Macs that may support Metal. Replace these ad-hoc checks with a single helper that verifies macOS release and Metal framework availability (e.g., presence of /System/Library/Frameworks/Metal.framework or attempting to load the framework) and use it both in setup.py and target selection.
Locations: setup.py (MAYBE_METAL, ~lines 51–62); tilelang/utils/target.py (check_metal_availability, ~lines 46–51; used at ~line 84).
🤖 Prompt for AI Agents
In setup.py around lines 51–62 and tilelang/utils/target.py around lines 46–51
and usage at ~84, replace the arch-only Metal detection with a feature-based
helper: add a cross-file helper is_metal_available() (in a shared util module or
a small platform_helpers.py) that returns True only if platform.system() ==
'Darwin', macOS release is new enough (parse platform.mac_ver()[0] or
platform.release()), and the Metal framework is present (check existence of
/System/Library/Frameworks/Metal.framework or attempt to load it via
ctypes.CDLL); then change MAYBE_METAL in setup.py to call is_metal_available()
instead of checking arm64, and replace check_metal_availability / arch ==
'arm64' in tilelang/utils/target.py to call this new helper so both files use
the same robust feature-detection logic.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
.github/workflows/amd_ci.yml (2)
41-45: Remove invalid pip flag--no-user.
pip install --no-userisn’t a valid option and will fail.- python -m pip install --upgrade pip --no-user + python -m pip install --upgrade pip - [[ -f requirements-test.txt ]] && \ - PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user + [[ -f requirements-test.txt ]] && \ + PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
98-105: Remove `--no-user` in remaining pip installs.

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip
@@
- PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-rocm.txt
+ PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-rocm.txt
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
```

Also applies to: 112-112
setup.py (2)
271-279: `get_cmake_path` can raise `TypeError` when CMake is missing. Guard against `None` before `os.path.exists`.

```diff
 def get_cmake_path() -> str:
 @@
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```
843-857: Honor `USE_CUDA=False` and explicitly disable CUDA when using Metal. Current logic enables CUDA if `CUDA_HOME` is set even when `USE_CUDA=0`, and the Metal path doesn't force `USE_CUDA OFF`.

```diff
  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
          "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
- elif USE_ROCM:
+ elif USE_ROCM:
      content_lines += [
          f"set(USE_ROCM {ROCM_HOME})",
          "set(USE_CUDA OFF)",
      ]
- elif CUDA_HOME:
+ elif USE_CUDA and CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
```
♻️ Duplicate comments (4)
.github/workflows/ci.yml (1)
41-45: Remove invalid `--no-user` and avoid fragile GPU deps in format-check.

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ # GPU-specific deps not needed for formatting
+ true

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
- pip install . --no-user
+ # Skip GPU-only deps here
+ pip install .

- pip install . --no-user -v
+ pip install . -v
```

```diff
- cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+ # Do not force CUDA for formatting
+ cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=OFF; cd ..
```

Also applies to: 101-103, 109-109
.github/workflows/metal_ci.yml (2)
84-86: Don't source `.venv` when using `setup-uv` with activation. The UV action already activates the environment; `.venv` may not exist.

```diff
- source .venv/bin/activate
- uv pip install -v --no-build-isolation .
+ uv pip install -v --no-build-isolation .
```
69-69: Fix cache key: `matrix.os` is undefined (actionlint error). Use the `runner` context to avoid runtime failure.

```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-${{ runner.arch }}-py${{ env.PYTHON_VERSION }}
+ restore-keys: |
+   ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```

setup.py (1)
36-44: Boolean env parser misinterprets explicit false values (and ignores empty string).
`if env := os.environ.get(name):` skips `''`, so `elif env in ['', ...]` is unreachable. This can enable backends unintentionally.

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ("1", "true", "on", "yes", "y"):
+        return True
+    if s in ("0", "false", "off", "no", "n", ""):
+        return False
+    return default
```
🧹 Nitpick comments (13)
.github/workflows/amd_ci.yml (4)
11-11: Don’t hard-disable AMD CI; gate via repo var or input.

Replace `if: false` with a controllable switch so maintainers can re-enable without code changes.

- if: false
+ if: ${{ vars.DISABLE_AMD_CI != '1' }}
65-65: Same here: avoid permanent disable.

Use a variable-gated condition instead of `if: false`.

- if: false
+ if: ${{ vars.DISABLE_AMD_CI != '1' }}
24-26: Upgrade setup-python to v5.

`actions/setup-python@v2` is deprecated.

- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
  with:
    python-version: ${{ env.PYTHON_VERSION }}

Also applies to: 79-81
15-15: Trim trailing whitespace in permissions.

- contents: write 
+ contents: write

.github/workflows/ci.yml (4)
10-10: Avoid permanent disable; gate via a repo variable.

- if: false
+ if: ${{ vars.DISABLE_NVIDIA_CI != '1' }}
66-66: Same for build-test-nvidia: use a toggle instead of `if: false`.

- if: false
+ if: ${{ vars.DISABLE_NVIDIA_CI != '1' }}
23-25: Update to `actions/setup-python@v5`.

- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
  with:
    python-version: ${{ env.PYTHON_VERSION }}

Also applies to: 80-82
14-14: Trailing space after `contents: write`.

- contents: write 
+ contents: write

.github/workflows/metal_ci.yml (2)
13-13: Trim trailing whitespace.

- contents: write 
+ contents: write
40-40: Consider generator parity with setup.py.

If the build expects Ninja elsewhere, specify `-G Ninja` here too for consistency, or rely on the default uniformly.

setup.py (3)
51-53: Metal detection: prefer `platform.system()`/`platform.machine()`.

`platform.mac_ver()` is brittle; simplify and be explicit.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == "Linux"
+MAYBE_METAL = (platform.system() == "Darwin" and platform.machine() == "arm64")
900-903: Classifiers/platforms: include macOS when Metal builds are possible.

- platforms=[
-     "Environment :: GPU :: NVIDIA CUDA" if not USE_ROCM else "Environment :: GPU :: AMD ROCm",
-     "Operating System :: POSIX :: Linux",
- ],
+ platforms=[
+     ("Environment :: GPU :: NVIDIA CUDA" if (USE_CUDA and not USE_ROCM) else
+      "Environment :: GPU :: AMD ROCm" if USE_ROCM else
+      "Environment :: GPU :: Apple Metal"),
+     ("Operating System :: MacOS" if USE_METAL else "Operating System :: POSIX :: Linux"),
+ ],
103-106: Long exception message (TRY003). Consider shortening or using a constant; non-blocking.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between fd8a47bf287afd3a2435815e3b9370d4e2ec4d49 and 608fef209e5a99d4fe6a6cc6d1feb5b1abcf7092.
📒 Files selected for processing (4)
- .github/workflows/amd_ci.yml (2 hunks)
- .github/workflows/ci.yml (2 hunks)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (11 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
        python-version: ${{ env.PYTHON_VERSION }}

    - name: Ensure venv (local & persistent)
      run: uv pip install -r requirements-test.txt -r requirements-build.txt
Guard optional requirements to avoid failures if the file is missing.
- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+ set -e
+ [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+ [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+   set -e
+   [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+   [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt
🤖 Prompt for AI Agents
In .github/workflows/metal_ci.yml around line 80, the step currently tries to
install optional requirement files unconditionally which will fail the job if
those files are missing; change the run command to detect each optional
requirements file (e.g., requirements-test.txt and requirements-build.txt) and
only include them in the pip install command if they exist (use a shell
conditional to build the -r arguments or test -f before calling pip), ensuring
that missing files are skipped but real install errors still fail the job.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
setup.py (1)
271-279: get_cmake_path can raise TypeError when cmake is missing

shutil.which() returns None; os.path.exists(None) throws. Guard None explicitly.
  def get_cmake_path() -> str:
@@
-     cmake_path = shutil.which("cmake")
-     if not os.path.exists(cmake_path):
-         raise Exception("CMake is not installed, please install it first.")
-     return cmake_path
+     cmake_path = shutil.which("cmake")
+     if not cmake_path or not os.path.exists(cmake_path):
+         raise RuntimeError("CMake is not installed or not on PATH.")
+     return cmake_path
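To illustrate the failure mode the comment describes (a minimal standalone sketch, not project code; the missing binary name is hypothetical):

```python
import os
import shutil

# shutil.which returns None when the binary is not on PATH.
path = shutil.which("no-such-binary-hopefully-xyz")
print(path)  # None

# os.path.exists only swallows OSError/ValueError internally,
# so passing None raises TypeError instead of returning False.
try:
    os.path.exists(path)
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```

This is why the suggested fix checks `if not cmake_path` before ever touching `os.path.exists`.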
♻️ Duplicate comments (5)
.github/workflows/metal_ci.yml (3)
84-85: Don’t source .venv with setup-uv; it’s already activated.

setup-uv with `activate-environment: true` exports the env; `.venv` may not exist.
- source .venv/bin/activate
- uv pip install -v --no-build-isolation .
+ uv pip install -v --no-build-isolation .
69-69: Fix actionlint error: undefined `matrix.os` in cache key

No matrix is defined; this breaks actionlint and can impact cache reuse.
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
79-81: Guard optional requirement files

Unconditionally installing both files fails if one is absent.

- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+   set -euo pipefail
+   [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+   [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt

setup.py (2)
51-62: Prefer feature-based Metal detection over arch-only heuristic

Intel Macs can support Metal; an arch-only check excludes them. Centralize detection (Darwin + Metal.framework presence) and reuse.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+
+def _has_metal() -> bool:
+    if platform.system() != 'Darwin':
+        return False
+    return os.path.exists('/System/Library/Frameworks/Metal.framework')
+
+MAYBE_METAL = _has_metal()
36-43: Boolean env parsing bug: empty string never reaches the “false” branch

The truthy guard prevents '' from being treated as false; explicit “0/false/off/''” can incorrectly return the default.

-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
🧹 Nitpick comments (4)
.github/workflows/metal_ci.yml (2)
12-14: Drop unnecessary write permission and fix trailing space

This job doesn't publish artifacts; reduce to read-only and remove the trailing space on Line 13.

- permissions:
-   contents: write
+ permissions:
+   contents: read
71-78: Ensure Ninja is present or fall back to default generator

CMake is invoked with “-G Ninja” in setup.py; missing Ninja will fail on some mac runners.

  - name: Install python via uv
    uses: astral-sh/setup-uv@v6
    with:
      enable-cache: true
      ignore-nothing-to-cache: true
      activate-environment: true
      python-version: ${{ env.PYTHON_VERSION }}
+
+ - name: Ensure Ninja
+   run: |
+     if ! command -v ninja >/dev/null 2>&1; then
+       brew update && brew install ninja
+     fi

Optionally, make setup.py fall back to the default generator if Ninja is absent.
setup.py (2)
823-825: Prefer Python3_EXECUTABLE for CMake

FindPython prefers Python3_EXECUTABLE; improves compatibility across CMake toolchains.

- f"-DPython_EXECUTABLE={sys.executable}",
+ f"-DPython3_EXECUTABLE={sys.executable}",
844-859: Always emit mutually exclusive GPU flags

When CUDA is selected, set USE_ROCM OFF; when ROCm is selected, set USE_CUDA OFF; when neither is selected, set both OFF explicitly to avoid stale cache effects.

  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
-         "set(USE_ROCM OFF)",
+         "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
  elif USE_ROCM:
      content_lines += [
          f"set(USE_ROCM {ROCM_HOME})",
          "set(USE_CUDA OFF)",
      ]
  elif CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
+ else:
+     content_lines += [
+         "set(USE_CUDA OFF)",
+         "set(USE_ROCM OFF)",
+     ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 608fef209e5a99d4fe6a6cc6d1feb5b1abcf7092 and 4c42d7eea70d878a010888bc4944585c059109e7.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (11 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-metal
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (2)
setup.py (2)
102-111: Linux-only GPU requirement may be too strict

This hard-errors on Linux without CUDA/ROCM even for CPU-only users or CI. Confirm this is intentional; otherwise downgrade to a warning or gate behind an env var.
If relaxing is desired:
-if IS_LINUX and not (CUDA_HOME or ROCM_HOME):
-    raise ValueError(
-        "Failed to automatically detect CUDA or ROCM installation. Please set the CUDA_HOME or ROCM_HOME environment variable manually (e.g., export CUDA_HOME=/usr/local/cuda or export ROCM_HOME=/opt/rocm)."
-    )
+if IS_LINUX and not (CUDA_HOME or ROCM_HOME):
+    logger.warning("No CUDA/ROCM detected; building CPU-only components.")
700-710: Remove CythonExtension from ext_modules when MAYBE_METAL is true

setup.py returns early for Metal without producing the Cython artifact, but ext_modules still declares CythonExtension — this can break packaging/editable installs.
File: setup.py (around lines 700–710)
- ext_modules=[
-     CMakeExtension("TileLangCXX", sourcedir="."),
-     CythonExtension("TileLangCython", sourcedir="."),
- ],
+ ext_modules=[
+     CMakeExtension("TileLangCXX", sourcedir="."),
+ ] + ([] if MAYBE_METAL else [CythonExtension("TileLangCython", sourcedir=".")]),
# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
# logger.info(command)
os.system(command)
🛠️ Refactor suggestion
Fix Darwin link flags for Cython build (non-Metal macs)
Using “-shared” on macOS is incorrect; use `-dynamiclib` and allow undefined symbols to resolve against Python at load time.
- command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
- # logger.info(command)
- os.system(command)
+ if sys.platform == "darwin":
+ ldflags = "-dynamiclib -undefined dynamic_lookup"
+ else:
+ ldflags = "-shared -pthread -fPIC"
+ command = f"{cc} {ldflags} -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
+ logger.info(command)
+ subprocess.check_call(command, shell=True)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
# logger.info(command)
os.system(command)

After:

# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
if sys.platform == "darwin":
    ldflags = "-dynamiclib -undefined dynamic_lookup"
else:
    ldflags = "-shared -pthread -fPIC"
command = f"{cc} {ldflags} -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
logger.info(command)
subprocess.check_call(command, shell=True)
🧰 Tools
🪛 Ruff (0.12.2)
765-765: Starting a process with a shell, possible injection detected
(S605)
🤖 Prompt for AI Agents
In setup.py around lines 762-765, the build command uses the Linux-style
"-shared" which is incorrect on macOS; change the link flags for Darwin to use
"-dynamiclib" and allow undefined symbols to be resolved by Python by adding
"-undefined dynamic_lookup" (and keep other flags as-is or remove -fPIC if
desired for macOS), so on macOS construct the command with -dynamiclib and
-undefined dynamic_lookup instead of -shared before invoking os.system(command).
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
setup.py (2)
674-683: Copy macOS-built Python extension modules (.so) too; not just .dylib.

On macOS, C extensions are typically .so. Include both patterns.

-import glob
-files = glob.glob("*.so" if IS_LINUX else "*.dylib")
+import glob
+files = glob.glob("*.so")
+if not IS_LINUX:
+    files += glob.glob("*.dylib")
839-854: CMake GPU toggles: explicitly disable unused backends and honor USE_CUDA flag.
- In Metal branch, also set USE_CUDA OFF.
- In CUDA branch, respect USE_CUDA rather than presence of CUDA_HOME alone.
  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
          "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
  elif USE_ROCM:
@@
- elif CUDA_HOME:
+ elif USE_CUDA and CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
+ else:
+     content_lines += [
+         "set(USE_METAL OFF)",
+         "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
+     ]
♻️ Duplicate comments (2)
setup.py (2)
36-44: Boolean env parsing still mishandles empty strings and unrecognized values; treat true/false explicitly.

The current truthiness gate skips empty strings; it is also missing yes/no aliases. Use explicit parsing.

-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
753-761: Fix Darwin link flags for Cython artifact and avoid shell-based os.system.

Use -dynamiclib and -undefined dynamic_lookup on macOS; prefer subprocess.check_call without shell.

  cython = get_cython_compiler()
  if cython is None:
      raise Exception("Cython is not installed, please install it first.")
- os.system(f"{cython} {cython_wrapper_path} --cplus -o {source_path}")
+ subprocess.check_call([cython, str(cython_wrapper_path), "--cplus", "-o", str(source_path)])
  python_include_path = sysconfig.get_path("include")
  cc = get_cplus_compiler()
- if MAYBE_METAL:
-     cc += ' -Wl,-undefined,dynamic_lookup'
- command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
- logger.info(command)
- os.system(command)
+ if sys.platform == "darwin":
+     cmd = [
+         cc, "-dynamiclib", "-undefined", "dynamic_lookup",
+         "-fwrapv", "-O2", "-Wall", "-fno-strict-aliasing",
+         "-I", python_include_path, str(source_path), "-o", str(temp_path)
+     ]
+ else:
+     cmd = [
+         cc, "-shared", "-pthread", "-fPIC",
+         "-fwrapv", "-O2", "-Wall", "-fno-strict-aliasing",
+         "-I", python_include_path, str(source_path), "-o", str(temp_path)
+     ]
+ logger.info(" ".join(cmd))
+ subprocess.check_call(cmd)
🧹 Nitpick comments (3)
setup.py (3)
56-66: Refine Metal default detection; avoid `platform.mac_ver()[2]` and gate by OS/arch/version.

The arm64-only check is brittle. Prefer feature/OS checks and the MPS minimum macOS.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+
+def _is_apple_silicon_macos(min_ver: str = "12.3") -> bool:
+    if platform.system() != "Darwin":
+        return False
+    try:
+        from packaging.version import Version
+        mac_ver = platform.mac_ver()[0] or "0"
+        return platform.machine().lower() in ("arm64", "aarch64") and Version(mac_ver) >= Version(min_ver)
+    except Exception:
+        return platform.machine().lower() in ("arm64", "aarch64")
+
+MAYBE_METAL = _is_apple_silicon_macos()

Also consider centralizing this helper (shared with target selection) to keep detection consistent across files.
920-930: Optional: expand classifiers/platforms for Metal/macOS builds.

Consider adding a macOS classifier and GPU backend markers reflecting Metal when enabled.
46-69: Optional: document env var precedence and defaults in README.

Clarify how USE_CUDA/USE_ROCM/USE_METAL interact and the defaults per platform to reduce surprise.
I can draft a short “Build matrix and env vars” section for README if helpful.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 4c42d7eea70d878a010888bc4944585c059109e7 and 7037ce81e29c74ac7a31a48b1e795aca5254e496.
📒 Files selected for processing (1)
setup.py(11 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.12.2)
setup.py
108-110: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-metal
- GitHub Check: format-check
- GitHub Check: format-check
🔇 Additional comments (3)
setup.py (3)
400-401: LGTM: Linux-only rpath patch guard.
70-86: LGTM: platform-specific prebuilt artifact lists and .dSYM cleanup.
619-623: LGTM: Extension kwargs passthrough for CMakeExtension.
Actionable comments posted: 1
♻️ Duplicate comments (1)
testing/python/metal/test_metal_codegen.py (1)
52-53: Tensor dimensions are correct for matrix multiplication.

The tensor dimensions in the test are properly set up for matrix multiplication `A @ B`:

- `a` has shape `(M, K)`
- `b` has shape `(K, N)`
- `c` has shape `(M, N)`

This correctly implements the mathematical operation C = A × B where the inner dimension K is consistent.
Also applies to: 55-57
🧹 Nitpick comments (2)
testing/python/metal/test_metal_codegen.py (2)
18-20: Consider memory alignment for shared memory allocations.

The shared memory allocations for `A_shared` and `B_shared` may benefit from memory alignment optimizations to improve performance on Metal. Consider aligning the allocation sizes to powers of 2 or Metal's preferred alignment boundaries.
79-80: Add error message when MPS is not available.

The test silently does nothing when MPS is not available. Consider adding an informative message or skip marker to clarify why tests aren't running.

  if __name__ == "__main__":
      if torch.mps.is_available():
          tilelang.testing.main()
+     else:
+         print("MPS not available. Skipping Metal backend tests.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 7037ce81e29c74ac7a31a48b1e795aca5254e496 and 5dafa7782e286ebec4ba8a41cc41060259260d34.
📒 Files selected for processing (1)
testing/python/metal/test_metal_codegen.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/metal/test_metal_codegen.py (6)
tilelang/jit/__init__.py (1)
  jit (241-314)
tilelang/language/allocate.py (2)
  alloc_shared (21-36)
  alloc_fragment (53-64)
tilelang/language/fill.py (1)
  clear (24-48)
tilelang/language/pipeline.py (1)
  Pipelined (9-46)
tilelang/language/copy.py (1)
  copy (84-152)
tilelang/jit/kernel.py (1)
  kernel_source (470-471)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-metal
- GitHub Check: bot-task
🔇 Additional comments (2)
testing/python/metal/test_metal_codegen.py (2)
63-63: Good test coverage for kernel source verification.

The assertion that `kernel_source is not None` properly validates that the Metal backend generates source code as expected.
25-26: Verify coalesced_width=2 for Metal backend.
T.copy uses coalesced_width=2 in testing/python/metal/test_metal_codegen.py (lines 25–26, 31). Confirm that 2 is the optimal coalescing width for the Metal backend and the tensor element type; if not, update to the backend-specific value or make it configurable.
def test_gemm_float16():
    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
🛠️ Refactor suggestion
High tolerance values for float16 and int32 tests.
The tolerance value of atol=1 for float16 and int32 tests seems quite high. For int32, the results should match exactly (atol=0), and for float16, a smaller tolerance might be more appropriate to catch potential precision issues.
def test_gemm_float16():
- assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
+ assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1e-2)
def test_gemm_int32():
- assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='int32', atol=1)
+    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='int32', atol=0)

Also applies to: 75-75
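As a sanity check on the int32 claim above, integer matmuls are bit-exact as long as accumulators don't overflow, so `atol=0` is a reasonable expectation. A small NumPy sketch (hypothetical sizes, not the Metal code path):

```python
import numpy as np

rng = np.random.default_rng(0)
# Small magnitudes keep every dot product well inside int32 range:
# |sum| <= 32 * 8 * 8 = 2048, far below 2**31.
a = rng.integers(-8, 8, size=(64, 32), dtype=np.int32)
b = rng.integers(-8, 8, size=(32, 48), dtype=np.int32)

c32 = a @ b                                    # int32 accumulation
c64 = a.astype(np.int64) @ b.astype(np.int64)  # wider reference

# Integer GEMM has no rounding error, so the results agree exactly.
print(np.array_equal(c32, c64.astype(np.int32)))  # True
```

Float16, by contrast, accumulates rounding error with K, which is why a small nonzero `atol` (or an rtol-based check) is the right shape for that test.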
🤖 Prompt for AI Agents
In testing/python/metal/test_metal_codegen.py around lines 71 and 75, the tests
use an overly large absolute tolerance (atol=1) for float16 and int32; change
the int32 assertions to use atol=0 (exact match) and tighten the float16
tolerance (e.g., atol=1e-2 or another small value appropriate for float16
precision) so the tests detect precision regressions while still allowing
expected float16 rounding error.
Currently the metal target only supports the `torch` execution backend.
* Reset
* Fix other CUDA issue
* fmt
* fmt
* fix cuda error
* fix
* fix
* fmt
* cleanup
* fix
* remove copyright
* trivial update
* readme update
* lint fix

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [FFI] Structural equal and hash based on reflection (apache#18156) This PR adds initial support for structural equal and hash via the new reflection mechanism. It will help us to streamline the structural equality/hash with broader support and clean error reports via AccessPath. It also gives us the ability to unify all struct equal/hash registration into the extra meta-data in reflection registration.
* [Fix][Relax] Fix potential out-of-bounds access in `TupleRewriterNode` (apache#18120) * Root cause * Update
* [BugFix] Fix NCCL build with GlobalDef registration (apache#18158) This PR fixes a build failure in nccl.cc due to the recent switch of global function registration.
* [FFI][REFACTOR] Introduce TypeAttr in reflection (apache#18160) This PR introduces TypeAttr to reflection to bring extra optional attribute registration that can be used to extend behaviors such as structural equality. Also renames TypeExtraInfo to TypeMetadata for better clarity.
* [TIR] Fix trivial index map [] -> [0] (apache#18154) fix trivial index map [] -> [0] Co-authored-by: wrongtest <wrongtest@gmail.com>
* [Relax] Fix issue in fuse concat ops by pattern (apache#18163) * [Relax] Fix issue in fuse concat ops by pattern * fix lint
* [FFI][REFACTOR] Enable custom s_hash/equal (apache#18165) This PR enables custom shash equal via TypeAttr, and also enhances the Var comparison by checking content so we can precheck type signatures.
* [FFI][REFACTOR] Migrate StructuralEqual/Hash to new reflection (apache#18166) This PR migrates the StructuralEqual/Hash to the new reflection based approach. The original mechanisms are still kept around and we will phase them out in followup PRs. The new mechanism unifies the structural equal/hash registration with the normal reflection registration and also brings a cleaner implementation for mismatch detection.
* [Web] Fix incompatible part after FFI updates (apache#18168) fix: incompatible parts with the current core
* [REFACTOR][FFI] Phase out SEqualReduce/SHashReduce (apache#18172) This PR phases out the old SEqualReduce/SHashReduce mechanism in favor of the new reflection mechanism via ffi/reflection. It helps us to reduce the places we need to register the reflection related information. See the current IR examples for upgrading to the new mechanism.
* [FFI] Improve string equal/hash handling (apache#18176) This PR improves the string equal hash handling by improving some of the efficiencies.
* [FFI][REFACTOR] Isolate out extra API (apache#18177) This PR formalizes the extra API in FFI. The extra APIs are a minimal set of APIs that are not required in the core mechanism, but still helpful. Moves structural equal/hash to the extra API.
* [Misc][BugFix] Fix missing PadAttrs register in op_attrs.py (apache#18174) * [COMMUNITY] Add new key for release signing * [Misc][BugFix] Fix missing PadAttrs register in op_attrs.py
* [FFI] Fix SmallMapInit with duplicated keys (apache#18178) This PR fixes small map init when there are duplicated keys.
* [Fix] Fix the wrong check for tuple node in apache#18163 (apache#18170)
* [REFACTOR] Upgrade NestedMsg<T> to use new ffi::Any mechanism (apache#18181) This PR upgrades NestedMsg<T> to use the new ffi::Any mechanism, which will enable us to get better support and enable NestedMsg for POD types.
* [FFI][REFACTOR] Cleanup to align to latest ffi (apache#18183) This PR modernizes legacy use to align with the latest FFI. - Use Any to represent general Any instead of ObjectRef - Use Optional<T>.has_value() instead of defined
* [FFI][REFACTOR] Hide StringObj/BytesObj into details (apache#18184) This PR hides StringObj/BytesObj into details and brings implementations to directly focus on the String/Bytes.
This change will prepare us for future changes such as SmallStr support. Also moves more ObjectRef into Any in RPC.
* [FFI] Introduce small string/bytes (apache#18185)
* [Relax] Fix Relax Operator PReLU (apache#18179)
* [FFI] Lightweight json parser/writer (apache#18186) This PR adds a lightweight json parser/writer to the extra component.
* [FFI][EXTRA] Serialization To/From JSONGraph (apache#18187)
* [FFI][REFACTOR] Migrate the Save/Load JSON to the new reflection (apache#18188) This PR migrates the Save/Load JSON to the new reflection based mechanism. This is a breaking change that updates the JSON format to ffi/extra/serialization to handle the serialization; see the json graph schema comment in ffi/extra/serialization.h for the format, which roughly aligns with the old style. After this change, we no longer need node/reflection and the reflection vtable. We can also phase out TVM_REGISTER_NODE and TVM_REGISTER_OBJECT to have a single place that defines the reflection.
* [FFI][REFACTOR] Phase out getattr based attribute handling (apache#18189) This PR phases out getattr based attribute handling as it is slower and introduces an extra code path. This does mean that if an Object is not explicitly registered on the python side, we will no longer be able to access the field by name. Likely this is also desirable, as we would like to enable faster use that updates the python end and does not rely on this behavior.
* [FFI][REFACTOR] Refactor AccessPath to enable full tree repr (apache#18191) This PR refactors AccessPath so it can be used to represent a full tree with compact memory.
Also fixes a bug in the cython method export.
* [FFI] Phase out ObjectPath in favor of AccessPath (apache#18192) This PR phases out ObjectPath in favor of AccessPath.
* Bump cutlass_fpA_intB_gemm to latest commit (apache#18193)
* [REFACTOR] Update data type rewriter to enable recursive rewrite in Any (apache#18197) This PR updates the data type rewriter so recursive rewrite happens in Any instead of ObjectRef.
* [Relax][ONNX] Parse ONNX Upsample to Relax resize2d (apache#18180)
* [TIR] Fix host/device function check for build (apache#18199) This PR fixes a bug in deciding whether a function is a host or device function in TIR build. Previously the decision was made based on checking whether `"cpu"` is a substring of the target string. This check fails to work for the ROCm target, which usually comes with an `"mcpu"` attribute that also contains `"cpu"`. This PR fixes it by checking target kind. Targets with kind `"llvm"` or `"c"` will be treated as host functions.
* [FFI][REFACTOR] Move Downcast out of ffi for now (apache#18198) Downcast was added for backward compat reasons and it has duplicated features with Any.cast. This PR moves it out of ffi to node for now so the ffi part contains a minimal set of implementations.
* [FFI][REFACTOR] Update Map ABI to enable flexible smallMap switch (apache#18200) This PR updates the Map ABI to use the MSB in slots_ to indicate SmallMap. The change would open doors for future changes to the small map boundary switch.
* [Fix] Codegen fix for relax cutlass (apache#18190) * Codegen fix --------- Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
* [LLVM][CPPTEST] Small fixes for LLVM >= 20 (apache#18202) This PR updates cpptest to run on LLVM >= 20.
* [LLVM] Fixes up to the latest LLVM21 (apache#18204) This PR fixes TVM use with the latest LLVM version 21. - At this time LLVM21 is available as a release candidate.
- Double checks for backward compatibility down to LLVM10
* [FFI][REFATOR] Cleanup entry function to redirect (apache#18205) This PR updates the entry function mechanism to create a stub that redirects to the real function. This new behavior helps to simplify the runtime logic supporting the entry function. Also updates the name to `__tvm_ffi_main__`
* [CODEGEN][REFACTOR] tir.call_llvm_intrin to remove nargs (apache#18206) This PR refactors tir.call_llvm_intrin to omit the first nargs argument in the beginning. Previously the nargs was introduced when prefetch had a different number of signatures. The previous reason no longer stands as of now, and it is less intuitive to attach nargs for call_llvm_intrin, where nargs directly appears in the number of arguments. After the update, tir.call_llvm_intrin can directly pass in the arguments as they are.
* [FFI][Fix] Update datatype registry calls to the new paths (apache#18208)
* [TARGET] Add target for nvidia rtx 5060ti (apache#18211)
* [Bug] Fix core dump in InferLayoutRMSNorm and fix typo (apache#18210)
* [FFI] Make JSON Parser/Write fastmath safe (apache#18212) This PR adds fallbacks for nan and inf detection/creation under fastmath mode.
* [ONNX][FRONTEND][Fix] Update Resize to accept ShapeExpr (apache#18209)
* [Fix][ONNX] No precision widening for numpy binary operations (apache#18207)
* [FFI][REFACTOR] Establish ffi.Module in python (apache#18214) This PR refactors and establishes ffi.Module under the python tvm ffi api. Also moves the export_library method to executable so it aligns more with the compiled artifact.
* [FFI][REFACTOR] Establish Stream Context in ffi (apache#18216)
  This PR sets up the stream context in ffi and migrates the existing per-device-API stream context management to the ffi env API. The new API helps streamline stream-related integration for most libraries.
* [FFI] AutoDLPack compatible with torch stream context (apache#18217)
  This PR updates the autodlpack path to automatically update the env stream to be consistent with the torch stream context. The change helps make FFI functions compatible with stream-based execution. We leverage torch cpp_extension load_inline to create an efficient query function; the first load may take more time to build the jit module, and things should be fast after the torch jit module is cached.
* [FFI][REFACTOR] Cleanup API locations (apache#18218)
  This PR cleans up the env api and moves it to extra.
* [Fix] Resolve deadlock in PopenPoolExecutor and LocalBuilder (apache#18219)
  - Add explicit shutdown flag in PopenPoolExecutor
  - Replace del with explicit shutdown() calls in LocalBuilder
* [Thrust] Fix getting CUDA stream (apache#18220)
  This PR updates the `GetCUDAStream` in the CUDA thrust integration to the latest `TVMFFIEnvGetCurrentStream` interface.
* [NVSHMEM] Fix compatibility with CUDA code without nvshmem use (apache#18222)
  This PR fixes two bugs that caused normal TIR functions (ones that don't use any NVSHMEM API) to fail to compile and run in cases where `set(USE_NVSHMEM xxx)` is enabled.
  Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu>
* [FFI] Fix JSON parser/writer for the fast-math flag (apache#18221)
  This PR fixes the JSON parser and writer to support the fast-math flag. Prior to this PR, the following error occurred when compiling TVM with fast-math; this PR fixes the issue.
  ```
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc: In static member function ‘static bool tvm::ffi::json::JSONWriter::FastMathSafeIsNaN(double)’:
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc:69:22: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     69 |   uint64_t bits = *reinterpret_cast<const uint64_t*>(&x);
        |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc: In static member function ‘static bool tvm::ffi::json::JSONWriter::FastMathSafeIsInf(double)’:
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc:84:22: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     84 |   uint64_t bits = *reinterpret_cast<const uint64_t*>(&x);
        |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```
* [ROCm] Minor fixes for latest refactor (apache#18225)
  This PR fixes a few ROCm and hipBLAS build issues after recent refactors.
* [CI] Exit the build for AbortException (apache#18227)
* [FFI][REFACTOR] Establish tvm_ffi python module (apache#18226)
  This PR establishes tvm_ffi as a standalone python module. The ffi is structured as a minimal pip module that can be directly installed by path or URL; examples/get_started provides a minimal example. This is a major change: as we are decoupling tvm_ffi into a separate package, users need to install tvm_ffi separately. Thanks to its minimal dependencies, tvm_ffi can be easily installed even just from source via `pip install ./ffi`. This change enables future improvements for library plugins to have lightweight dependencies by working on top of tvm_ffi alone, while the main compiler toolchain and runtime can be layered on top.
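The strict-aliasing error above comes from dereferencing a type-punned pointer; the usual fix is a byte-level copy (std::memcpy in C++). The same bit-pattern test can be sketched in Python, as an illustration of the idea rather than the actual tvm-ffi code:

```python
import struct

def double_bits(x: float) -> int:
    # Copy the double's bytes into an integer — the safe analogue of
    # std::memcpy, with no type-punned pointer dereference.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def fastmath_safe_isnan(x: float) -> bool:
    # NaN: exponent bits all ones, mantissa nonzero.
    exponent = (double_bits(x) >> 52) & 0x7FF
    mantissa = double_bits(x) & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa != 0

def fastmath_safe_isinf(x: float) -> bool:
    # Inf: exponent bits all ones, mantissa zero.
    exponent = (double_bits(x) >> 52) & 0x7FF
    mantissa = double_bits(x) & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa == 0
```

Testing the raw bit pattern this way keeps working even when `-ffast-math` makes the compiler assume `x != x` is always false.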
* [FFI] Improve traceback setups
  This PR improves traceback-related setups
* [FFI] Robustify the pyproject setup (apache#18233)
  This PR robustifies the pyproject setup for compatibility with cibuildwheel
* [LLVM][Fix] Do not emit debuginfo on vscale or other unknown types (apache#18232)
* [FFI] Misc fixup for windows (apache#18234)
  This PR cleans up the ffi module to make it compatible with Windows.
* [FFI][BUGFIX] Fix type_traits on DataType after SmallStr update (apache#18237)
  This PR fixes the type_traits on DataType after the SmallStr update. We need to explicitly zero out the FFIAny data structure to allow fast comparison of FFIAny based on byte values.
* [CUTLASS] Fix CUTLASS kernel compilation (apache#18238)
  This PR fixes a few places in the current CUTLASS kernel AOT compilation.
* [Relax] ONNX frontend using relax softplus operator (apache#18242)
* [Relax] Add Python function support and BasePyModule for PyTorch integration (apache#18229)
  ### **Overview**
  This PR implements native Python function support in TVM Relax through the `@I.pyfunc` decorator and `BasePyModule`, which enable seamless integration between TVM's compilation pipeline and Python/PyTorch runtime environments. This enhancement allows users to write Python functions directly in TVMScript that can interoperate with Relax and TIR functions, providing enhanced debugging capabilities and leveraging existing PyTorch operator libraries.
  ### **Key Features**
  **TVMScript Parser Enhancement**
  - `@I.pyfunc` decorator: Marks Python functions for integration into IRModules
  - Dual storage format: Stores both the raw string representation (for TVMScript printing) and the captured PackedFunc (for runtime execution)
  - ExternFunc representation: Each Python function is represented as an ExternFunc node with attributes storing source code and runtime wrapper
  **Complete BasePyModule Implementation**
  - DLPack-based tensor conversion: Seamless conversion between PyTorch tensors and TVM NDArrays
  - Cross-function interoperability: Python functions can call Relax/TIR functions and vice versa
  - JIT compilation: Delays compilation until module instantiation for flexible late-stage modifications
  - Dynamic function registration: Supports runtime addition of Python functions
  ### Future Work
  - TVMScript printer for IRModules with Python functions: Print IRModules in proper format with high-level operator mapping from Relax ops to PyTorch ops, handling symbolic shapes
  - R.call_py_func primitive: Introduce a Relax primitive to invoke the corresponding PackedFunc of specified Python functions at runtime
* [Fix] Update FlashInfer JIT header lookup (apache#18244)
  This PR fixes the tvm/dlpack/dmlc header lookup in the FlashInfer kernel JIT compilation. Prior to this fix, the JIT compilation assumed the environment variable `TVM_SOURCE_DIR` is always defined, which is not always true. This PR fixes the behavior and considers multiple cases, including TVM source builds and pip-installed packages.
* [LLVM][MSWIN][CI] Fix LLVM module build with latest CI update (apache#18245)
* [FFI][CMAKE] Add missing download path for libbacktrace (apache#18246)
* [Build] Migrate Python packaging to pyproject.toml with scikit-build-core (apache#18239)
  This PR migrates the TVM Python packaging system from the setup.py flow to the modern, PEP 517/518 compliant pyproject.toml standard, which allows us to produce a single, Python-version-agnostic wheel.
  This change streamlines the process for both developers and users. For local development, you can now set up a fully-functional editable environment with a single command: `pip install -e .`. To create the distributable package for release, simply run `pip wheel -w dist .`, which will produce a universal wheel in the `dist/` folder. This ensures that end-users can reliably install TVM with a standard `pip install tvm`, regardless of their specific Python 3 version.
* [FFI][CMAKE] Revert cmake libbacktrace URL and update submodule (apache#18249)
  * Revert the URL out from cmake for libbacktrace
  * Switch git submodule to upstream HEAD instead
  As discussed in apache#18246 (comment), this reverts in favour of the git submodule approach. As found in the same discussion, the upstream [already](https://github.com/ianlancetaylor/libbacktrace/blob/793921876c981ce49759114d7bb89bb89b2d3a2d/macho.c#L1273-L1275) incorporates [the one patch](ianlancetaylor/libbacktrace@master...tlc-pack:libbacktrace:master) used, and macOS works fine.
* [Python] Update version.py to bump pyproject.toml automatically (apache#18248)
  This PR updates `version.py` so that every time this file is run, it also bumps the version number in `pyproject.toml` automatically.
* [Python] Complete Python packaging with scikit-build-core (apache#18251)
  Following apache#18239, this PR fixes a few issues we ran into while testing the packaging flow through scikit-build-core.
* upgrade cutlass v4.2.0 supporting cuda 13 (apache#18236)
* [FFI][ABI] ABI updates for future metadata and complex ordering (apache#18254)
  This PR updates the ABI to support a potential future need to get metadata from a dynamically loaded module. It orders the current static objects into simple objects that have a C ABI and more complex ones that may need C++. These changes make the ABI future-compatible before we freeze it.
* [FFI][DOCS] Wheel Packaging (apache#18256)
  This PR adds an example of wheel packaging. Also fixes various minor source-packaging nits.
* [FFI] fix two seemingly migration issue (apache#18258)
* [FFI][ABI] Introduce weak rc support (apache#18259)
  This PR adds weak ref counter support to the FFI ABI.
  - When the strong rc goes to zero, we call the destructor of the object but do not free the memory
  - When both the strong and weak rc go to zero, we perform the memory free operation
  The weak rc mechanism is useful when we want to break cyclic dependencies in objects, where the weak rc can keep the memory alive while the destructor is still called. As of now, because we deliberately avoid cycles in the codebase, we do not have a strong use-case for weak rc. However, given that weak rc is common practice in shared_ptr and Rust Rc, and is also used in torch's c10::intrusive_ptr, it is better to make sure the ABI is future-compatible with such use-cases before we freeze. This PR implements the weak rc as a u32 counter and the strong rc as a u64 counter, with the following design considerations.
  - Weak rc is very rarely used and u32 is sufficient.
  - Keeping weak rc in u32 keeps the object header size at 24 bytes, saving an extra 8 bytes (considering alignment)
  We also need to update the deleter to take flags that cover both weak and strong deletion events. The implementation optimizes the common case where both strong and weak go to 0 at the same time, calling the deleter once with both flags set.
* [FFI][DOCS] Add missing files in packaging example (apache#18261)
  This PR adds the missing files in the packaging example and also renames get_started to quick_start
* [BugFix][NNAPI] Use kind() instead of type_key() after FFI refactor (apache#18262)
  This commit updates nnapi_runtime.cc to override kind() instead of type_key(), aligning NNAPI with the new FFI interface.
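The strong/weak refcount lifecycle described above can be modeled with a toy sketch. This is a hypothetical Python model of the semantics, not the tvm-ffi ABI itself:

```python
class RcState:
    """Toy model: strong rc -> 0 runs the destructor; memory is freed
    only once both strong and weak rc reach 0."""

    def __init__(self):
        self.strong = 1       # u64 in the real ABI
        self.weak = 0         # u32 in the real ABI, keeping the header at 24 bytes
        self.destructed = False
        self.freed = False

    def dec_strong(self):
        self.strong -= 1
        if self.strong == 0:
            self.destructed = True   # call the destructor, keep the memory
            if self.weak == 0:
                self.freed = True    # common case: one deleter call, both flags set

    def dec_weak(self):
        self.weak -= 1
        if self.weak == 0 and self.strong == 0:
            self.freed = True        # last weak ref releases the memory
```

A weak reference held across the strong count reaching zero keeps the memory alive (so the weak ref can still be probed) even though the destructor has already run, which is what breaks reference cycles.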
  Behavior is consistent with other runtimes that were updated in commit b8eb80b.
* [FFI][DOCS] Initial docs scaffolding (apache#18263)
* [DOCS] Misc docs fix (apache#18264)
  This PR provides misc docs fixes: it updates the requirements of the ffi docs, removes stale webpages from the header, and updates the embedding script to allow a path.
* [Build] Complete TVM wheel building migration (apache#18252)
* [Relax] Building TVMScript printer for IRModules with Python functions (apache#18253)
  This PR implements a TVMScript printer to format IRModules containing `@I.pyfunc` decorated Python functions. Example:
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def python_func(self, x, y):
          x_tvm = self._convert_pytorch_to_tvm(x)
          y_tvm = self._convert_pytorch_to_tvm(y)
          result = self.call_tir(self.add_tir, [x_tvm, y_tvm], out_sinfo=R.Tensor((5,), "float32"))
          return self._convert_tvm_to_pytorch(result)

      @T.prim_func
      def add_tir(a: T.handle, b: T.handle, c: T.handle):
          A = T.match_buffer(a, (5,), "float32")
          B = T.match_buffer(b, (5,), "float32")
          C = T.match_buffer(c, (5,), "float32")
          for i in range(5):
              C[i] = A[i] + B[i]

  # Usage:
  print(MyModule.script())  # Print formatted TVMScript
  MyModule.show()           # Display formatted output
  ```
* [FFI] Update torch stream getter to use native torch c api (apache#18266)
  This PR updates the torch stream getter to use _cuda_getCurrentRawStream from the torch C API that is also used by dynamo, saving us from load_inline-ing a custom module.
* [FFI] Support Opaque PyObject (apache#18270)
  This PR adds support for Opaque PyObject.
  When a type in python is not natively supported by ffi, it will now be converted to an Opaque PyObject on the backend. Such opaque objects retain their lifecycle automatically and can still be used by registering python callbacks, stored in containers, and returned to the frontend.
* Round of grammar polishment
* [FFI] Support inline module (apache#18271)
  This PR adds initial support for load_inline in tvm_ffi
* [FFI] Construct NDArray.strides by default (apache#18272)
  This PR updates NDArray.strides to construct strides by default
* [FFI][ABI] Append symbol prefix for ffi exported functions (apache#18273)
  Previously we simply took the raw symbol for DSO libraries. This can cause symbol conflicts between functions that take the ffi calling convention and those that do not. This PR updates the convention to ask LLVM and library modules to always append a `__tvm_ffi_` prefix to function symbols; this way we no longer have conflicts in the TVM_FFI_EXPORT_DLL_TYPED macro
* [FFI] Update the interface of `ffi.load_inline` to match torch (apache#18274)
  This PR updates the interface of ffi.load_inline to match torch.utils.cpp_extensions.load_inline:
  - Rename cpp_source to cpp_sources, cuda_source to cuda_sources.
  - Unify the cpp_functions and cuda_functions into functions.
  - Add build_directory to allow the user to specify the build directory directly.
* [FFI][REFACTOR][ABI] Rename NDArray to Tensor (apache#18275)
  This PR updates NDArray => Tensor. Both tensor and ndarray are commonly used terms. Because the term Tensor is becoming more common in the context of ML, we do the rename to stay more aligned with torch.Tensor and DLTensor.
* [FFI] Add ffi::Tensor.strides() (apache#18276)
* [FFI][REFACTOR] Cleanup tvm_ffi python API and types (apache#18277)
  This PR cleans up the python API to make things more consistent with the existing Python array API and torch.
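The default strides constructed for a dense tensor follow the usual C-contiguous (row-major) convention. A minimal sketch of that computation, illustrating the convention rather than the tvm-ffi implementation:

```python
def default_strides(shape):
    # Row-major strides in elements: the last axis is contiguous
    # (stride 1), and each earlier axis strides over the product of
    # the extents of all axes after it.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(default_strides([2, 3, 4]))  # [12, 4, 1]
```

This matches what numpy reports (in elements rather than bytes) for a freshly allocated contiguous array.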
  Device update
  - device_id => index, to be consistent with torch
  - device_type => dlpack_device_type() returns int
  - added type property, same as torch.device
  API updates:
  - Move convenience methods like cpu() out into the tvm runtime to keep device minimal
  - tvm_ffi._init_api => tvm_ffi.init_ffi_api
  - tvm_ffi.register_func => tvm_ffi.register_global_func
* [FFI] Temp skip load_inline tests nonlinux (apache#18278)
  This PR temporarily skips load_inline tests on non-Linux platforms before we enhance and improve them for other platforms.
* [LLVM][METASCHEDULE] Add RISCV V-extension v1.0 kernels to metaschedule (apache#18243)
  - Enables high-performance kernels covering the majority of usual ML datatype inputs
  - It is currently compliant with RVV specs version v1.0 (does not work with older v0.7.1)
  - TIR kernels implemented here use the recently added VLA extension support
* [FFI][DOCS] Initial bringup of cpp docs (apache#18279)
  This PR brings up the initial version of the cpp api docs.
* [FFI][Bugfix] Fix bug of `ffi.cpp.load_inline` on Windows (apache#18281)
  This PR enables load_inline on the Windows platform.
* [FFI][REFACTOR] Cleanup namespace (apache#18280)
  This PR cleans up the namespace to ensure all ffi classes are accessed through the ffi:: namespace. This helps clean up the ffi package before isolation.
  * fix hexagon
* [FFI] Relax default alignment and contiguous requirement (apache#18282)
  This PR relaxes the default alignment and contiguous requirement in dlpack import. This allows the ffi to be useful in most settings. We also provide utilities for users to check these requirements themselves.
* [Fix][Metal] Fix type for device array in Metal API (apache#18283)
  This PR fixes a typo in the previous ffi namespace cleanup.
* [Relax] Add Relax to Python Function Converter (apache#18269)
  ### Overview
  This PR implements a Relax to Python Function Converter that transforms Relax functions into executable Python functions using PyTorch operations.
  This enables seamless conversion between TVM's Relax IR and Python/PyTorch environments, providing enhanced debugging capabilities and leveraging existing PyTorch operator libraries for testing and deployment purposes.
  ### Key Features
  - **High-level operator mapping**: Maps 60+ Relax operators to corresponding PyTorch APIs
  - **Special operation handling**: Supports `call_tir`, `call_dps_packed`, and Relax function calls with DLPack integration
  - **Symbolic shape support**: Handles symbolic shapes and dynamic tensor operations
  ### **Example**
  ```python
  from tvm.relax.relax_to_pyfunc_converter import RelaxToPyFuncConverter

  # Convert Relax functions to Python functions
  converter = RelaxToPyFuncConverter(ir_module)
  converted_ir_mod = converter.convert("my_function")

  # Execute converted function with PyTorch tensors
  result = converted_ir_mod.pyfuncs['my_function'](input_tensor)
  ```
* [FFI][REFACTOR] Introduce UnsafeInit and enhance ObjectRef null safety (apache#18284)
  This PR enhances the null-pointer and general type safety of ObjectRef types. Previously, ObjectRef relied on the constructor from ObjectPtr<Object> for casting and for initialization from nullptr. We introduce a tag, ffi::UnsafeInit, which explicitly states the intent that the initialization is unsafe and may initialize a non-nullable Ref to null. Such a tag should only be used in controlled scenarios. Now the general RefType(ObjectPtr<Object>) is removed. We still keep RefType(ObjectPtr<ContainerType>) for nullable objects, but remove the default definition from non-nullable types, knowing that users can always explicitly add it to the class implementation (ensuring null checking).
* [Hotfix] Fix the conflicts about ffi-related updated names (apache#18287)
  * Change registration of mock softmax function
  * Update check_asf_header.sh
  * Remove unnecessary blank line in check_asf_header.sh
* [FFI][Bugfix] Enable `load_inline` on macos (apache#18285)
  This PR fixes the bug, enabling `tvm_ffi.cpp.load_inline` on macOS. We need to link `libtvm_ffi.dylib` into the custom module.
* [Metal] Fix MetalModuleCreate (apache#18290)
  This PR fixes a type mismatch in MetalModuleCreate when initializing a MetalModule. The error did not show up until the recent ObjectRef null-safety change.
* [3rdparty] Bump cutlass_fpA_intB_gemm to fix SM90 build (apache#18291)
  This PR fixes an SM90 build issue when CUTLASS is enabled. The issue is that a source file included a CUTLASS header file that has been removed since CUTLASS 4. Simply removing the header fixes the build issue.
* [FFI][REFACTOR] Streamline Object Declare Macros (apache#18289)
* [Fix] Set DRefObj and CUDAIPCMemoryObj as mutable (apache#18294)
  This PR marks `DRefObj` and `CUDAIPCMemoryObj` as mutable object classes. The flags were missed during the previous macro refactor.
* [FFI][ABI] Introduce generic stream exchange protocol (apache#18295)
  This PR adds a __tvm_ffi_env_stream__ protocol for generic tensors to exchange the env stream with tvm ffi. Also renames TVMFFIEnvSetStream to TVMFFIEnvSetCurrentStream.
* [FFI] Temp skip windows tests (apache#18297)
* [Fix] Add libxml2 dependency to fix Windows CI build failure (apache#18296)
* [FFI] Fix system library symbol lookup (apache#18298)
* [Relax] Add symbolic shape support to BasePyModule for dynamic tensor operations (apache#18288)
  This PR adds symbolic shape support to `BasePyModule`, which enables dynamic tensor operations with runtime shape inference.
  This allows users to use Relax's symbolic shape functionality in Python function calls through BasePyModule, with dimensions automatically resolved at execution time based on input tensor shapes.
  ## Usage Example
  ```python
  import tvm
  from tvm.script import ir as I, relax as R
  from tvm.relax.base_py_module import BasePyModule
  import numpy as np

  @I.ir_module
  class VectorAddModule(BasePyModule):
      @R.function
      def add(x: R.Tensor(("n",), "float32"), y: R.Tensor(("n",), "float32")) -> R.Tensor(("n",), "float32"):
          return R.add(x, y)

  module = VectorAddModule(device=tvm.cpu(0), target="llvm")
  a = np.array([1.0, 2.0, 3.0], dtype="float32")
  b = np.array([4.0, 5.0, 6.0], dtype="float32")
  result = module.add(a, b)  # Result: [5.0, 7.0, 9.0]
  ```
* [CUDA] Support NVTX in CUDA 13 (apache#18300)
  This PR adds support for NVTX in CUDA 13. Starting with CUDA 13, the nvtx functions have moved to the library `libnvtx3interop.so`, and the previous nvToolsExt library no longer exists. To ensure compatibility with both CUDA 12 and 13, we add libnvtx3interop.so to the library lookup list.
* [Python] Fix runtime tensor import (apache#18299)
  This PR fixes a few places where the python import of the runtime tensor is incorrect. The error wasn't revealed in the previous NDArray->Tensor rename PR since these imports are not at the top level.
* [FFI][REFACTOR] Refactor python ffi call mechanism for perf (apache#18302)
  This PR refactors the python ffi call mechanism. Previously, argument setting could become fragile, as things were sensitive to the order of if checks. This PR refactors the call path to leverage a C++-based dispatcher where each dispatch functor can be registered from Cython.
* Clear ext_lib_dll_names for macOS platform (apache#18304)
  Removed external library DLL names for macOS.
  found during tile-ai/tilelang#799 cc @LeiWang1999
* [Relax] Fix RelaxToPyFuncConverter compatibility and improve fallback handling (apache#18301)
  This PR fixes multiple compatibility issues in `RelaxToPyFuncConverter` caused by recent TVM API changes and improves the robustness of fallback tensor handling.
* [FFI][ABI][REFACTOR] Enhance DLPack Exchange Speed and Behavior (apache#18306)
  This PR enhances DLPack exchange by introducing DLPackPyObjectExporter, DLPackPyObjectImporter and DLPackTensorAllocator. These three function pointers help us speed up import/export with DLPack and also streamline the rare (but still sometimes useful) allocation inside the FFI. They significantly speed up autodlpack import. They also enable us to query the allocator from the env and return ffi::Tensor back to the caller environment (experimental): when a function takes a torch.Tensor as an argument, returned Tensor values will be converted to torch.Tensor. Also renames SetCurrentStream => SetStream to align with the style of the CUDA API. Finally, we add an option to select whether we release the GIL. We release the GIL by default, like ctypes; however, for short-running functions it may be helpful to set `func.release_gil = False`.
* [FFI] Update `load_inline` interface (apache#18307)
* [FFI][ABI] Refactor the naming of DLPack speed converter (apache#18308)
  Update the name to avoid potential confusion
* [FFI][ABI] Better String and Nested Container handling (apache#18311)
  This PR improves the overall String/Bytes and nested container handling. It also fixes a bug in temp object recycling.
  - Introduce formal API for string/bytes creation
  - Updates the tuple/dict conversion to also preserve the torch stream
    - So if a function takes a list of torch.Tensor, the torch stream will be set up in context
  - Optimizes recursive argument conversion by moving most logic into c++
* [FFI][REFACTOR] Update TVM_FFI_STATIC_INIT_BLOCK to fn style (apache#18312)
  This PR updates TVM_FFI_STATIC_INIT_BLOCK to function style. Now the code is written as follows, which is cleaner in general and also helps error reporting locate the right place.
  ```
  TVM_FFI_STATIC_INIT_BLOCK() { RegisterStaffs(); }
  ```
* [REFACTOR][FFI] Split tvm-ffi into a separate repo (apache#18314)
  This PR updates the code so we split tvm-ffi into a separate repo
* [FlashInfer] Update include path and interface (apache#18317)
  This PR updates the include path for FlashInfer JIT compilation, and also updates the plan function interface for attention prefill computation, to align with a recent interface change in flashinfer-ai/flashinfer#1661.
* [3rdparty] Remove dlpack/libbacktrace from 3rdparty (apache#18318)
  This PR removes the TVM dependency on dlpack and libbacktrace, as tvm-ffi has been separated out to https://github.com/apache/tvm-ffi.
* [TVMScript] Support continue and break in tvmscript (apache#17804)
  * support continue and break in tvmscript
  * fix black format
  * fix pylint issue
  * Update tests/python/tvmscript/test_tvmscript_syntax_sugar.py
  * add printer/parser test, fix lint
  * Fit to latest ffi update
  * Skip i386 numpy-related test
  * Introduce AnnotateIrregularLoop before any lowering loop expansions.
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [Relax] Introduce R.call_py_func operator for calling Python functions from Relax IR (apache#18313)
  This PR allows calling Python functions directly from Relax IR, enabling integration between Relax computations and Python/PyTorch operations.
  ### Usage Example
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def pytorch_add(self, x, y):
          return x + y

      @R.function
      def compute(x: R.Tensor((5,), "float32"), y: R.Tensor((5,), "float32")) -> R.Tensor((5,), "float32"):
          result = R.call_py_func("pytorch_add", (x, y), out_sinfo=R.Tensor((5,), "float32"))
          return result
  ```
* [TIR][CUDA] Preserve float precision in codegen with hexfloat output (apache#18320)
  Previously, `float` constants in codegen were always emitted in **scientific decimal format**, e.g.:
  ```cpp
  bfloat16_t(3.487723e-05f);
  ```
  This could introduce slight **rounding differences** compared to the actual binary representation, since the constant is printed and then re-parsed in decimal. We now emit the value in **hexadecimal floating-point format** (`std::hexfloat`) to preserve the exact binary value, and additionally include the decimal form as a comment for readability:
  ```cpp
  bfloat16_t(0x1.2492492492492p-15f /*3.487723e-05*/)
  ```
* [BugFix] Fixing binding for bert (apache#18324)
  * Fixing binding for bert
  * Fixing names
* [TIR] Add support for conditional expressions in TVMScript (apache#18323)
  This PR adds support for conditional expressions in the TVMScript parser, which allows developers to use Python-style conditional expressions:
  ```python
  @T.prim_func
  def func(A: T.buffer((128, 128), "float32")):
      for i, j in T.grid(128, 128):
          A[i, j] = i if i < j else j

  @T.prim_func
  def expected(A: T.buffer((128, 128), "float32")):
      for i, j in T.grid(128, 128):
          A[i, j] = T.if_then_else(i < j, i, j)
  ```
* Fixing datatype error for gpt-2 (apache#18328)
* [CMake][Web] Install `web/` directory in cmake for Python package (apache#18327)
  This PR updates the CMakeLists to install the web subdirectory when building the Python package, so that people do not need to clone the TVM source code to build the web package.
* [Relax][Backend] Implement R.call_py_func operator for calling Python functions from compiled TVM (apache#18326)
  This PR implements the `R.call_py_func` operator that allows compiled TVM Relax modules to call Python functions at runtime. This enables integration between TVM's compiled code and Python through a robust VM backend implementation.
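The hexfloat change above relies on a property that Python exposes directly: the hexadecimal floating-point form round-trips to the exact binary value, while a short decimal rendering may not. A quick demonstration:

```python
x = 3.487723e-05

# The hexadecimal form encodes the exact bit pattern of the double,
# so parsing it back recovers the identical value.
hex_form = x.hex()
assert float.fromhex(hex_form) == x  # exact round-trip

# Re-parsing a truncated decimal rendering generally does not
# recover the original value.
decimal_form = "%.3e" % x            # '3.488e-05'
assert float(decimal_form) != x
```

This is the same reason `std::hexfloat` output is lossless in the C++ codegen: hex digits map one-to-one onto mantissa bits, so no decimal rounding step is involved.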
  #### Simple Usage with BasePyModule
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def torch_relu(self, x):
          return torch.relu(x)

      @R.function
      def forward(x: R.Tensor((10,), "float32")) -> R.Tensor((10,), "float32"):
          return R.call_py_func("torch_relu", (x,), out_sinfo=R.Tensor((10,), "float32"))
  ```
  #### Direct VM Backend Usage (Manual)
  ```python
  # Manually register Python function with VM backend
  register_func = tvm.get_global_func("vm.builtin.register_py_func")
  register_func("my_func", my_python_function)

  # Use in Relax function (compiled to VM backend)
  @R.function
  def test(x: R.Tensor((5,), "float32")) -> R.Tensor((5,), "float32"):
      return R.call_py_func("my_func", (x,), out_sinfo=R.Tensor((5,), "float32"))

  # Manual cleanup (required for direct VM backend usage)
  clear_func = tvm.get_global_func("vm.builtin.clear_py_func_registry")
  clear_func()
  ```
* [flashinfer] Support directing JIT to FlashInfer GroupedGemm kernels (apache#18325)
  In tvm/python/tvm/relax/backend/cuda/flashinfer.py, added a `gen_grouped_gemm_module`. In tvm/tests/python/relax/test_group_gemm_flashinfer.py, added tests for different combinations of:
  - input and output types: ("float8_e4m3fn", "float8_e4m3fn", "bfloat16"), ("float8_e4m3fn", "float8_e4m3fn", "float16"),
  - scale granularity of m, n, k: (1, 128, 128),
  - scale major mode: "MN", "K"
  - mma_sm: 1, 2
  - different batch sizes and m_sizes
* [Relax][Frontend][ONNX] Error converting operator Expand: TVMError: broadcast_to expects the input tensor shape is broadcastable to the target shape (apache#18329)
* [Relax] Operator and RoPE support for Llama4 (apache#18336)
  Added Llama4 implementation, new RoPE implementation
* Fix conflict parameter name promote_dtye in FP8ComputeLegalize (apache#18334)
* [FFI][ABI] Bump version ffi to latest (apache#18332)
  This PR bumps the version of tvm-ffi to latest, which involves an ABI change.
* [Relax][PyTorch] Support MatrixMultiply op for ExportedProgram importer (apache#18343)
  This PR supports `mm.default` for the ExportedProgram importer. Resolves issue apache#18339.
* [Relax] Update BasePyModule with faster DLPack converter for tensor conversion (apache#18331)
  This PR enhances `BasePyModule` by integrating a faster DLPack converter for efficient tensor conversion between TVM and PyTorch, following apache#18306.
* [TIR] Support sequence comparisons in TVMScript (apache#18341)
  Implement proper parsing and evaluation of chained comparison operators (e.g., `0 < i < 128`) in TVMScript. The sequence comparisons are now correctly expanded to their logical equivalents (e.g., `(0 < i and i < 128)`).
  Changes:
  - Updated expression evaluator to handle sequence comparisons correctly
  - Added test case to verify sequence comparison functionality
* [FFI][ABI] Bump tvm-ffi version to reflect RC ABI Update (apache#18345)
  This PR bumps the tvm-ffi version. The latest version contains a change to the RC ABI that also needs a web runtime update.
* [Python] Add library lookup path for tvm installed as a package (apache#18348)
* [FFI][ABI] Bump tvm-ffi to latest (apache#18349)
  This PR bumps tvm-ffi to latest, which introduces ShapeView and minimizes the TensorObj ABI.
* [Relax][Frontend][Torch] Fix parsing error when input dimension of unbind is 1 (apache#18351)
* [Fix] Update ShapeView use in nccl.cc (apache#18352)
  This PR fixes the use of ShapeView in nccl.cc, which was using `Shape()->Product()`. This has been changed to `Shape().Product()` with the introduction of ShapeView.
* [Relax][PyTorch] Support lstm op for ExportedProgram importer (apache#18346)
  This PR supports `lstm.input` for the ExportedProgram importer.
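The chained-comparison expansion described in apache#18341 can be mimicked with Python's own `ast` module (requires Python 3.9+ for `ast.unparse`). This is a standalone sketch of the transformation, not the TVMScript parser's actual implementation:

```python
import ast

OP_TEXT = {ast.Lt: "<", ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">=",
           ast.Eq: "==", ast.NotEq: "!="}

def expand_chained(expr: str) -> str:
    # Expand 'a < b < c' into '(a < b) and (b < c)': each adjacent
    # operand pair becomes one comparison, joined by 'and'.
    node = ast.parse(expr, mode="eval").body
    assert isinstance(node, ast.Compare)
    operands = [node.left, *node.comparators]
    pairs = [
        f"({ast.unparse(lhs)} {OP_TEXT[type(op)]} {ast.unparse(rhs)})"
        for lhs, op, rhs in zip(operands, node.ops, operands[1:])
    ]
    return " and ".join(pairs)

print(expand_chained("0 < i < 128"))  # (0 < i) and (i < 128)
```

Note that the shared operand (`i` above) appears in two comparisons after expansion, which is exactly why a naive left-to-right binary evaluation of `Compare` nodes gives the wrong result.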
  This links to issue apache#18340
* [Relax][ONNX] Support AllClassNMS Operator for ONNX Frontend (apache#18321)
  Following apache#18175, this PR supports the AllClassNMS Operator for the ONNX Frontend
* [FFI][ABI] Bump tvm-ffi to latest (apache#18354)
  This PR bumps the tvm-ffi module to latest
* [CUDA] Update FlashInfer JIT integration (apache#18353)
  Following the recent JIT refactor in FlashInfer that uses TVM FFI as the JIT interface, this PR updates the JIT integration of FlashInfer in TVM. Major changes:
  * we leverage FlashInfer's `JitSpec.build_and_load` to compile all the JIT-generated source files, and remove the compilation logic in TVM.
  * for efficient tensor buffer management and efficient pointer calculation, we enforced all `byte_offset` fields of auxiliary tensors in the KV cache to be zeros. The byte offset is now directly applied to the data pointers.
  * we also add a new parameter to FlashInfer JIT that controls whether to return a linked shared library, or a list of compiled object paths. For unit tests, returning a shared library is convenient and preferred, while for cases such as MLC model compilation, object files are needed to serialize the compiled model.
* rebase && update tvm-ffi * fix(kvcache): restore GPU support lost during rebase --------- Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> Co-authored-by: Phoslight <136557413+Phoslight@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: wrongtest <wrongtest0@gmail.com> Co-authored-by: wrongtest <wrongtest@gmail.com> Co-authored-by: chenxinli <39092231+cccxinli@users.noreply.github.com> Co-authored-by: Park Woorak <grf0503@gmail.com> Co-authored-by: ysh329 <ysh329@users.noreply.github.com> Co-authored-by: Yong Wu <yongcale@gmail.com> Co-authored-by: Chenfan <jcf94@outlook.com> Co-authored-by: Balint Cristian <cristian.balint@gmail.com> Co-authored-by: zyl_keep_moving <dream20151224@163.com> Co-authored-by: Wei Wang <w10493wang@163.com> Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu> Co-authored-by: Marcel Dudek <43888122+MarcelDudek@users.noreply.github.com> Co-authored-by: Shushi Hong <820958424@qq.com> Co-authored-by: Johnny <johnnync13@gmail.com> Co-authored-by: Henry Hsieh <72457607+Henryshsieh@users.noreply.github.com> Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com> Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> Co-authored-by: Yichen Yan <oraluben@outlook.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: Thais Camacho <thaiscamachoo@gmail.com> Co-authored-by: Siyuan Feng <25500082+Hzfengsy@users.noreply.github.com> Co-authored-by: Anrui(Henry) Liu <98249030+neurusL@users.noreply.github.com> Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw> Co-authored-by: Pranav Venkatram <56809863+giterator@users.noreply.github.com> Co-authored-by: Qingchao Shen <qingchaoshen@outlook.com> Co-authored-by: Masahiro Hiramori <Hiramori.Masahiro@ct.MitsubishiElectric.co.jp> Co-authored-by: Ruxiao Yin <78540598+eaten-cake@users.noreply.github.com> Co-authored-by: cwx <you@example.com>
Add a Metal backend via TVM's Metal codegen and torch's `torch.mps.compile_shader`, and create a new `torch` execution backend, which only supports MPS for now. Some tasks are left for future work.

Performance numbers:
- PyTorch's MPS backend uses `matrixMultiplicationWithPrimaryTensor` by default, which is faster than the default implementation.
- Against PyTorch's native MPS mm implementation (enabled with `PYTORCH_MPS_PREFER_METAL=1`), we observe a significant speedup.

In general, tilelang is 2-3x faster than the torch native implementation and 2-3x slower than the MPSGraph implementation. The tilelang implementation could be further optimized by implementing `T.gemm` via `matrixMultiplicationWithPrimaryTensor`. Outliers in the table below appear to be errors and disappear after a rerun.

script: https://gist.github.com/oraluben/89e54b0b687a1fa483c38429442a6d1e
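The launch path described above (TVM emits Metal source, and the `torch` backend compiles it on first call) can be sketched roughly as follows. This is an illustrative assumption, not tilelang's actual adapter code: the kernel source and the `add_one` name are invented for the example, and `torch.mps.compile_shader` is a prototype API in recent PyTorch releases, so the launch only runs on a machine with an MPS device.

```python
import torch

# Hypothetical Metal source standing in for what TVM's Metal codegen emits.
METAL_SRC = """
#include <metal_stdlib>
using namespace metal;

kernel void add_one(device float *out [[buffer(0)]],
                    device const float *inp [[buffer(1)]],
                    uint idx [[thread_position_in_grid]]) {
    out[idx] = inp[idx] + 1.0f;
}
"""

def run():
    """Compile the shader and launch it on MPS tensors, if MPS is available."""
    if not torch.backends.mps.is_available():
        print("MPS not available; skipping launch")
        return None
    # In the real adapter the compiled library is cached after the first call.
    lib = torch.mps.compile_shader(METAL_SRC)
    inp = torch.arange(8, dtype=torch.float32, device="mps")
    out = torch.empty_like(inp)
    lib.add_one(out, inp)  # dispatched with one thread per element (assumption)
    return out.cpu()

if __name__ == "__main__":
    print(run())
```

On non-Apple hardware the sketch degrades to a no-op, which mirrors why the PR restricts the `torch` execution backend to MPS for now.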
TODOs:
- Support the `torch.mps` device and unify cuda with `torch.accelerator` (or should we use cuda by default and do this later?)

Future work:
- Waiting for pytorch/pytorch#163721 ([MPS] Extension segfault after #152210, affects >= 2.8)
- (performance) Dispatch to MPSGraph (e.g. mm) for supported ops for better performance
- Implement `T.gemm` via `matrixMultiplicationWithPrimaryTensor`
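The TODO about unifying cuda with `torch.accelerator` refers to PyTorch's device-agnostic accelerator API (available since roughly PyTorch 2.6). A hedged sketch of how a backend could select a device without hardcoding cuda or mps; tilelang's actual device selection may differ:

```python
import torch

def pick_device() -> torch.device:
    """Prefer whatever accelerator PyTorch detects (cuda, mps, xpu, ...),
    falling back to CPU. Illustrative sketch only."""
    # torch.accelerator abstracts over backend-specific availability checks.
    if hasattr(torch, "accelerator") and torch.accelerator.is_available():
        return torch.accelerator.current_accelerator()
    return torch.device("cpu")

if __name__ == "__main__":
    print(pick_device())
```

With such a helper, the same call site would pick CUDA on NVIDIA machines and MPS on Apple silicon, which is the unification the TODO asks about.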