[Backend] Add metal backend #799
Conversation
Walkthrough

Add Apple Metal (MPS) support across build, JIT, adapters, lowering, runtime, tests, CI and packaging; resolve prebuilt TVM library locations at configure time; add device utilities and generalize profiler/autotuner/cache behavior for the CUDA and MPS backends.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant JIT as tilelang.jit
    participant WR as TL Wrapper
    participant LWR as Lowering
    participant TVM as TVM (target.build.metal)
    participant AD as MetalKernelAdapter
    participant MPS as torch.mps runtime
    Dev->>JIT: jit(func, target="metal", execution_backend="torch")
    JIT->>JIT: validate is_metal_target -> require backend=="torch"
    JIT->>WR: wrap(scheduled_ir, source, target=metal)
    WR-->>JIT: TLMetalSourceWrapper
    JIT->>LWR: device_codegen_without_compile(target=metal)
    LWR->>TVM: call target.build.metal(device_mod, target)
    TVM-->>JIT: device_mod + kernel_source
    JIT->>AD: create MetalKernelAdapter(func_or_mod, device_mod, kernel_source)
    Dev->>AD: call kernel(...)
    AD->>AD: on first call: torch.mps.compile_shader(kernel_source)
    AD->>MPS: launch compiled kernel with threads/group_size
    MPS-->>Dev: result
    note over AD: compiled launcher cached for subsequent calls
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
While some tasks remain, the main implementation is ready for review @LeiWang1999
Actionable comments posted: 11
🧹 Nitpick comments (14)
tilelang/carver/arch/cdna.py (1)
35-38: Sort `__all__` (RUF022). Keep exports alphabetized for consistency and to satisfy Ruff.
Apply:

```diff
-__all__ = [
-    'is_cdna_arch',
-    'CDNA',
-]
+__all__ = [
+    'CDNA',
+    'is_cdna_arch',
+]
```

tilelang/carver/arch/cpu.py (1)
23-26: Sort `__all__` (RUF022). Align with other arch modules.

```diff
-__all__ = [
-    'is_cpu_arch',
-    'CPU',
-]
+__all__ = [
+    'CPU',
+    'is_cpu_arch',
+]
```

tilelang/contrib/cc.py (1)
93-95: Add typing/docstring to `is_mac()`. Small polish; helps usage elsewhere.

```diff
-def is_mac():
-    return platform.system() == 'Darwin'
+def is_mac() -> bool:
+    "True if running on macOS (Darwin)."
+    return platform.system() == "Darwin"
```

tilelang/carver/arch/metal.py (1)
12-15: Sort `__all__` (RUF022). Match the pattern used in other arch modules.

```diff
-__all__ = [
-    'is_metal_arch',
-    'METAL',
-]
+__all__ = [
+    'METAL',
+    'is_metal_arch',
+]
```

tilelang/carver/arch/cuda.py (1)
150-159: Sort `__all__` (RUF022). Keep exports alphabetized.

```diff
-__all__ = [
-    'is_cuda_arch',
-    'is_volta_arch',
-    'is_ampere_arch',
-    'is_ada_arch',
-    'is_hopper_arch',
-    'is_tensorcore_supported_precision',
-    'has_mma_support',
-    "CUDA",
-]
+__all__ = [
+    "CUDA",
+    'has_mma_support',
+    'is_ada_arch',
+    'is_ampere_arch',
+    'is_cuda_arch',
+    'is_hopper_arch',
+    'is_tensorcore_supported_precision',
+    'is_volta_arch',
+]
```

tilelang/utils/target.py (3)
1-1: Import `platform.machine()` for fallback. Some Python builds return an empty arch from `mac_ver()`; fall back to `machine()`.

```diff
-from platform import mac_ver
+from platform import mac_ver, machine
```
46-52: Harden Metal detection. Use `machine()` as a fallback when `mac_ver()[2]` is empty.

```diff
-def check_metal_availability() -> bool:
-    mac_release, _, arch = mac_ver()
-    if not mac_release:
-        return False
-    # todo: check torch version?
-    return arch == 'arm64'
+def check_metal_availability() -> bool:
+    mac_release, _, arch = mac_ver()
+    if not mac_release:
+        return False
+    # Fallback for environments where mac_ver() doesn't report machine.
+    arch = arch or machine()
+    return arch == "arm64"
```
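The fallback logic above can be sanity-checked without macOS hardware by factoring the decision into a pure helper. This is a sketch with the platform values injected; `metal_available` is a hypothetical name, not part of the PR:

```python
from platform import mac_ver, machine

def metal_available(release: str, arch: str, fallback_arch: str) -> bool:
    # No macOS release string means we're not on a Mac at all.
    if not release:
        return False
    # Some Python builds leave mac_ver()'s machine field empty; fall back.
    arch = arch or fallback_arch
    return arch == "arm64"

def check_metal_availability() -> bool:
    release, _, arch = mac_ver()
    return metal_available(release, arch, machine())
```

Separating the decision from the platform calls makes both the happy path and the empty-arch fallback unit-testable on any OS.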
84-87: Error message: say "Metal" (and keep it short per TRY003). Also consider updating the `determine_target` docstring to mention Metal.

```diff
-    else:
-        raise ValueError("No CUDA or HIP or MPS available on this system.")
+    else:
+        raise ValueError("No CUDA, HIP, or Metal available on this system.")
```

Docstring tweak (outside hunk):

```python
"""
Determine the appropriate target for compilation (CUDA, HIP, Metal, or manual selection).
"""
```
184-186: Parity: add a Metal branch to `device_codegen` as well (or explicitly gate it). `device_codegen_without_compile` supports "metal", but `device_codegen` lacks a corresponding branch; enabling device compile for Metal will currently raise.
Proposed addition (outside this hunk) in `device_codegen()`:

```python
elif target.kind.name == "metal":
    device_mod = tvm.ffi.get_global_func("target.build.metal")(device_mod, target)
```

install_metal.sh (1)
20-22: Consider more conservative parallelism settings. Using half the logical CPU cores might be too aggressive for some systems. Consider a safer default, or allow configuration via an environment variable.

```diff
-CORES=$(sysctl -n hw.logicalcpu)
-MAKE_JOBS=$(( CORES / 2 ))
-make -j${MAKE_JOBS}
+CORES=$(sysctl -n hw.logicalcpu)
+MAKE_JOBS=${MAKE_JOBS:-$(( CORES / 2 ))}
+echo "Building with ${MAKE_JOBS} parallel jobs"
+make -j${MAKE_JOBS}
```

tilelang/carver/arch/__init__.py (1)
40-55: Sort the `__all__` list for better maintainability. The static analysis correctly identified that `__all__` should be sorted for consistency and easier maintenance.

```diff
 __all__ = [
-    'is_cpu_arch',
-    'is_cuda_arch',
-    'is_volta_arch',
-    'is_ampere_arch',
-    'is_ada_arch',
-    'is_hopper_arch',
-    'is_tensorcore_supported_precision',
-    'has_mma_support',
-    'is_cdna_arch',
-    'is_metal_arch',
+    'CDNA',
+    'CPU',
     'CUDA',
-    'CDNA',
     'METAL',
-    'CPU',
+    'has_mma_support',
+    'is_ada_arch',
+    'is_ampere_arch',
+    'is_cdna_arch',
+    'is_cpu_arch',
+    'is_cuda_arch',
+    'is_hopper_arch',
+    'is_metal_arch',
+    'is_tensorcore_supported_precision',
+    'is_volta_arch',
 ]
```

tilelang/jit/adapter/wrapper.py (1)
1008-1028: Consider adding a docstring and error handling. The new `TLMetalSourceWrapper` class lacks documentation and error handling. Consider adding:

- A class docstring explaining its purpose
- Type hints validation
- Error handling for Metal-specific operations

```diff
 class TLMetalSourceWrapper(object):
+    """
+    A wrapper class for the TileLang Metal backend.
+
+    This wrapper handles Metal shader code generation for Apple Silicon GPUs.
+    Unlike CUDA/HIP wrappers, it performs minimal processing as Metal compilation
+    is handled downstream by PyTorch MPS.
+    """

     def __init__(self,
                  scheduled_ir_module: IRModule,
                  source: str,
                  target: Target,
                  device_mod: Optional[IRModule] = None,
                  host_mod: Optional[IRModule] = None,
                  pass_configs: Optional[Dict[str, Any]] = None):
+        if not is_metal_target(target):
+            raise ValueError(f"Expected Metal target, got {target.kind.name}")
         self.mod = scheduled_ir_module
         self.target = target
         self.source = source
         self.pass_configs = pass_configs
         self.device_mod = device_mod
         self.host_mod = host_mod
         self.lib_code = self.update_lib_code(source)

     def update_lib_code(self, code: str):
+        """Update and return the library code without modification.
+
+        Metal shaders don't require the host function wrapping that
+        CUDA/HIP backends need.
+        """
         self.lib_code = code
         return self.lib_code
```

tilelang/jit/adapter/torch/metal.py (1)
14-26: Clean up commented-out parameters or document why they're excluded. The constructor has many commented-out parameters. Either remove them entirely or add a comment explaining why they're excluded from the Metal implementation.

```diff
     def __init__(
         self,
-        # params: List[KernelParam],
-        # result_idx: List[int],
-        # target: Union[str, Target],
         func_or_mod: Union[tir.PrimFunc, tvm.IRModule],
-        # host_mod: Optional[tvm.IRModule] = None,
         device_mod: Optional[tvm.IRModule] = None,
         kernel_global_source: Optional[str] = None,
         verbose: bool = False,
-        # pass_configs: Optional[Dict[str, Any]] = None,
-        # compile_flags: Optional[List[str]] = None
     ):
+        """Initialize MetalKernelAdapter for PyTorch MPS execution.
+
+        Note: Unlike other adapters, Metal doesn't require params, result_idx,
+        or compilation flags as these are handled by PyTorch MPS.
+
+        Args:
+            func_or_mod: The function or module to compile
+            device_mod: Device module containing kernel metadata
+            kernel_global_source: Metal shader source code
+            verbose: Enable verbose logging
+        """
```

testing/python/metal/test_metal_codegen.py (1)
70-72: Consider a more appropriate tolerance for float16. An absolute tolerance of 1.0 for float16 seems too permissive for GEMM accuracy testing. Consider using a relative tolerance or a smaller absolute tolerance.

```diff
 def test_gemm_float16():
-    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
+    # Use relative tolerance for float16 to better handle different magnitude results
+    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1e-2)
```
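As a rough back-of-envelope for why `atol=1` is loose: if accumulation happens in fp16, rounding error in a K-length dot product with O(1) random inputs grows roughly like sqrt(K)·eps. This is a heuristic, not a bound, and the helper name is ours, not from the PR:

```python
import math

def fp16_gemm_abs_error_estimate(k: int, magnitude: float = 1.0) -> float:
    # fp16 has a 10-bit mantissa, so unit roundoff is about 2**-10 ~= 9.8e-4.
    eps_fp16 = 2.0 ** -10
    # Random rounding errors in a K-term sum accumulate roughly like sqrt(K).
    return math.sqrt(k) * eps_fp16 * magnitude

# For the K=1024 GEMM in the test, the expected error is around 0.03,
# well below the atol=1 currently used.
estimate = fp16_gemm_abs_error_estimate(1024)
```

If the backend accumulates in fp32 (common for GEMM), the expected error is smaller still, which strengthens the case for a tighter tolerance.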
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 409ab83 and 0eba92f6ad86fca30149d937543b01069784eafd.
📒 Files selected for processing (19)
- CMakeLists.txt (3 hunks)
- install_metal.sh (1 hunks)
- testing/python/metal/test_metal_codegen.py (1 hunks)
- tilelang/cache/kernel_cache.py (1 hunks)
- tilelang/carver/arch/__init__.py (3 hunks)
- tilelang/carver/arch/cdna.py (1 hunks)
- tilelang/carver/arch/cpu.py (1 hunks)
- tilelang/carver/arch/cuda.py (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
- tilelang/contrib/cc.py (2 hunks)
- tilelang/engine/lower.py (1 hunks)
- tilelang/jit/__init__.py (3 hunks)
- tilelang/jit/adapter/__init__.py (1 hunks)
- tilelang/jit/adapter/cython/adapter.py (3 hunks)
- tilelang/jit/adapter/torch/metal.py (1 hunks)
- tilelang/jit/adapter/utils.py (1 hunks)
- tilelang/jit/adapter/wrapper.py (3 hunks)
- tilelang/jit/kernel.py (4 hunks)
- tilelang/utils/target.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
tilelang/jit/adapter/utils.py (1)
  tilelang/language/ast/ir.py (1)
    target (1682-1713)
tilelang/carver/arch/metal.py (2)
  tilelang/carver/arch/arch_base.py (1)
    TileDevice (4-37)
  tilelang/carver/template/base.py (1)
    arch (152-159)
tilelang/jit/adapter/cython/adapter.py (3)
  tilelang/jit/adapter/utils.py (4)
    is_cuda_target (51-52)
    is_hip_target (55-56)
    is_cpu_target (59-60)
    is_metal_target (63-64)
  tilelang/utils/target.py (1)
    determine_target (54-96)
  tilelang/contrib/cc.py (2)
    get_cplus_compiler (68-90)
    is_mac (93-94)
tilelang/jit/adapter/__init__.py (1)
  tilelang/jit/adapter/torch/metal.py (1)
    MetalKernelAdapter (12-67)
tilelang/jit/kernel.py (3)
  tilelang/jit/__init__.py (1)
    jit (241-314)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/jit/adapter/torch/metal.py (1)
    MetalKernelAdapter (12-67)
tilelang/carver/arch/__init__.py (1)
  tilelang/carver/arch/metal.py (1)
    METAL (8-9)
tilelang/jit/__init__.py (2)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/language/ast/ir.py (1)
    target (1682-1713)
testing/python/metal/test_metal_codegen.py (6)
  tilelang/jit/__init__.py (1)
    jit (241-314)
  tilelang/language/allocate.py (2)
    alloc_shared (21-36)
    alloc_fragment (53-64)
  tilelang/language/fill.py (1)
    clear (24-48)
  tilelang/language/pipeline.py (1)
    Pipelined (9-46)
  tilelang/language/copy.py (1)
    copy (84-152)
  tilelang/jit/kernel.py (1)
    kernel_source (470-471)
tilelang/jit/adapter/torch/metal.py (2)
  tilelang/jit/adapter/base.py (2)
    BaseKernelAdapter (8-55)
    _post_init (54-55)
  tilelang/jit/adapter/nvrtc/adapter.py (1)
    _convert_torch_func (240-241)
tilelang/jit/adapter/wrapper.py (4)
  tilelang/jit/adapter/utils.py (1)
    is_metal_target (63-64)
  tilelang/jit/adapter/cython/adapter.py (1)
    lib_code (511-513)
  tilelang/jit/adapter/ctypes/adapter.py (1)
    lib_code (281-283)
  tilelang/jit/adapter/libgen.py (1)
    update_lib_code (53-54)
🪛 Ruff (0.12.2)
tilelang/carver/arch/cpu.py
23-26: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/carver/arch/cuda.py
150-159: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/utils/target.py
87-87: Avoid specifying long messages outside the exception class
(TRY003)
tilelang/carver/arch/metal.py
12-15: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/carver/arch/cdna.py
35-38: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/jit/adapter/__init__.py
5-5: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
6-6: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/carver/arch/__init__.py
2-2: from .cuda import * used; unable to detect undefined names
(F403)
3-3: from .cpu import * used; unable to detect undefined names
(F403)
4-4: from .cdna import * used; unable to detect undefined names
(F403)
5-5: from .metal import * used; unable to detect undefined names
(F403)
22-22: METAL may be undefined, or defined from star imports
(F405)
40-55: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
41-41: is_cpu_arch may be undefined, or defined from star imports
(F405)
42-42: is_cuda_arch may be undefined, or defined from star imports
(F405)
43-43: is_volta_arch may be undefined, or defined from star imports
(F405)
44-44: is_ampere_arch may be undefined, or defined from star imports
(F405)
45-45: is_ada_arch may be undefined, or defined from star imports
(F405)
46-46: is_hopper_arch may be undefined, or defined from star imports
(F405)
47-47: is_tensorcore_supported_precision may be undefined, or defined from star imports
(F405)
48-48: has_mma_support may be undefined, or defined from star imports
(F405)
49-49: is_cdna_arch may be undefined, or defined from star imports
(F405)
50-50: is_metal_arch may be undefined, or defined from star imports
(F405)
51-51: CUDA may be undefined, or defined from star imports
(F405)
52-52: CDNA may be undefined, or defined from star imports
(F405)
53-53: METAL may be undefined, or defined from star imports
(F405)
54-54: CPU may be undefined, or defined from star imports
(F405)
tilelang/jit/adapter/torch/metal.py
44-44: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
tilelang/carver/arch/metal.py (1)
8-9: Confirm TileDevice API expectations for METAL. METAL is currently a stub. If any path calls `TileDevice.get_avaliable_tensorintrin_shapes()`, this will raise NotImplementedError. If that call path can occur, add a trivial override returning `[]` to avoid surprises.
Example (outside this hunk):

```python
class METAL(TileDevice):

    def get_avaliable_tensorintrin_shapes(self):
        return []
```

tilelang/utils/target.py (1)
16-16: LGTM: add 'metal' to `AVALIABLE_TARGETS`. This aligns the public set with the new Metal path.
tilelang/jit/adapter/__init__.py (1)
1-6: LGTM! The addition of `MetalKernelAdapter` to the adapter exports is clean and follows the existing pattern.
tilelang/jit/adapter/utils.py (1)
63-65: LGTM! The `is_metal_target` function follows the established pattern and correctly identifies Metal targets.
tilelang/jit/adapter/cython/adapter.py (2)
156-158: LGTM! Platform-specific linker flags handled correctly. The dynamic linker flag selection based on the platform is appropriate for handling undefined symbols across macOS and Linux.
455-456: LGTM! Correct device mapping for Metal target. The Metal target correctly maps to the "mps" device in PyTorch.
tilelang/jit/adapter/wrapper.py (1)
1068-1069: LGTM! Metal target routing is correctly integrated. The Metal target detection and wrapper selection follows the existing pattern for other backends.
tilelang/jit/adapter/torch/metal.py (1)
61-61: Verify thread calculation logic. The threads calculation multiplies block and grid dimensions element-wise. This seems unusual - typically threads would be just the block dimensions, and the total work would be block * grid.
Could you verify whether this is the correct thread configuration for PyTorch MPS? The typical pattern would be:

- `threads` = block dimensions (threads per group)
- `group_size` = grid dimensions (number of groups)

The current implementation passes `threads=[block[i] * grid[i] for i in range(3)]`, which seems to represent the total threads across all groups.
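To make the two conventions concrete, here is a pure-Python sketch; the names are illustrative, and which convention `torch.mps` expects is exactly what this comment asks to verify:

```python
def launch_dims(block, grid):
    """Return (per_group_threads, total_threads) for a 3D launch.

    block: threads per threadgroup, e.g. from threadIdx extents.
    grid:  number of threadgroups, e.g. from blockIdx extents.
    """
    per_group = list(block)
    # Element-wise product per axis: what the adapter currently passes as `threads`.
    total = [b * g for b, g in zip(block, grid)]
    return per_group, total

per_group, total = launch_dims([16, 16, 1], [64, 64, 1])
# per_group == [16, 16, 1], total == [1024, 1024, 1]
```

If the runtime's `threads` argument means the total dispatch size per axis (like Metal's `dispatchThreads`), the element-wise product is correct; if it means threads per group, only `block` should be passed.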
79-80: Good practice: MPS availability check. The check for MPS availability before running tests is appropriate for Metal backend testing.
```diff
 find_library(TVM_LIBRARY_LOCATION
   NAMES tvm
   HINTS "${TVM_PREBUILD_PATH}"
 )
 set_target_properties(tvm PROPERTIES
-  IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm.so"
+  IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```
Add error handling for missing TVM libraries
The find_library commands should check if the libraries were actually found before proceeding.
```diff
 find_library(TVM_LIBRARY_LOCATION
   NAMES tvm
   HINTS "${TVM_PREBUILD_PATH}"
 )
+if(NOT TVM_LIBRARY_LOCATION)
+  message(FATAL_ERROR "Could not find TVM library in ${TVM_PREBUILD_PATH}")
+endif()
 set_target_properties(tvm PROPERTIES
   IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cmake
find_library(TVM_LIBRARY_LOCATION
  NAMES tvm
  HINTS "${TVM_PREBUILD_PATH}"
)
if(NOT TVM_LIBRARY_LOCATION)
  message(FATAL_ERROR "Could not find TVM library in ${TVM_PREBUILD_PATH}")
endif()
set_target_properties(tvm PROPERTIES
  IMPORTED_LOCATION "${TVM_LIBRARY_LOCATION}"
  INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
)
```
🤖 Prompt for AI Agents
In CMakeLists.txt around lines 102 to 109, the find_library call for TVM does
not verify that TVM_LIBRARY_LOCATION was found; add a check after find_library
such as if(NOT TVM_LIBRARY_LOCATION) message(FATAL_ERROR "Could not find tvm
library in ${TVM_PREBUILD_PATH}; please set TVM_PREBUILD_PATH or install tvm")
to stop configuration when missing, and only call set_target_properties for the
imported tvm target if the library path exists (or create the imported target
conditionally); ensure the error message includes the searched path for easier
debugging.
```diff
 find_library(TVM_RUNTIME_LIBRARY_LOCATION
   NAMES tvm_runtime
   HINTS "${TVM_PREBUILD_PATH}"
 )
 set_target_properties(tvm_runtime PROPERTIES
-  IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm_runtime.so"
+  IMPORTED_LOCATION "${TVM_RUNTIME_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```
Add error handling for missing TVM runtime library
Similar to the TVM library, the runtime library location should be validated.
```diff
 find_library(TVM_RUNTIME_LIBRARY_LOCATION
   NAMES tvm_runtime
   HINTS "${TVM_PREBUILD_PATH}"
 )
+if(NOT TVM_RUNTIME_LIBRARY_LOCATION)
+  message(FATAL_ERROR "Could not find TVM runtime library in ${TVM_PREBUILD_PATH}")
+endif()
 set_target_properties(tvm_runtime PROPERTIES
   IMPORTED_LOCATION "${TVM_RUNTIME_LIBRARY_LOCATION}"
   INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
 )
```

🤖 Prompt for AI Agents
In CMakeLists.txt around lines 111 to 118, the find_library result for
TVM_RUNTIME_LIBRARY_LOCATION is not validated; add the same existence check and
failure handling used for the TVM library: after find_library, test if
TVM_RUNTIME_LIBRARY_LOCATION is NOTFOUND (or empty) and call message(FATAL_ERROR
"...unable to locate tvm_runtime library at ${TVM_PREBUILD_PATH}..." ) so
configuration stops with a clear error; only set_target_properties for
tvm_runtime and INTERFACE_INCLUDE_DIRECTORIES when the library was successfully
found (mirror the pattern used for the TVM library block).
```python
if execution_backend in ("dlpack", "torch"):
    self.logger.warning("DLPack or torch backend does not support cache saving to disk.")
else:
```
🛠️ Refactor suggestion
Include 'torch' in type hints and clarify log.
Execution backend Literal types in this file exclude 'torch' (class attr and method params). This will trip type-checkers now that you gate disk-cache on 'torch'.
Apply minimal log tweak here:

```diff
-        if execution_backend in ("dlpack", "torch"):
-            self.logger.warning("DLPack or torch backend does not support cache saving to disk.")
+        if execution_backend in ("dlpack", "torch"):
+            self.logger.warning("Backends 'dlpack' and 'torch' do not support saving cache to disk.")
```

Also update type hints elsewhere in this file (outside the changed hunk):
```python
# line ~42
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython"
# line ~71
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
# line ~117
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
# line ~323
execution_backend: Literal["dlpack", "ctypes", "cython", "nvrtc", "torch"] = "cython",
```

🤖 Prompt for AI Agents
In tilelang/cache/kernel_cache.py around lines 189-191, the warning message and
the execution_backend type hints omit 'torch', causing type-check failures and
unclear logs; update the warning to mention "DLPack or torch backend does not
support saving cache to disk" (or similar clearer phrasing) and add "torch" to
the Literal union type hints at the listed locations: ~42, ~71, ~117, ~323 so
those class attributes and method params read Literal["dlpack", "ctypes",
"cython", "nvrtc", "torch"] = "cython" (or the equivalent default), ensuring
consistency across the file.
```python
from .cuda import *
from .cpu import *
from .cdna import *
from .metal import *
```
🛠️ Refactor suggestion
Wildcard imports reduce code clarity and maintainability
The wildcard imports make it difficult to track which symbols come from which module and can lead to namespace pollution. Consider explicit imports for better maintainability.
🧰 Tools
🪛 Ruff (0.12.2)
2-2: from .cuda import * used; unable to detect undefined names
(F403)
3-3: from .cpu import * used; unable to detect undefined names
(F403)
4-4: from .cdna import * used; unable to detect undefined names
(F403)
5-5: from .metal import * used; unable to detect undefined names
(F403)
🤖 Prompt for AI Agents
In tilelang/carver/arch/__init__.py around lines 2 to 5, the module currently
uses wildcard imports (from .cuda import * etc.), which pollutes the package
namespace and obscures which symbols are exported; replace these with explicit
imports of the public symbols you need from each submodule (or import the
modules themselves and re-export a curated __all__ list) so callers see a clear
API and accidental name collisions are avoided. Ensure you list each
function/class/constant you want to expose (or import submodules as namespaced
attributes) and update or add __all__ to reflect the exported names.
```python
if is_metal_target(target):
    assert execution_backend == 'torch', 'Currently metal target only support `tl.jit(execution_backend="torch")`'
```
Fix inconsistent backend naming
The assertion expects 'torch' backend but the public API and documentation use 'nvrtc'. This inconsistency is confusing.
The error message should clarify that Metal targets require the "nvrtc" backend (which internally uses torch for Metal):
```diff
 if is_metal_target(target):
-    assert execution_backend == 'torch', 'Currently metal target only support `tl.jit(execution_backend="torch")`'
+    assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'
```
+ assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'📝 Committable suggestion
```python
if is_metal_target(target):
    assert execution_backend == 'nvrtc', 'Currently metal target only supports `tl.jit(execution_backend="nvrtc")`'
```
🤖 Prompt for AI Agents
In tilelang/jit/__init__.py around lines 78-79, the assertion currently checks
for execution_backend == 'torch' but the public API/doc use 'nvrtc'; change the
assertion to require 'nvrtc' instead and update the assertion message to state
that Metal targets require the "nvrtc" backend (which internally uses torch) so
the message aligns with public API and avoids confusion.
```python
self.kernel_name = func_or_mod.__name__ + '_kernel'
self.verbose = verbose
```
Potential issue with kernel name generation.
The kernel name is generated by appending '_kernel' to the function name. This assumes func_or_mod has a __name__ attribute, which may not be true for all IRModule instances.
```diff
         self.kernel_global_source = kernel_global_source
-        self.kernel_name = func_or_mod.__name__ + '_kernel'
+        if hasattr(func_or_mod, '__name__'):
+            self.kernel_name = func_or_mod.__name__ + '_kernel'
+        elif hasattr(func_or_mod, 'get_global_vars'):
+            # Handle IRModule case
+            global_vars = func_or_mod.get_global_vars()
+            if global_vars:
+                self.kernel_name = global_vars[0].name_hint + '_kernel'
+            else:
+                raise ValueError("Cannot determine kernel name from IRModule")
+        else:
+            raise ValueError(f"Cannot determine kernel name from {type(func_or_mod)}")
         self.verbose = verbose
```

📝 Committable suggestion
```python
self.kernel_global_source = kernel_global_source
if hasattr(func_or_mod, '__name__'):
    self.kernel_name = func_or_mod.__name__ + '_kernel'
elif hasattr(func_or_mod, 'get_global_vars'):
    # Handle IRModule case
    global_vars = func_or_mod.get_global_vars()
    if global_vars:
        self.kernel_name = global_vars[0].name_hint + '_kernel'
    else:
        raise ValueError("Cannot determine kernel name from IRModule")
else:
    raise ValueError(f"Cannot determine kernel name from {type(func_or_mod)}")
self.verbose = verbose
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around lines 28-29, kernel_name is set by
blindly using func_or_mod.__name__ which can fail for IRModule or other objects
without a __name__; change to safely derive a base name: try
getattr(func_or_mod, "__name__", None), then getattr(func_or_mod, "name", None),
and if both are missing fall back to a stable fallback such as
f"{type(func_or_mod).__name__}_{id(func_or_mod):x}"; sanitize the base to be a
valid identifier if needed and then set self.kernel_name = f"{base}_kernel".
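One way to implement the fallback described above is a small pure-Python helper. This is a sketch; `derive_kernel_name` is a hypothetical name, not code from the PR:

```python
import re

def derive_kernel_name(func_or_mod) -> str:
    # Prefer __name__, then a .name attribute, then a stable synthetic base.
    base = getattr(func_or_mod, "__name__", None) or getattr(func_or_mod, "name", None)
    if base is None:
        base = f"{type(func_or_mod).__name__}_{id(func_or_mod):x}"
    # Sanitize to a valid Python identifier.
    base = re.sub(r"\W", "_", str(base))
    if base[0].isdigit():
        base = "_" + base
    return base + "_kernel"
```

The synthetic fallback keeps the adapter usable with objects that expose neither attribute, at the cost of a non-deterministic name across runs.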
```python
for var, func in device_mod.functions.items():
    assert var.name_hint == self.kernel_name
    thread_extent = func.attrs['thread_extent']
    for tag, extent in thread_extent.items():
        if "threadIdx" in tag:
            self.block_info["xyz".index(tag[-1])] = extent
        elif "blockIdx" in tag:
            self.grid_info["xyz".index(tag[-1])] = extent
    break
else:
    raise AssertionError(f'no kernel with name {func_or_mod.__name__}')
```
Fix critical issue: iteration over device module functions doesn't handle multiple functions correctly.
The loop breaks after processing the first function regardless of whether it matches kernel_name. This could silently process the wrong kernel if multiple functions exist.
```diff
 for var, func in device_mod.functions.items():
-    assert var.name_hint == self.kernel_name
-    thread_extent = func.attrs['thread_extent']
-    for tag, extent in thread_extent.items():
-        if "threadIdx" in tag:
-            self.block_info["xyz".index(tag[-1])] = extent
-        elif "blockIdx" in tag:
-            self.grid_info["xyz".index(tag[-1])] = extent
-    break
+    if var.name_hint == self.kernel_name:
+        thread_extent = func.attrs.get('thread_extent', {})
+        for tag, extent in thread_extent.items():
+            if "threadIdx" in tag:
+                self.block_info["xyz".index(tag[-1])] = extent
+            elif "blockIdx" in tag:
+                self.grid_info["xyz".index(tag[-1])] = extent
+        break
 else:
-    raise AssertionError(f'no kernel with name {func_or_mod.__name__}')
+    raise AssertionError(f'No kernel with name {self.kernel_name} found in device module')
```

📝 Committable suggestion
```python
for var, func in device_mod.functions.items():
    if var.name_hint == self.kernel_name:
        thread_extent = func.attrs.get('thread_extent', {})
        for tag, extent in thread_extent.items():
            if "threadIdx" in tag:
                self.block_info["xyz".index(tag[-1])] = extent
            elif "blockIdx" in tag:
                self.grid_info["xyz".index(tag[-1])] = extent
        break
else:
    raise AssertionError(f'No kernel with name {self.kernel_name} found in device module')
```
🧰 Tools
🪛 Ruff (0.12.2)
44-44: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around lines 34 to 45, the loop over
device_mod.functions currently asserts the first function has the desired
kernel_name and breaks unconditionally, which can process the wrong function
when multiple functions exist; change the logic to search for a function whose
var.name_hint equals self.kernel_name (don't assert immediately), extract its
thread_extent and populate block_info/grid_info, break when found, and after the
loop raise an AssertionError if no matching kernel was found (use
self.kernel_name in the error message) so only the correct kernel is processed.
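The fix above relies on Python's `for ... else` clause: the `else` body runs only when the loop completes without hitting `break`. A self-contained sketch of the search, with plain dicts standing in for the TVM device module and function attributes:

```python
def find_thread_extents(functions, kernel_name):
    """Locate the named kernel and split its thread extents into block/grid."""
    block, grid = [1, 1, 1], [1, 1, 1]
    for name, attrs in functions.items():
        if name == kernel_name:
            for tag, extent in attrs.get("thread_extent", {}).items():
                axis = "xyz".index(tag[-1])  # tags end in .x/.y/.z
                if "threadIdx" in tag:
                    block[axis] = extent
                elif "blockIdx" in tag:
                    grid[axis] = extent
            break
    else:
        # Runs only if the loop completed without `break`, i.e. no match.
        raise AssertionError(f"no kernel named {kernel_name!r} in device module")
    return block, grid
```

Unlike the original, this skips non-matching entries instead of asserting on the first one, so extra functions in the module are harmless.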
```python
if self._kernel is None:
    _kernel = getattr(torch.mps.compile_shader(self.kernel_global_source), self.kernel_name)
```
🛠️ Refactor suggestion
Add error handling for shader compilation.
The shader compilation could fail but there's no error handling. Also, the compiled shader might not contain the expected kernel.
```diff
-_kernel = getattr(torch.mps.compile_shader(self.kernel_global_source), self.kernel_name)
+try:
+    compiled_shader = torch.mps.compile_shader(self.kernel_global_source)
+    if not hasattr(compiled_shader, self.kernel_name):
+        raise AttributeError(f"Compiled shader does not contain kernel '{self.kernel_name}'")
+    _kernel = getattr(compiled_shader, self.kernel_name)
+except Exception as e:
+    raise RuntimeError(f"Failed to compile Metal shader: {e}")
```
+ raise RuntimeError(f"Failed to compile Metal shader: {e}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        try:
            compiled_shader = torch.mps.compile_shader(self.kernel_global_source)
            if not hasattr(compiled_shader, self.kernel_name):
                raise AttributeError(f"Compiled shader does not contain kernel '{self.kernel_name}'")
            _kernel = getattr(compiled_shader, self.kernel_name)
        except Exception as e:
            raise RuntimeError(f"Failed to compile Metal shader: {e}")
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/torch/metal.py around line 54, wrap the call to
torch.mps.compile_shader and subsequent attribute access in a try/except and
validate the returned object contains the expected kernel; specifically, catch
exceptions from compile_shader, raise or log a clear error that includes the
original exception and the kernel name/source context, and after compilation
check with hasattr or explicit lookup that the compiled object exposes
self.kernel_name, raising a descriptive error if missing so callers know
compilation succeeded but the kernel was not found.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tilelang/jit/adapter/__init__.py (1)
5-6: Remove the unused `# noqa: F401` (RUF100) on the new exports. Ruff flags these as unused because F401 isn’t enabled; just drop the directives.

```diff
-from .nvrtc import NVRTCKernelAdapter  # noqa: F401
-from .torch import MetalKernelAdapter  # noqa: F401
+from .nvrtc import NVRTCKernelAdapter
+from .torch import MetalKernelAdapter
```

Optional: make exports explicit to avoid needing noqa anywhere and document the public API.

```python
# add near the top of this module
__all__ = [
    "BaseKernelAdapter",
    "TorchDLPackKernelAdapter",
    "CtypesKernelAdapter",
    "CythonKernelAdapter",
    "NVRTCKernelAdapter",
    "MetalKernelAdapter",
]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 0eba92f6ad86fca30149d937543b01069784eafd and 7514cf545a9b9ec3d999fc1b1f6bd098397d19e1.
📒 Files selected for processing (2)
- tilelang/jit/adapter/__init__.py (1 hunks)
- tilelang/jit/adapter/torch/__init__.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tilelang/jit/adapter/torch/__init__.py
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/jit/adapter/__init__.py (2)
tilelang/jit/adapter/nvrtc/adapter.py (1)
- `NVRTCKernelAdapter` (30-246)

tilelang/jit/adapter/torch/metal.py (1)
MetalKernelAdapter(12-67)
🪛 Ruff (0.12.2)
tilelang/jit/adapter/__init__.py
5-5: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
6-6: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
Removed external library DLL names for macOS; found during tile-ai/tilelang#799. cc @LeiWang1999
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
setup.py (3)
345-347: Wheel may miss runtime libs: include `tilelang/lib/*.so` and `tilelang/lib/*.dylib` in package_data. Without this, copied libs might not land in wheels/sdists.

```diff
 package_data = {
-    "tilelang": ["py.typed", "*pyx"],
+    "tilelang": ["py.typed", "*pyx", "lib/*.so", "lib/*.dylib"],
 }
```

805-807: Do not pass CMAKE_CUDA_COMPILER unless CUDA is enabled. On Metal-only/mac builds, this inserts a bogus nvcc path and breaks configure.

```diff
-    if not USE_ROCM:
-        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
+    if USE_CUDA and CUDA_HOME:
+        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
```

826-839: Respect the USE_CUDA toggle in config.cmake generation. Current logic enables CUDA whenever CUDA_HOME is set, even if USE_CUDA=False.

```diff
-    elif CUDA_HOME:
+    elif USE_CUDA and CUDA_HOME:
         content_lines += [
             f"set(USE_CUDA {CUDA_HOME})",
             "set(USE_ROCM OFF)",
         ]
+    else:
+        content_lines += [
+            "set(USE_CUDA OFF)",
+            "set(USE_ROCM OFF)",
+        ]
```
🧹 Nitpick comments (9)
setup.py (9)
48-50: Harden Metal detection (Rosetta/process-arch pitfalls). Use Darwin + machine for reliability.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+IS_DARWIN = platform.system() == 'Darwin'
+MAYBE_METAL = IS_DARWIN and platform.machine() == 'arm64'
```

52-61: Defaulting USE_CUDA=True on Linux is user-hostile (it raises when CUDA_HOME is missing). Consider defaulting to False, or detecting from CUDA_HOME post-import, to avoid hard failures on non-GPU Linux dev boxes.

If you want a minimal tweak now, set the default to False:

```diff
-USE_CUDA = _read_bool_env("USE_CUDA", IS_LINUX and not USE_ROCM)
+USE_CUDA = _read_bool_env("USE_CUDA", False)
```

99-103: Tighten exception text (Ruff TRY003) and clarity. Shorten the message; keep the actionable content.

```diff
-    raise ValueError(
-        "CUDA support is enabled by default on linux if `USE_ROCM=False`,"
-        " but CUDA_HOME is not set or detected.")
+    raise ValueError("USE_CUDA=True but CUDA_HOME is not set or detected.")
```

105-108: Same as above: shorten/standardize the error message.

```diff
-    raise ValueError(
-        "Failed to automatically detect CUDA or ROCM installation. Please set the CUDA_HOME or ROCM_HOME environment variable manually (e.g., export CUDA_HOME=/usr/local/cuda or export ROCM_HOME=/opt/rocm)."
-    )
+    raise ValueError("Neither CUDA_HOME nor ROCM_HOME detected on Linux; set one explicitly.")
```

482-489: Guard dSYM deletion. Use an isdir check to avoid an accidental rmtree on files with matching names.

```diff
-    for dir in potential_dirs:
-        candidate = os.path.join(dir, item)
-        if os.path.exists(candidate):
-            shutil.rmtree(candidate)
-            break
+    for d in potential_dirs:
+        candidate = os.path.join(d, item)
+        if os.path.isdir(candidate):
+            shutil.rmtree(candidate)
+            break
```

664-664: Copy-from-glob likely no-ops in non-inplace builds. CMake outputs to extdir/build_temp, not CWD. Either search extdir/build_temp or drop this block (TileLangBuilPydCommand already handles copying).

270-273: Handle missing CMake gracefully (None from shutil.which). `os.path.exists(None)` raises a TypeError; raise a clear FileNotFoundError instead.

```diff
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
-        raise Exception("CMake is not installed, please install it first.")
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
+        raise FileNotFoundError("CMake is not installed or not on PATH.")
```

149-168: Make ROCm version detection robust if rocminfo is missing. Without try/except, subprocess errors abort the build even when the cmake file can be parsed.

```diff
 def get_rocm_version():
     """Get the ROCM version from rocminfo."""
-    rocm_output = subprocess.check_output(["rocminfo"], universal_newlines=True)
-    # Parse ROCM version from output
-    # Example output: ROCM version: x.y.z-...
-    match = re.search(r'ROCm Version: (\d+\.\d+\.\d+)', rocm_output)
-    if match:
-        return Version(match.group(1))
-    else:
+    try:
+        rocm_output = subprocess.check_output(["rocminfo"], universal_newlines=True)
+        match = re.search(r'ROCm Version: (\d+\.\d+\.\d+)', rocm_output)
+        if match:
+            return Version(match.group(1))
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        pass
     rocm_path = os.environ.get("ROCM_PATH", "/opt/rocm")
     rocm_version_file = os.path.join(rocm_path, "lib", "cmake", "rocm", "rocm-config-version.cmake")
     if os.path.exists(rocm_version_file):
         with open(rocm_version_file, "r") as f:
             content = f.read()
         match = re.search(r'set\(PACKAGE_VERSION "(\d+\.\d+\.\d+)"', content)
         if match:
             return Version(match.group(1))
     # return a default
     return Version("5.0.0")
```

882-895: Update PyPI metadata for Metal/macOS builds. Expose macOS in platforms/classifiers when USE_METAL is on; optionally add/remove the CUDA/ROCm environment classifiers based on the toggles.
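The toggle-driven metadata selection suggested above can be sketched as follows (an illustration only; `use_metal`/`use_cuda`/`use_rocm` mirror the setup.py toggles, and the helper name is hypothetical, not existing code):

```python
def select_platform_metadata(use_metal: bool, use_cuda: bool, use_rocm: bool):
    """Derive PyPI platforms/classifier entries from build toggles (hypothetical helper)."""
    platforms = []
    classifiers = ["Operating System :: OS Independent"]
    if use_cuda:
        platforms.append("Environment :: GPU :: NVIDIA CUDA")
    if use_rocm:
        platforms.append("Environment :: GPU :: AMD ROCm")
    if use_metal:
        # Advertise macOS when the Metal backend is built
        platforms.append("Operating System :: MacOS")
        classifiers.append("Operating System :: MacOS :: MacOS X")
    else:
        platforms.append("Operating System :: POSIX :: Linux")
    return platforms, classifiers
```

The same `setup(...)` call could then consume these lists instead of hard-coded values.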
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 7514cf545a9b9ec3d999fc1b1f6bd098397d19e1 and e1e18873a51c22fce0d9495e989c7253e871f99f.
📒 Files selected for processing (1)
setup.py(6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.12.2)
setup.py
100-102: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (1)
setup.py (1)
390-392: Linux-only rpath patch exit is correct. The early return avoids patchelf on macOS. LGTM.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
setup.py (3)
808-811: Fix unconditional CUDA compiler flag (breaks Metal/macOS and CUDA-less builds).
`-DCMAKE_CUDA_COMPILER` is appended whenever ROCm is disabled, even if CUDA is not requested or `CUDA_HOME` is unset. On macOS/Metal this can crash with a TypeError (joining a path with `None`), and on Linux CPU/ROCm builds it forces CUDA. Apply:

```diff
-    if not USE_ROCM:
-        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
+    # Only set CUDA compiler when CUDA is explicitly enabled and detected
+    if USE_CUDA and CUDA_HOME:
+        cmake_args.append(f"-DCMAKE_CUDA_COMPILER={os.path.join(CUDA_HOME, 'bin', 'nvcc')}")
```

667-676: Editable install: also copy .so on macOS. Python C-extensions on macOS commonly end with .so; the current logic only grabs .dylib on macOS, causing missing libs in editable mode.

```diff
-    files = glob.glob("*.so" if IS_LINUX else "*.dylib")
+    files = sorted(set(glob.glob("*.so") + glob.glob("*.dylib")))
```

269-277: Handle a missing CMake path; avoid a TypeError on `os.path.exists(None)`.

`shutil.which("cmake")` may return None; calling `os.path.exists(None)` raises a TypeError.

```diff
 def get_cmake_path() -> str:
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```
♻️ Duplicate comments (1)
setup.py (1)
36-44: Boolean env parsing still brittle (whitespace and yes/no not handled; empty string coerced to False). You addressed the earlier issue but missed `.strip()` and common synonyms, and you now treat an empty string as False rather than “unset → default”. Apply:

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ("1", "true", "on", "yes", "y"):
+        return True
+    if s in ("0", "false", "off", "no", "n"):
+        return False
+    return default
```
🧹 Nitpick comments (4)
setup.py (4)
51-53: Make Metal default detection more robust. Use an explicit Darwin + machine check; `platform.mac_ver()[2]` is brittle and can vary.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = (platform.system() == 'Darwin' and platform.machine() == 'arm64')
```

829-843: When enabling Metal, explicitly disable CUDA/ROCm in config.cmake for clarity. This prevents an accidental dual-backend config if upstream defaults change.

```diff
-    if USE_METAL:
-        content_lines += [
-            "set(USE_METAL ON)",
-        ]
+    if USE_METAL:
+        content_lines += [
+            "set(USE_METAL ON)",
+            "set(USE_CUDA OFF)",
+            "set(USE_ROCM OFF)",
+        ]
```

893-898: Reflect macOS support in the Trove classifiers. Now that Metal/macOS is supported, advertise it.

```diff
     "Programming Language :: Python :: 3.8",
     "License :: OSI Approved :: MIT License",
-    "Operating System :: OS Independent",
+    "Operating System :: OS Independent",
+    "Operating System :: MacOS :: MacOS X",
     "Intended Audience :: Developers",
     "Intended Audience :: Science/Research",
```

102-111: Minor: tighten error messages and conditions (style). Long exception messages trigger Ruff TRY003; consider shorter texts or helper constants. Also consider allowing Linux CPU-only builds in the future by gating the “require CUDA/ROCM” check behind an explicit flag.
If you want, I can open a follow-up to align linter expectations and make the Linux GPU requirement opt-in via an env (e.g., REQUIRE_GPU=1).
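The suggested opt-in gate could look roughly like this (the `REQUIRE_GPU` variable is the reviewer's proposal, not an existing setup.py flag; the helper is a sketch, not the implementation):

```python
import os

def gpu_toolchain_required(default: bool = False) -> bool:
    """Return True only when the user explicitly opts in via REQUIRE_GPU.

    Hypothetical helper sketching the proposed follow-up; not in setup.py today.
    """
    val = os.environ.get("REQUIRE_GPU")
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "on", "yes")
```

setup.py would then raise on a missing CUDA/ROCm install only when this returns True, letting CPU-only Linux builds proceed.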
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e1e18873a51c22fce0d9495e989c7253e871f99f and cc91c3e22afffbb80f1dbec199fd55d91748adc5.
📒 Files selected for processing (1)
setup.py(6 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
tilelang/carver/arch/metal.py (1)
17-20: Sort `__all__` to satisfy Ruff RUF022.

```diff
 __all__ = [
-    'is_metal_arch',
-    'METAL',
+    'METAL',
+    'is_metal_arch',
 ]
```

.github/workflows/metal_ci.yml (2)
12-14: Tighten permissions and fix the trailing space.

`contents: write` isn’t needed; this also removes the trailing space flagged by yamllint.

```diff
 permissions:
-  contents: write
+  contents: read
```
96-97: Avoid double installation of the project. You install in “Ensure venv” and again in “Install project.” Keep one to reduce time and flakiness.

```diff
-          pip install .
-          touch "$MARKER"
+          touch "$MARKER"
```

(Keep the dedicated “Install project” step.)
Also applies to: 100-104
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between cc91c3e22afffbb80f1dbec199fd55d91748adc5 and 12ca66077563253c09d1530c948b9fcf28cbbdd1.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/carver/arch/metal.py (2)
tilelang/language/ast/ir.py (1)
- `target` (1682-1713)

tilelang/carver/arch/arch_base.py (1)

- `TileDevice` (4-37)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
tilelang/carver/arch/metal.py
17-20: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
🔇 Additional comments (2)
tilelang/carver/arch/metal.py (1)
6-7: Helper reads well.
`is_metal_arch` is straightforward and correct.

.github/workflows/metal_ci.yml (1)
1-1: Rename workflow and use Metal CMake flag.
CMakeLists.txt contains a USE_METAL guard — switch `-DUSE_ROCM=ON` to `-DUSE_METAL=ON` and rename the workflow.

```diff
-name: CI Test on AMD
+name: Metal CI
```

```diff
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
+        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
```python
    def __init__(self, target: Target | str):
        if isinstance(target, str):
            target = Target(target)
        self.target = target
```
Call the base initializer to populate TileDevice fields.
Without super().__init__(), METAL instances won't have the base attributes (reg_cap, smem_cap, etc.), which can break consumers expecting them.
Apply:
```diff
 class METAL(TileDevice):
     def __init__(self, target: Target | str):
+        super().__init__()
         if isinstance(target, str):
             target = Target(target)
         self.target = target
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def __init__(self, target: Target | str):
        super().__init__()
        if isinstance(target, str):
            target = Target(target)
        self.target = target
```
🤖 Prompt for AI Agents
In tilelang/carver/arch/metal.py around lines 11 to 15, the METAL class __init__
does not call the base class initializer so base TileDevice fields (e.g.,
reg_cap, smem_cap) are not populated; update the constructor to call
super().__init__(target) (after converting a string target to Target if needed)
so the parent initializer runs and base attributes are set, ensuring the
instance has the expected fields.
Removed external library DLL names for macOS; found during tile-ai/tilelang#799. cc @LeiWang1999
Actionable comments posted: 0
♻️ Duplicate comments (8)
tilelang/carver/arch/metal.py (1)
11-14: Call `TileDevice.__init__` to populate base fields (duplicate of prior feedback). Without `super().__init__()`, instances miss required attributes (reg_cap, smem_cap, etc.), leading to runtime errors when accessed. Apply:

```diff
 class METAL(TileDevice):
     def __init__(self, target: Target | str):
+        super().__init__()
         if isinstance(target, str):
             target = Target(target)
         self.target = target
```

.github/workflows/metal_ci.yml (7)
21-25: Update setup-python to v5 (v2 is deprecated on GH runners). actionlint flags v2 as too old; move to v5.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

29-31: Fix hashing on macOS: use `shasum -a 256`. `sha256sum` isn’t available on macOS.

```diff
-          REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
           MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"
```

40-43: Remove the invalid pip flag `--no-user`. pip doesn’t support `--no-user` here (even in venvs).

```diff
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
-          [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+          [[ -f requirements-test.txt ]] && \
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
```

75-79: Update setup-python to v5 here as well.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

83-85: Use a stable requirements hash (shasum) instead of `cat`. `cat` doesn’t produce a fixed-length digest and weakens cache keys.

```diff
-          REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
           MARKER="${{ runner.tool_cache }}/.venv_marker_${{ env.PYTHON_VERSION }}_${REQS_HASH:0:8}"
```

93-97: Remove the invalid `--no-user` and avoid double install. Fix the flags; also consider skipping the early install since you reinstall later.

```diff
           source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
           [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
-          pip install . --no-user
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
+          # Defer project install to the dedicated step below
```

100-104: Remove `--no-user` in the wheel install step.

```diff
-          pip install . --no-user -v
+          pip install . -v
```
🧹 Nitpick comments (6)
tilelang/carver/arch/metal.py (3)
17-20: Sort `__all__` to satisfy Ruff RUF022.

```diff
 __all__ = [
-    'is_metal_arch',
-    'METAL',
+    'METAL',
+    'is_metal_arch',
 ]
```

11-14: Optionally validate that the Target kind is metal. This prevents accidental misuse with a non-Metal TVM target.

```diff
     def __init__(self, target: Target | str):
         super().__init__()
         if isinstance(target, str):
             target = Target(target)
+        # Ensure we were given a Metal target
+        if getattr(getattr(target, "kind", None), "name", None) != "metal":
+            raise ValueError(f"METAL requires a TVM Target(kind='metal'), got: {getattr(getattr(target, 'kind', None), 'name', None)}")
         self.target = target
```

Please confirm the TVM version exposes `target.kind.name`; if not, we can adapt the check.

9-15: Confirm whether `get_avaliable_tensorintrin_shapes` is required for Metal paths. If it is invoked downstream, not overriding it will raise NotImplementedError from the base class.

Proposed minimal safe stub:

```python
class METAL(TileDevice):
    def __init__(self, target: Target | str):
        super().__init__()
        if isinstance(target, str):
            target = Target(target)
        self.target = target

    def get_avaliable_tensorintrin_shapes(self):
        # Metal/MPS: no tensor intrinsics exposed yet
        return []
```

.github/workflows/metal_ci.yml (3)

12-14: Trim the trailing space and drop the unnecessary write permission. Set minimal perms and remove the trailing space flagged by yamllint.

```diff
 permissions:
-  contents: write
+  contents: read
```

10-10: Pin the runner to Apple Silicon if you require MPS/Metal. macos-latest can drift; pin to macos-14 (Apple Silicon) to ensure MPS availability.

```diff
-    runs-on: [macos-latest]
+    runs-on: macos-14
```

(Apply to both jobs.)
Also applies to: 63-63
26-45: Rethink venv “persistence”; use actions/cache instead. runner.tool_cache doesn’t persist across jobs/runs on hosted macOS. Use actions/cache keyed by the Python version and the requirements hash.

Example step to add before “Ensure venv”:

```yaml
- name: Cache venv
  uses: actions/cache@v4
  with:
    path: ${{ runner.tool_cache }}/${{ env.VENV_DIR }}
    key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements*.txt', 'pyproject.toml', 'setup.cfg', 'setup.py') }}
```

Then create the venv if missing and install deps; drop the custom MARKER logic.
Also applies to: 80-98
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 12ca66077563253c09d1530c948b9fcf28cbbdd1 and d546487de7e3a2031c9106e89210669f666bbccd.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- tilelang/carver/arch/metal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/carver/arch/metal.py (2)
tilelang/carver/arch/arch_base.py (1)
- `TileDevice` (4-37)

tilelang/carver/template/base.py (1)

- `arch` (152-159)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
tilelang/carver/arch/metal.py
17-20: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (3)
tilelang/carver/arch/metal.py (1)
5-6: Helper looks good.

.github/workflows/metal_ci.yml (2)
51-51: Drop the ROCm flag in the Metal CI configure step. This job targets Metal; forcing `-DUSE_ROCM=ON` on macOS is likely wrong and may fail. Prefer no ROCm flag (or a Metal-specific one if your CMake supports it).

```diff
-        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_ROCM=ON; cd ..
+        cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON; cd ..
```

If your build supports it, consider adding `-DUSE_METAL=ON`.
105-111: pytest-timeout declared — no change required. pytest-timeout is listed in requirements-test.txt (line 16); `--timeout` is supported, so keep the workflow as-is.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
setup.py (2)
269-276: Fix a possible TypeError when CMake isn’t installed. shutil.which can return None; os.path.exists(None) crashes.

```diff
 def get_cmake_path() -> str:
     """Return the path to the CMake compiler. """
-    # found which cmake is used
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```

829-844: Ensure CUDA is disabled when building for Metal. Without explicitly setting USE_CUDA OFF here, CMake template defaults could enable CUDA inadvertently.

```diff
     if USE_METAL:
         content_lines += [
             "set(USE_METAL ON)",
             "set(USE_ROCM OFF)",
+            "set(USE_CUDA OFF)",
         ]
```
♻️ Duplicate comments (8)
setup.py (1)
36-43: Boolean env parsing still wrong; empty/unknown values fall back to the default, and the `''` branch is unreachable. Normalize and handle both true and false tokens (including the empty string). Current logic can flip GPU toggles unexpectedly.

Apply:

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
```

.github/workflows/metal_ci.yml (7)

93-97: Remove `--no-user` and avoid duplicate installs. Install requirements here; leave the project install to the dedicated step below.

```diff
           source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
           [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
-          pip install . --no-user
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
+          # Defer project install to the dedicated step below
```
21-25: Upgrade setup-python to v5. v2 is deprecated and fails on current runners.

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

76-79: Upgrade setup-python to v5 (second job).

```diff
-      - name: Set up Python
-        uses: actions/setup-python@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
```

29-31: `sha256sum` isn’t available on macOS runners; use `shasum -a 256`.

```diff
-          REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

83-85: Consistently hash requirements; don’t use `cat`.

```diff
-          REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+          REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

40-43: Drop the invalid pip flag `--no-user`. This option doesn’t exist and can fail even inside venvs.

```diff
-          python -m pip install --upgrade pip --no-user
+          python -m pip install --upgrade pip
-          [[ -f requirements-test.txt ]] && \
-            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+          [[ -f requirements-test.txt ]] && \
+            PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
```

101-104: Remove `--no-user` (wheel install step).

```diff
-          pip install . --no-user -v
+          pip install . -v
```
🧹 Nitpick comments (5)
setup.py (4)
51-53: More robust Metal default detection. Prefer platform checks that don’t rely on mac_ver quirks.

```diff
-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = (platform.system() == 'Darwin' and platform.machine() == 'arm64')
```

102-106: Shorten the exception message (Ruff TRY003) and keep details in logs. Long inline messages trigger TRY003 and add noise.

```diff
-if USE_CUDA and not CUDA_HOME:
-    raise ValueError(
-        "CUDA support is enabled by default on linux if `USE_ROCM=False`, but CUDA_HOME is not set or detected.")
+if USE_CUDA and not CUDA_HOME:
+    raise ValueError("CUDA enabled but CUDA_HOME not set/detected.")
```

463-484: Use logger.warning instead of “WARNING:” via info. This preserves severity and integrates with log filters.

```diff
-    logger.info(f"WARNING: {item} not found in any expected directories!")
+    logger.warning(f"{item} not found in any expected directories!")
```

886-889: Update metadata to reflect macOS/Metal support. Expose the correct OS in metadata; remove the Linux-only bias when targeting Metal.

```diff
-    platforms=[
-        "Environment :: GPU :: NVIDIA CUDA" if not USE_ROCM else "Environment :: GPU :: AMD ROCm",
-        "Operating System :: POSIX :: Linux",
-    ],
+    platforms=[
+        ("Environment :: GPU :: NVIDIA CUDA" if USE_CUDA else
+         "Environment :: GPU :: AMD ROCm" if USE_ROCM else
+         "Environment :: GPU"),
+        "Operating System :: MacOS :: MacOS X" if USE_METAL else "Operating System :: POSIX :: Linux",
+    ],
```

```diff
-    "Operating System :: OS Independent",
+    ("Operating System :: MacOS :: MacOS X" if USE_METAL else "Operating System :: POSIX :: Linux"),
```

Also applies to: 893-900
.github/workflows/metal_ci.yml (1)
13-13: Trailing space. Minor YAML lint fix (removes the trailing space after `contents: write`).

```diff
-  contents: write 
+  contents: write
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between d546487de7e3a2031c9106e89210669f666bbccd and 5b306179cb8b24f66118ff9847bad7a1847e82ff.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: format-check
🔇 Additional comments (1)
setup.py (1)
108-111: Linux-only accelerator requirement is fine. The guard correctly avoids enforcing CUDA/ROCm on macOS/Metal builds.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.github/workflows/ci.yml (2)
22-25: Upgrade to setup-python v5.
`actions/setup-python@v2` is deprecated and flagged by actionlint.

```diff
- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
```

Also applies to: 78-81
40-45: Remove invalid `pip install --no-user`.
`--no-user` is not a valid pip flag (inside venvs it's unnecessary anyway).

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ pip install flash_attn==2.5.8 --no-build-isolation

- pip install . --no-user
+ pip install .

- pip install . --no-user -v
+ pip install . -v
```

Also applies to: 96-101, 107-107
.github/workflows/amd_ci.yml (1)
23-26: Upgrade to setup-python v5. v2 is deprecated; actionlint will flag this.
```diff
- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
```

Also applies to: 77-80
♻️ Duplicate comments (3)
.github/workflows/metal_ci.yml (3)
21-25: Update setup-python to v5. Same as the earlier bot comment; apply in both jobs.
```diff
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
    with:
      python-version: ${{ env.PYTHON_VERSION }}
```

Also applies to: 76-79
29-31: Fix hashing on macOS (no `sha256sum`). Use `shasum -a 256` (portable on macOS). Apply to both blocks.

```diff
- REQS_HASH=$(sha256sum requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
+ REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

```diff
- REQS_HASH=$(cat requirements-test.txt 2>/dev/null || true)
+ REQS_HASH=$(shasum -a 256 requirements-test.txt 2>/dev/null | awk '{print $1}' || echo "no_requirements")
```

Also applies to: 83-85
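A cross-runner wrapper covering both cases could look like the sketch below (`hash_file` is a hypothetical helper name, not something in the workflow; it prefers GNU coreutils' `sha256sum` on Linux and falls back to `shasum`, which ships with macOS):

```shell
#!/bin/sh
# Portable SHA-256 of a file across Linux and macOS runners.
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    # GNU coreutils (typical Linux runner)
    sha256sum "$1" | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    # Perl shasum (ships with macOS)
    shasum -a 256 "$1" | awk '{print $1}'
  else
    echo "no_hash_tool"
    return 1
  fi
}

printf 'hello\n' > /tmp/reqs.txt
hash_file /tmp/reqs.txt
```

Either tool produces the same digest, so the cache key stays stable when a job migrates between Linux and macOS runners.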
40-44: Remove invalid `pip install --no-user`. Not a valid flag and unnecessary in venvs. Apply in all occurrences.
```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install . --no-user -v
+ pip install . -v
```

Also applies to: 93-96, 102-102
🧹 Nitpick comments (4)
.github/workflows/ci.yml (1)
12-14: Tighten permissions. Format check doesn't push commits; prefer read-only.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```

.github/workflows/metal_ci.yml (3)
13-13: Trim trailing space. YAMLlint flags trailing spaces.
```diff
- contents: write 
+ contents: write
```
10-10: Target Apple Silicon explicitly for MPS. To avoid Intel mac runners without MPS, prefer macOS 14 (Apple Silicon).
```diff
- runs-on: [macos-latest]
+ runs-on: [macos-14]
```

Please confirm your org has access to macOS-14 ARM runners; otherwise keep latest and skip MPS tests when unavailable.
Also applies to: 63-63
104-109: Add a quick MPS sanity check before tests. Fail fast if MPS isn't available.
```diff
  - name: Run metal test
    run: |
      source "${{ runner.tool_cache }}/${{ env.VENV_DIR }}/bin/activate"
+     python - <<'PY'
+     import sys
+     try:
+         import torch
+         print("torch:", torch.__version__, "MPS available:", torch.backends.mps.is_available())
+         assert torch.backends.mps.is_available(), "MPS not available on this runner"
+     except Exception as e:
+         print("Precheck failed:", e, file=sys.stderr)
+         sys.exit(1)
+     PY
      cd testing/python
      unset PYTHONPATH
      python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 5b306179cb8b24f66118ff9847bad7a1847e82ff and 65ffe111a4df5414ce44a0a14c5113c6491e35c8.
📒 Files selected for processing (3)
- `.github/workflows/amd_ci.yml` (1 hunks)
- `.github/workflows/ci.yml` (1 hunks)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
22-22: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
76-76: the runner of "actions/setup-python@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-test-metal
```shell
git submodule update --init --recursive --recommend-shallow
mkdir -p build
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
🛠️ Refactor suggestion
Guard --recommend-shallow for older Git on macOS.
Add a fallback to keep runners with older Git working.
```diff
- git submodule update --init --recursive --recommend-shallow
+ if git submodule update -h 2>&1 | grep -q -- '--recommend-shallow'; then
+   git submodule update --init --recursive --recommend-shallow
+ else
+   git submodule update --init --recursive
+ fi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
if git submodule update -h 2>&1 | grep -q -- '--recommend-shallow'; then
  git submodule update --init --recursive --recommend-shallow
else
  git submodule update --init --recursive
fi
mkdir -p build
cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
```
Actionable comments posted: 1
🧹 Nitpick comments (5)
.github/workflows/metal_ci.yml (5)
16-16: Remove trailing whitespace (yamllint violation). Line 16 has a trailing space after "write".
```diff
- contents: write 
+ contents: write
```
40-50: Make format-check fail deterministically on formatting drift. Currently you fail only when format.sh exits non-zero; if the script exits 0 but emits the sentinel text, the job passes. Harden the logic.
```diff
- if ! output=$(./format.sh 2>&1); then
-   echo "------------------------------------"
-   echo "message:"
-   echo "$output"
-   printf '%s\n' "$output" | grep "Please review and stage the changes."
-   echo "------------------------------------"
-   exit 1
- fi
+ set -euo pipefail
+ status=0
+ output="$(./format.sh 2>&1)" || status=$?
+ if [ "$status" -ne 0 ] || printf '%s\n' "$output" | grep -q "Please review and stage the changes."; then
+   echo "------------------------------------"
+   echo "message:"
+   printf '%s\n' "$output"
+   echo "------------------------------------"
+   exit 1
+ fi
```
13-13: Pin to Apple Silicon runners for MPS/Metal determinism. To ensure MPS availability and consistent behavior, prefer macOS 14 (Apple Silicon) explicitly rather than the moving alias macos-latest.
```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Apply to both jobs after confirming that your Metal/MPS tests require Apple Silicon on GitHub-hosted runners.
Also applies to: 54-54
15-17: Use least-privilege permissions (write not needed). This job doesn't push changes; read is sufficient.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
82-86: Use working-directory instead of cd; keep PYTHONPATH unset. Cleaner, avoids subshell state and makes logs clearer.
```diff
- - name: Run metal test
-   run: |
-     cd testing/python
-     unset PYTHONPATH
-     python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
+ - name: Run metal test
+   working-directory: testing/python
+   run: |
+     unset PYTHONPATH
+     python -m pytest -k metal -v -r fE --durations=0 --timeout=3600
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 65ffe111a4df5414ce44a0a14c5113c6491e35c8 and ca6546ac34e89cde32673641f20a0aca904533c8.
📒 Files selected for processing (1)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
28-28: property "os" is not defined in object type {}
(expression)
68-68: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 16-16: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
Actionable comments posted: 0
♻️ Duplicate comments (1)
.github/workflows/metal_ci.yml (1)
63-63: Fix invalid cache key reference to matrix.os (actionlint failure). Replace undefined `matrix.os` with `runner.os` and include `runner.arch` for uniqueness.
```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (5)
.github/workflows/metal_ci.yml (5)
16-16: Remove trailing space (yamllint).

```diff
- contents: write 
+ contents: write
```
15-17: Scope down permissions to least privilege. This job doesn't appear to need write. Prefer read.
```diff
  permissions:
-   contents: write
+   contents: read
```
33-46: Make format check fail on detected changes even if the script exits 0. Currently failure only triggers when format.sh returns non-zero. Also fail on the marker text and on any git diff.
```diff
  - name: Run format check
    run: |
-     source .venv/bin/activate
-     mkdir -p build
-     cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON; cd ..
-     if ! output=$(./format.sh 2>&1); then
-       echo "------------------------------------"
-       echo "message:"
-       echo "$output"
-       printf '%s\n' "$output" | grep "Please review and stage the changes."
-       echo "------------------------------------"
-       exit 1
-     fi
-     rm -rf build
+     set -euo pipefail
+     source .venv/bin/activate
+     mkdir -p build
+     (cd build && cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_METAL=ON)
+     output=$(./format.sh 2>&1) || true
+     echo "------------------------------------"
+     echo "$output"
+     echo "------------------------------------"
+     if printf '%s\n' "$output" | grep -q "Please review and stage the changes."; then
+       exit 1
+     fi
+     if ! git diff --quiet; then
+       git --no-pager diff
+       exit 1
+     fi
+     rm -rf build
```
13-13: Consider pinning to macos-14 for stable Apple Silicon/MPS runners. Reduces surprises from future macos-latest changes.

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Also applies to: 49-49
- runs-on: [macos-latest] + runs-on: macos-14Also applies to: 49-49
77-81: Optional: print ccache stats after tests.Quick visibility into cache effectiveness.
- name: Run metal test run: | cd testing/python unset PYTHONPATH python -m pytest -k metal -v -r fE --durations=0 --timeout=3600 + + - name: ccache stats + run: ccache -s || true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between ca6546ac34e89cde32673641f20a0aca904533c8 and e9cd15f744d1cb61534c907cad248305982953e0.
📒 Files selected for processing (1)
- `.github/workflows/metal_ci.yml` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
63-63: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 16-16: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
Force-pushed: 9f7474c to 98336a3 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
.github/workflows/metal_ci.yml (1)
64-64: Fix cache key: matrix.os is undefined in this workflow
Using `matrix.os` without a matrix breaks actionlint and can affect cache keys. Use the runner context.

```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (3)
.github/workflows/metal_ci.yml (3)
14-14: Trim trailing whitespace (yamllint error). There's a trailing space after `write` flagged by YAMLlint.

```diff
- contents: write 
+ contents: write
```
11-11: Pin macOS runner to Apple Silicon image for stable MPS/Metal
`macos-latest` can shift. Pin to `macos-14` (or your validated version) to ensure consistent MPS/Metal availability.

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

(Apply to both jobs.)
Also applies to: 47-47
13-15: Restrict permissions to least privilege
This job doesn't push changes; `contents: read` suffices.

```diff
  permissions:
-   contents: write
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between e9cd15f744d1cb61534c907cad248305982953e0 and 98336a39372308bb8964379727b6155dc665cdb3.
📒 Files selected for processing (2)
- `.github/workflows/metal_ci.yml` (1 hunks)
- `format.sh` (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
64-64: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 14-14: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
Actionable comments posted: 2
🧹 Nitpick comments (3)
.github/workflows/ci.yml (3)
124-125: Stabilize Metal job: pin Apple Silicon runner and preflight MPS availability. To avoid drift if `macos-latest` changes, pin to an Apple Silicon image (e.g., macos-14). Add a quick MPS check to fail fast with a clear message if MPS isn't available.
Apply:

```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```

Insert before tests:
```diff
+ - name: Verify MPS availability
+   run: |
+     python - <<'PY'
+     import sys
+     try:
+         import torch
+         ok = getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
+     except Exception:
+         ok = False
+     print(f"MPS available: {ok}")
+     sys.exit(0 if ok else 1)
+     PY
```

Also applies to: 156-160
133-136: Improve checkout completeness for submodules on macOS. Ensure the whole history is available when submodules/versions require it.
Apply:
```diff
  with:
    submodules: recursive
+   fetch-depth: 0
```
12-14: Least-privilege: format-check likely only needs read permission. Unless you auto-commit formatting, drop to `contents: read`.

Apply:

```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 98336a39372308bb8964379727b6155dc665cdb3 and fd8a47bf287afd3a2435815e3b9370d4e2ec4d49.
📒 Files selected for processing (1)
- `.github/workflows/ci.yml` (2 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/ci.yml
141-141: property "os" is not defined in object type {}
(expression)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: format-check
- GitHub Check: bot-task
```diff
@@ -7,7 +7,7 @@ env:

 jobs:
   format-check:
-    runs-on: [self-hosted, nvidia]
+    runs-on: [ubuntu-latest]
```
Make format-check runner GPU-agnostic: drop flash-attn build and don’t force CUDA in CMake.
On ubuntu-latest, building flash_attn commonly fails (no CUDA toolchain). Also forcing -DUSE_CUDA=ON can make CMake error out on non-GPU runners. These can break formatting CI unrelated to GPU.
Apply:
```diff
- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ # GPU-specific deps are not needed for format check; avoid fragile builds
+ true
```

```diff
- cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+ # Generate compile_commands.json without assuming CUDA
+ cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=OFF; cd ..
```

Also applies to: 41-45, 53-54
🤖 Prompt for AI Agents
.github/workflows/ci.yml around line 10 (and also update the same changes at
lines 41-45 and 53-54): the format-check runner is currently tied to GPU tooling
— remove the step that builds flash-attn and stop forcing CUDA in CMake for this
runner; update the job matrix/steps so the format-check job does not invoke
building flash-attn and remove or conditionally set any -DUSE_CUDA=ON CMake
flags (only enable CUDA on GPU-capable jobs), ensuring the workflow uses
CPU-only defaults so formatting CI will run on ubuntu-latest without requiring a
CUDA toolchain.
```yaml
uses: hendrikmuhs/ccache-action@v1.2
with:
  create-symlink: true
  key: ${{ github.job }}-${{ matrix.os }}
```
Fix ccache key: matrix.os is undefined (will fail at runtime).
matrix.os isn’t defined in this job (no matrix), which actionlint already flags. Use runner.os and include Python/version and requirements hash to improve cache hits.
Apply:
```diff
  with:
    create-symlink: true
-   key: ${{ github.job }}-${{ matrix.os }}
+   key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-py${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements-*.txt') }}
+   restore-keys: |
+     ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
uses: hendrikmuhs/ccache-action@v1.2
with:
  create-symlink: true
  key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-py${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements-*.txt') }}
  restore-keys: |
    ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```
🧰 Tools
🪛 actionlint (1.7.7)
141-141: property "os" is not defined in object type {}
(expression)
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around lines 138 to 142 the ccache key uses
matrix.os which is undefined and will fail; replace matrix.os with runner.os and
expand the key to include stable identifiers like the Python version and a hash
of requirements (or lockfile) to improve cache correctness and hit rate — e.g.
construct the key as a combination of github.job, runner.os, the python-version
variable, and a requirements.txt (or poetry/poetry.lock) hash so the cache is
unique per OS/Python/deps and will not error at runtime.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/amd_ci.yml (1)
64-120: Remove invalid pip flag and modernize Python setup
- pip --no-user is invalid and will fail; remove it (both install locations).
- Update actions/setup-python to v5 in this job too.
```diff
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
```

.github/workflows/metal_ci.yml (1)
88-94: Install the built wheel explicitly. Avoid picking up sdists (which could rebuild without USE_METAL). Install only the wheel.
```diff
- uv pip install dist/*
+ uv pip install dist/*.whl
```
♻️ Duplicate comments (2)
.github/workflows/metal_ci.yml (2)
82-87: Don't source `.venv` with setup-uv; ensure the Metal build flag is set. setup-uv already activates the env; `.venv/bin/activate` may not exist and will fail. Also ensure Metal is enabled during build.

```diff
- source .venv/bin/activate
- uv pip install build wheel ninja cmake
- python -mbuild -w --no-isolation
+ uv pip install build wheel ninja cmake
+ USE_METAL=1 python -m build -w --no-isolation
```
65-70: Fix undefined `matrix.os` in ccache key (actionlint). `matrix.os` isn't defined; use the runner context to keep keys valid and unique.
```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
```
🧹 Nitpick comments (4)
setup.py (1)
102-106: Improve error message clarity. The error message for missing CUDA_HOME uses a backslash for line continuation which could be confusing. Consider using parentheses for cleaner multi-line strings.
Apply this diff to improve readability:
```diff
- "CUDA support is enabled by default on linux if `USE_ROCM=False`," \
- " but CUDA_HOME is not set or detected.")
+ ("CUDA support is enabled by default on linux if `USE_ROCM=False`, "
+  "but CUDA_HOME is not set or detected."))
```

.github/workflows/amd_ci.yml (1)
9-62: Gating with if: false — consider a toggle and update setup-python
- Replace hardcoded if: false with a single env/input (e.g., ENABLE_AMD_CI) so both jobs can be enabled together without editing YAML.
- Update actions/setup-python from v2 to v5 to avoid deprecation issues.
```diff
  format-check:
-   if: false
+   if: ${{ env.ENABLE_AMD_CI == 'true' }}
@@
  - name: Set up Python
-   uses: actions/setup-python@v2
+   uses: actions/setup-python@v5
```

.github/workflows/metal_ci.yml (2)
10-11: Pin runner to Apple Silicon. Metal/MPS tests may require Apple Silicon. Prefer macos-14 (arm64) over macos-latest to avoid Intel runners.
```diff
- runs-on: [macos-latest]
+ runs-on: macos-14
```
12-14: Tighten permissions and fix trailing space. Use least-privilege and remove the trailing space flagged by yamllint.
```diff
- permissions:
-   contents: write
+ permissions:
+   contents: read
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between fd8a47bf287afd3a2435815e3b9370d4e2ec4d49 and 32edb2446da0b9e707397767677eb68b032d9962.
📒 Files selected for processing (4)
- `.github/workflows/amd_ci.yml` (2 hunks)
- `.github/workflows/ci.yml` (2 hunks)
- `.github/workflows/metal_ci.yml` (1 hunks)
- `setup.py` (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/ci.yml
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-metal
- GitHub Check: bot-task
🔇 Additional comments (9)
setup.py (8)
36-43: LGTM! Boolean environment parsing is now more robust. The updated `_read_bool_env` function correctly handles both truthy and falsy values explicitly, including empty strings as false. This prevents the previous issue where explicit false values like "0" or "false" would incorrectly return the default.
65-81: LGTM! Platform-specific library naming is correctly handled. The prebuild items lists are properly differentiated between Linux (`.so`) and macOS (`.dylib`) shared libraries, with appropriate deletion items for macOS debug symbols.
614-622: LGTM! The CMakeExtension constructor is properly updated. The addition of the `**kwargs` parameter allows for flexible extension configuration while maintaining backward compatibility.
669-669: LGTM! Platform-specific file globbing is correctly implemented. The dynamic selection between `.so` and `.dylib` files based on the platform is appropriate for cross-platform compatibility.
700-710: Document the Metal-specific Cython compilation issue. The comment explains the issue with Cython compilation on Metal/macOS requiring libpython linkage. This is a reasonable workaround until the issue is investigated further.
761-763: Document the macOS Cython linking issue. The FIXME comment correctly identifies that aarch64 Darwin needs additional Python library linking. This is consistent with the earlier Metal-specific handling.
822-857: LGTM! CMake configuration logic properly handles the Metal backend. The Metal-first configuration approach is well-implemented:

- Metal takes precedence when `USE_METAL` is enabled
- Proper fallback to ROCM/CUDA backends
- Correct mutual exclusion of backends
487-494: LGTM! Cleanup logic for platform-specific artifacts is properly implemented. The loop handles deletion of macOS-specific build artifacts (like `.dSYM` files) that need to be removed during the build process.

.github/workflows/metal_ci.yml (1)
35-49: Format step looks good. CMake prep with USE_METAL=ON and clean failure output are clear.
```diff
 IS_LINUX = platform.system() == 'Linux'
+MAYBE_METAL = platform.mac_ver()[2] == 'arm64'

 # Add LLVM control environment variable
-USE_LLVM = os.environ.get("USE_LLVM", "False").lower() == "true"
+USE_LLVM = _read_bool_env('USE_LLVM')
 # Add ROCM control environment variable
-USE_ROCM = os.environ.get("USE_ROCM", "False").lower() == "true"
+USE_ROCM = _read_bool_env("USE_ROCM")
+# Add ROCM control environment variable
+USE_METAL = _read_bool_env("USE_METAL", MAYBE_METAL)
+USE_CUDA = _read_bool_env("USE_CUDA", IS_LINUX and not USE_ROCM)
 # Build with Debug mode
```
💡 Verification agent
🧩 Analysis chain
Verify Metal availability detection logic.
The MAYBE_METAL variable checks for arm64 architecture on macOS, but this may not be sufficient for determining Metal support availability. Consider if additional checks are needed for Metal framework availability.
🏁 Script executed:
```shell
#!/bin/bash
# Description: Check if Metal detection logic in other files is consistent with setup.py
# Search for Metal availability checks in the codebase
rg -n "check_metal_availability|metal.*availability" --type=py -A3 -B3
# Look for platform-specific Metal checks
rg -n "mac_ver|arm64|MAYBE_METAL" --type=py -A2 -B2
```

Length of output: 2480
Replace arch-only Metal detection with feature-based detection
Both setup.py (MAYBE_METAL = platform.mac_ver()[2] == 'arm64') and tilelang/utils/target.py (check_metal_availability returning arch == 'arm64') assume Metal only on arm64 and thus exclude Intel Macs that may support Metal. Replace these ad-hoc checks with a single helper that verifies macOS release and Metal framework availability (e.g., presence of /System/Library/Frameworks/Metal.framework or attempting to load the framework) and use it both in setup.py and target selection.
Locations: setup.py (MAYBE_METAL, ~lines 51–62); tilelang/utils/target.py (check_metal_availability, ~lines 46–51; used at ~line 84).
🤖 Prompt for AI Agents
In setup.py around lines 51–62 and tilelang/utils/target.py around lines 46–51
and usage at ~84, replace the arch-only Metal detection with a feature-based
helper: add a cross-file helper is_metal_available() (in a shared util module or
a small platform_helpers.py) that returns True only if platform.system() ==
'Darwin', macOS release is new enough (parse platform.mac_ver()[0] or
platform.release()), and the Metal framework is present (check existence of
/System/Library/Frameworks/Metal.framework or attempt to load it via
ctypes.CDLL); then change MAYBE_METAL in setup.py to call is_metal_available()
instead of checking arm64, and replace check_metal_availability / arch ==
'arm64' in tilelang/utils/target.py to call this new helper so both files use
the same robust feature-detection logic.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
.github/workflows/amd_ci.yml (2)
41-45: Remove invalid pip flag--no-user.
pip install --no-userisn’t a valid option and will fail.- python -m pip install --upgrade pip --no-user + python -m pip install --upgrade pip - [[ -f requirements-test.txt ]] && \ - PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user + [[ -f requirements-test.txt ]] && \ + PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt
98-105: Remove `--no-user` in remaining pip installs.

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip
@@
- PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-rocm.txt
+ PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-rocm.txt
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
@@
- USE_ROCM=True pip install . --no-user
+ USE_ROCM=True pip install .
```

Also applies to: 112-112
setup.py (2)
271-279: `get_cmake_path` can raise `TypeError` when CMake is missing. Guard against `None` before `os.path.exists`.

```diff
 def get_cmake_path() -> str:
 @@
-    cmake_path = shutil.which("cmake")
-    if not os.path.exists(cmake_path):
+    cmake_path = shutil.which("cmake")
+    if not cmake_path or not os.path.exists(cmake_path):
         raise Exception("CMake is not installed, please install it first.")
     return cmake_path
```
843-857: Honor `USE_CUDA=False` and explicitly disable CUDA when using Metal. Current logic enables CUDA if `CUDA_HOME` is set even when `USE_CUDA=0`, and the Metal path doesn't force `USE_CUDA OFF`.

```diff
  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
          "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
- elif USE_ROCM:
+ elif USE_ROCM:
      content_lines += [
          f"set(USE_ROCM {ROCM_HOME})",
          "set(USE_CUDA OFF)",
      ]
- elif CUDA_HOME:
+ elif USE_CUDA and CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
```
♻️ Duplicate comments (4)
.github/workflows/ci.yml (1)
41-45: Remove invalid `--no-user` and avoid fragile GPU deps in format-check.

```diff
- python -m pip install --upgrade pip --no-user
+ python -m pip install --upgrade pip

- [[ -f requirements-test.txt ]] && \
-   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt --no-user
+ [[ -f requirements-test.txt ]] && \
+   PIP_NO_BUILD_ISOLATION=1 pip install -r requirements-test.txt

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
+ # GPU-specific deps not needed for formatting
+ true

- pip install flash_attn==2.5.8 --no-user --no-build-isolation
- pip install . --no-user
+ # Skip GPU-only deps here
+ pip install .

- pip install . --no-user -v
+ pip install . -v
```

```diff
- cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=ON; cd ..
+ # Do not force CUDA for formatting
+ cd build; cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DUSE_CUDA=OFF; cd ..
```

Also applies to: 101-103, 109-109
.github/workflows/metal_ci.yml (2)
84-86: Don't source `.venv` when using `setup-uv` with activation. The UV action already activates the environment; `.venv` may not exist.

```diff
- source .venv/bin/activate
- uv pip install -v --no-build-isolation .
+ uv pip install -v --no-build-isolation .
```
69-69: Fix cache key: `matrix.os` is undefined (actionlint error). Use the `runner` context to avoid runtime failure.

```diff
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-${{ runner.arch }}-py${{ env.PYTHON_VERSION }}
+ restore-keys: |
+   ${{ github.workflow }}-${{ github.job }}-${{ runner.os }}-
```

setup.py (1)
36-44: Boolean env parser misinterprets explicit false values (and ignores empty string).
`if env := os.environ.get(name):` skips `''`, so `elif env in ['', ...]` is unreachable. This can enable backends unintentionally.

```diff
-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ("1", "true", "on", "yes", "y"):
+        return True
+    if s in ("0", "false", "off", "no", "n", ""):
+        return False
+    return default
```
🧹 Nitpick comments (13)
.github/workflows/amd_ci.yml (4)
11-11: Don’t hard-disable AMD CI; gate via repo var or input.

Replace `if: false` with a controllable switch so maintainers can re-enable without code changes.

- if: false
+ if: ${{ vars.DISABLE_AMD_CI != '1' }}
65-65: Same here: avoid permanent disable.

Use a variable-gated condition instead of `if: false`.

- if: false
+ if: ${{ vars.DISABLE_AMD_CI != '1' }}
24-26: Upgrade setup-python to v5.

`actions/setup-python@v2` is deprecated.

- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
  with:
    python-version: ${{ env.PYTHON_VERSION }}

Also applies to: 79-81
15-15: Trim trailing whitespace in permissions.

- contents: write 
+ contents: write

.github/workflows/ci.yml (4)
10-10: Avoid permanent disable; gate via a repo variable.

- if: false
+ if: ${{ vars.DISABLE_NVIDIA_CI != '1' }}
66-66: Same for build-test-nvidia: use a toggle instead of `if: false`.

- if: false
+ if: ${{ vars.DISABLE_NVIDIA_CI != '1' }}
23-25: Update to `actions/setup-python@v5`.

- uses: actions/setup-python@v2
+ uses: actions/setup-python@v5
  with:
    python-version: ${{ env.PYTHON_VERSION }}

Also applies to: 80-82
14-14: Trailing space after `contents: write`.

- contents: write 
+ contents: write

.github/workflows/metal_ci.yml (2)
13-13: Trim trailing whitespace.

- contents: write 
+ contents: write
40-40: Consider generator parity with setup.py.

If the build expects Ninja elsewhere, specify `-G Ninja` here too for consistency, or rely on the default uniformly.

setup.py (3)
51-53: Metal detection: prefer `platform.system()`/`platform.machine()`.

`platform.mac_ver()` is brittle; simplify and be explicit.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == "Linux"
+MAYBE_METAL = (platform.system() == "Darwin" and platform.machine() == "arm64")
900-903: Classifiers/platforms: include macOS when Metal builds are possible.

- platforms=[
-     "Environment :: GPU :: NVIDIA CUDA" if not USE_ROCM else "Environment :: GPU :: AMD ROCm",
-     "Operating System :: POSIX :: Linux",
- ],
+ platforms=[
+     ("Environment :: GPU :: NVIDIA CUDA" if (USE_CUDA and not USE_ROCM) else
+      "Environment :: GPU :: AMD ROCm" if USE_ROCM else
+      "Environment :: GPU :: Apple Metal"),
+     ("Operating System :: MacOS" if USE_METAL else "Operating System :: POSIX :: Linux"),
+ ],
103-106: Long exception message (TRY003). Consider shortening or using a constant; non-blocking.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between fd8a47bf287afd3a2435815e3b9370d4e2ec4d49 and 608fef209e5a99d4fe6a6cc6d1feb5b1abcf7092.
📒 Files selected for processing (4)
- .github/workflows/amd_ci.yml (2 hunks)
- .github/workflows/ci.yml (2 hunks)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (11 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
        python-version: ${{ env.PYTHON_VERSION }}

    - name: Ensure venv (local & persistent)
      run: uv pip install -r requirements-test.txt -r requirements-build.txt
Guard optional requirements to avoid failures if the file is missing.
- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+ set -e
+ [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+ [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+   set -e
+   [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+   [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt
🤖 Prompt for AI Agents
In .github/workflows/metal_ci.yml around line 80, the step currently tries to
install optional requirement files unconditionally which will fail the job if
those files are missing; change the run command to detect each optional
requirements file (e.g., requirements-test.txt and requirements-build.txt) and
only include them in the pip install command if they exist (use a shell
conditional to build the -r arguments or test -f before calling pip), ensuring
that missing files are skipped but real install errors still fail the job.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
setup.py (1)
271-279: get_cmake_path can raise TypeError when cmake is missing

shutil.which() returns None; os.path.exists(None) throws. Guard None explicitly.
  def get_cmake_path() -> str:
@@
-     cmake_path = shutil.which("cmake")
-     if not os.path.exists(cmake_path):
-         raise Exception("CMake is not installed, please install it first.")
-     return cmake_path
+     cmake_path = shutil.which("cmake")
+     if not cmake_path or not os.path.exists(cmake_path):
+         raise RuntimeError("CMake is not installed or not on PATH.")
+     return cmake_path
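To illustrate the failure mode the comment describes (a minimal standalone sketch, not project code; the missing binary name is hypothetical):

```python
import os
import shutil

# shutil.which returns None when the binary is not on PATH.
path = shutil.which("no-such-binary-hopefully-xyz")
print(path)  # None

# os.path.exists only swallows OSError/ValueError internally,
# so passing None raises TypeError instead of returning False.
try:
    os.path.exists(path)
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```

This is why the suggested fix checks `if not cmake_path` before ever touching `os.path.exists`.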
♻️ Duplicate comments (5)
.github/workflows/metal_ci.yml (3)
84-85: Don’t source .venv with setup-uv; it’s already activated.

setup-uv with `activate-environment: true` exports the env; `.venv` may not exist.
- source .venv/bin/activate
- uv pip install -v --no-build-isolation .
+ uv pip install -v --no-build-isolation .
69-69: Fix actionlint error: undefined `matrix.os` in cache key

No matrix is defined; this breaks actionlint and can impact cache reuse.
- key: ${{ github.job }}-${{ matrix.os }}
+ key: ${{ github.job }}-${{ runner.os }}-${{ runner.arch }}
79-81: Guard optional requirement files

Unconditionally installing both files fails if one is absent.

- run: uv pip install -r requirements-test.txt -r requirements-build.txt
+ run: |
+   set -euo pipefail
+   [[ -f requirements-test.txt ]] && uv pip install -r requirements-test.txt
+   [[ -f requirements-build.txt ]] && uv pip install -r requirements-build.txt

setup.py (2)
51-62: Prefer feature-based Metal detection over arch-only heuristic

Intel Macs can support Metal; an arch-only check excludes them. Centralize detection (Darwin + Metal.framework presence) and reuse.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+
+def _has_metal() -> bool:
+    if platform.system() != 'Darwin':
+        return False
+    return os.path.exists('/System/Library/Frameworks/Metal.framework')
+
+MAYBE_METAL = _has_metal()
36-43: Boolean env parsing bug: empty string never reaches the “false” branch

The truthy guard prevents '' from being treated as false; explicit “0/false/off/''” can incorrectly return the default.

-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
🧹 Nitpick comments (4)
.github/workflows/metal_ci.yml (2)
12-14: Drop unnecessary write permission and fix trailing space

This job doesn't publish artifacts; reduce to read-only and remove the trailing space on Line 13.

- permissions:
-   contents: write
+ permissions:
+   contents: read
71-78: Ensure Ninja is present or fall back to default generator

CMake is invoked with “-G Ninja” in setup.py; missing Ninja will fail on some mac runners.

  - name: Install python via uv
    uses: astral-sh/setup-uv@v6
    with:
      enable-cache: true
      ignore-nothing-to-cache: true
      activate-environment: true
      python-version: ${{ env.PYTHON_VERSION }}
+
+ - name: Ensure Ninja
+   run: |
+     if ! command -v ninja >/dev/null 2>&1; then
+       brew update && brew install ninja
+     fi

Optionally, make setup.py fall back to the default generator if Ninja is absent.
setup.py (2)
823-825: Prefer Python3_EXECUTABLE for CMake

FindPython prefers Python3_EXECUTABLE; improves compatibility across CMake toolchains.

- f"-DPython_EXECUTABLE={sys.executable}",
+ f"-DPython3_EXECUTABLE={sys.executable}",
844-859: Always emit mutually exclusive GPU flags

When CUDA is selected, set USE_ROCM OFF; when ROCm is selected, set USE_CUDA OFF; when neither is selected, set both OFF explicitly to avoid stale cache effects.

  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
-         "set(USE_ROCM OFF)",
+         "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
  elif USE_ROCM:
      content_lines += [
          f"set(USE_ROCM {ROCM_HOME})",
          "set(USE_CUDA OFF)",
      ]
  elif CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
+ else:
+     content_lines += [
+         "set(USE_CUDA OFF)",
+         "set(USE_ROCM OFF)",
+     ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 608fef209e5a99d4fe6a6cc6d1feb5b1abcf7092 and 4c42d7eea70d878a010888bc4944585c059109e7.
📒 Files selected for processing (2)
- .github/workflows/metal_ci.yml (1 hunks)
- setup.py (11 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/metal_ci.yml
69-69: property "os" is not defined in object type {}
(expression)
🪛 YAMLlint (1.37.1)
.github/workflows/metal_ci.yml
[error] 13-13: trailing spaces
(trailing-spaces)
🪛 Ruff (0.12.2)
setup.py
103-105: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-metal
- GitHub Check: build-test-amd
- GitHub Check: bot-task
🔇 Additional comments (2)
setup.py (2)
102-111: Linux-only GPU requirement may be too strict

This hard-errors on Linux without CUDA/ROCM even for CPU-only users or CI. Confirm this is intentional; otherwise downgrade to a warning or gate behind an env var.
If relaxing is desired:
-if IS_LINUX and not (CUDA_HOME or ROCM_HOME):
-    raise ValueError(
-        "Failed to automatically detect CUDA or ROCM installation. Please set the CUDA_HOME or ROCM_HOME environment variable manually (e.g., export CUDA_HOME=/usr/local/cuda or export ROCM_HOME=/opt/rocm)."
-    )
+if IS_LINUX and not (CUDA_HOME or ROCM_HOME):
+    logger.warning("No CUDA/ROCM detected; building CPU-only components.")
700-710: Remove CythonExtension from ext_modules when MAYBE_METAL is true

setup.py returns early for Metal without producing the Cython artifact, but ext_modules still declares CythonExtension — this can break packaging/editable installs.
File: setup.py (around lines 700–710)
- ext_modules=[
-     CMakeExtension("TileLangCXX", sourcedir="."),
-     CythonExtension("TileLangCython", sourcedir="."),
- ],
+ ext_modules=[
+     CMakeExtension("TileLangCXX", sourcedir="."),
+ ] + ([] if MAYBE_METAL else [CythonExtension("TileLangCython", sourcedir=".")]),
# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
# logger.info(command)
os.system(command)
🛠️ Refactor suggestion
Fix Darwin link flags for Cython build (non-Metal macs)
Using “-shared” on macOS is incorrect; use `-dynamiclib` and allow undefined symbols to resolve against Python at load time.
- command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
- # logger.info(command)
- os.system(command)
+ if sys.platform == "darwin":
+ ldflags = "-dynamiclib -undefined dynamic_lookup"
+ else:
+ ldflags = "-shared -pthread -fPIC"
+ command = f"{cc} {ldflags} -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
+ logger.info(command)
+ subprocess.check_call(command, shell=True)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
# logger.info(command)
os.system(command)

After:

# fixme: aarch64 darwin needs something like `-Lxxx -lpython3.12`
if sys.platform == "darwin":
    ldflags = "-dynamiclib -undefined dynamic_lookup"
else:
    ldflags = "-shared -pthread -fPIC"
command = f"{cc} {ldflags} -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
logger.info(command)
subprocess.check_call(command, shell=True)
🧰 Tools
🪛 Ruff (0.12.2)
765-765: Starting a process with a shell, possible injection detected
(S605)
🤖 Prompt for AI Agents
In setup.py around lines 762-765, the build command uses the Linux-style
"-shared" which is incorrect on macOS; change the link flags for Darwin to use
"-dynamiclib" and allow undefined symbols to be resolved by Python by adding
"-undefined dynamic_lookup" (and keep other flags as-is or remove -fPIC if
desired for macOS), so on macOS construct the command with -dynamiclib and
-undefined dynamic_lookup instead of -shared before invoking os.system(command).
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
setup.py (2)
674-683: Copy macOS-built Python extension modules (.so) too; not just .dylib.

On macOS, C extensions are typically .so. Include both patterns.

-import glob
-files = glob.glob("*.so" if IS_LINUX else "*.dylib")
+import glob
+files = glob.glob("*.so")
+if not IS_LINUX:
+    files += glob.glob("*.dylib")
839-854: CMake GPU toggles: explicitly disable unused backends and honor USE_CUDA flag.
- In Metal branch, also set USE_CUDA OFF.
- In CUDA branch, respect USE_CUDA rather than presence of CUDA_HOME alone.
  if USE_METAL:
      content_lines += [
          "set(USE_METAL ON)",
          "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
      ]
  elif USE_ROCM:
@@
- elif CUDA_HOME:
+ elif USE_CUDA and CUDA_HOME:
      content_lines += [
          f"set(USE_CUDA {CUDA_HOME})",
          "set(USE_ROCM OFF)",
      ]
+ else:
+     content_lines += [
+         "set(USE_METAL OFF)",
+         "set(USE_ROCM OFF)",
+         "set(USE_CUDA OFF)",
+     ]
♻️ Duplicate comments (2)
setup.py (2)
36-44: Boolean env parsing still mishandles empty strings and unrecognized values; treat true/false explicitly.

The current truthiness gate skips empty strings; it is also missing yes/no aliases. Use explicit parsing.

-def _read_bool_env(name: str, default: bool = False) -> bool:
-    if env := os.environ.get(name):
-        env = env.lower()
-        if env in ['on', '1', 'true']:
-            return True
-        elif env in ['', 'off', '0', 'false']:
-            return False
-    return default
+def _read_bool_env(name: str, default: bool = False) -> bool:
+    val = os.environ.get(name)
+    if val is None:
+        return default
+    s = val.strip().lower()
+    if s in ('1', 'true', 'on', 'yes', 'y'):
+        return True
+    if s in ('0', 'false', 'off', 'no', 'n', ''):
+        return False
+    return default
753-761: Fix Darwin link flags for Cython artifact and avoid shell-based os.system.

Use -dynamiclib and -undefined dynamic_lookup on macOS; prefer subprocess.check_call without shell.

  cython = get_cython_compiler()
  if cython is None:
      raise Exception("Cython is not installed, please install it first.")
- os.system(f"{cython} {cython_wrapper_path} --cplus -o {source_path}")
+ subprocess.check_call([cython, str(cython_wrapper_path), "--cplus", "-o", str(source_path)])
  python_include_path = sysconfig.get_path("include")
  cc = get_cplus_compiler()
- if MAYBE_METAL:
-     cc += ' -Wl,-undefined,dynamic_lookup'
- command = f"{cc} -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I{python_include_path} {source_path} -o {temp_path}"
- logger.info(command)
- os.system(command)
+ if sys.platform == "darwin":
+     cmd = [
+         cc, "-dynamiclib", "-undefined", "dynamic_lookup",
+         "-fwrapv", "-O2", "-Wall", "-fno-strict-aliasing",
+         "-I", python_include_path, str(source_path), "-o", str(temp_path)
+     ]
+ else:
+     cmd = [
+         cc, "-shared", "-pthread", "-fPIC",
+         "-fwrapv", "-O2", "-Wall", "-fno-strict-aliasing",
+         "-I", python_include_path, str(source_path), "-o", str(temp_path)
+     ]
+ logger.info(" ".join(cmd))
+ subprocess.check_call(cmd)
🧹 Nitpick comments (3)
setup.py (3)
56-66: Refine Metal default detection; avoid `platform.mac_ver()[2]` and gate by OS/arch/version.

The arm64-only check is brittle. Prefer feature/OS checks and the MPS minimum macOS.

-IS_LINUX = platform.system() == 'Linux'
-MAYBE_METAL = platform.mac_ver()[2] == 'arm64'
+IS_LINUX = platform.system() == 'Linux'
+
+def _is_apple_silicon_macos(min_ver: str = "12.3") -> bool:
+    if platform.system() != "Darwin":
+        return False
+    try:
+        from packaging.version import Version
+        mac_ver = platform.mac_ver()[0] or "0"
+        return platform.machine().lower() in ("arm64", "aarch64") and Version(mac_ver) >= Version(min_ver)
+    except Exception:
+        return platform.machine().lower() in ("arm64", "aarch64")
+
+MAYBE_METAL = _is_apple_silicon_macos()

Also consider centralizing this helper (shared with target selection) to keep detection consistent across files.
920-930: Optional: expand classifiers/platforms for Metal/macOS builds.

Consider adding a macOS classifier and GPU backend markers reflecting Metal when enabled.
46-69: Optional: document env var precedence and defaults in README.

Clarify how USE_CUDA/USE_ROCM/USE_METAL interact and the defaults per platform to reduce surprise.
I can draft a short “Build matrix and env vars” section for README if helpful.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 4c42d7eea70d878a010888bc4944585c059109e7 and 7037ce81e29c74ac7a31a48b1e795aca5254e496.
📒 Files selected for processing (1)
setup.py(11 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
setup.py (1)
tilelang/env.py (1)
get(136-139)
🪛 Ruff (0.12.2)
setup.py
108-110: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-metal
- GitHub Check: format-check
- GitHub Check: format-check
🔇 Additional comments (3)
setup.py (3)
400-401: LGTM: Linux-only rpath patch guard.
70-86: LGTM: platform-specific prebuilt artifact lists and .dSYM cleanup.
619-623: LGTM: Extension kwargs passthrough for CMakeExtension.
Actionable comments posted: 1
♻️ Duplicate comments (1)
testing/python/metal/test_metal_codegen.py (1)
52-53: Tensor dimensions are correct for matrix multiplication.

The tensor dimensions in the test are properly set up for matrix multiplication `A @ B`:

- `a` has shape `(M, K)`
- `b` has shape `(K, N)`
- `c` has shape `(M, N)`

This correctly implements the mathematical operation C = A × B where the inner dimension K is consistent.
Also applies to: 55-57
🧹 Nitpick comments (2)
testing/python/metal/test_metal_codegen.py (2)
18-20: Consider memory alignment for shared memory allocations.

The shared memory allocations for `A_shared` and `B_shared` may benefit from memory alignment optimizations to improve performance on Metal. Consider aligning the allocation sizes to powers of 2 or Metal's preferred alignment boundaries.
79-80: Add error message when MPS is not available.

The test silently does nothing when MPS is not available. Consider adding an informative message or skip marker to clarify why tests aren't running.

  if __name__ == "__main__":
      if torch.mps.is_available():
          tilelang.testing.main()
+     else:
+         print("MPS not available. Skipping Metal backend tests.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 7037ce81e29c74ac7a31a48b1e795aca5254e496 and 5dafa7782e286ebec4ba8a41cc41060259260d34.
📒 Files selected for processing (1)
testing/python/metal/test_metal_codegen.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/metal/test_metal_codegen.py (6)
tilelang/jit/__init__.py (1)
  jit (241-314)
tilelang/language/allocate.py (2)
  alloc_shared (21-36)
  alloc_fragment (53-64)
tilelang/language/fill.py (1)
  clear (24-48)
tilelang/language/pipeline.py (1)
  Pipelined (9-46)
tilelang/language/copy.py (1)
  copy (84-152)
tilelang/jit/kernel.py (1)
  kernel_source (470-471)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-test-amd
- GitHub Check: build-test-metal
- GitHub Check: bot-task
🔇 Additional comments (2)
testing/python/metal/test_metal_codegen.py (2)
63-63: Good test coverage for kernel source verification.

The assertion that `kernel_source is not None` properly validates that the Metal backend generates source code as expected.
25-26: Verify coalesced_width=2 for Metal backend.
T.copy uses coalesced_width=2 in testing/python/metal/test_metal_codegen.py (lines 25–26, 31). Confirm that 2 is the optimal coalescing width for the Metal backend and the tensor element type; if not, update to the backend-specific value or make it configurable.
def test_gemm_float16():
    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
🛠️ Refactor suggestion
High tolerance values for float16 and int32 tests.
The tolerance value of atol=1 for float16 and int32 tests seems quite high. For int32, the results should match exactly (atol=0), and for float16, a smaller tolerance might be more appropriate to catch potential precision issues.
def test_gemm_float16():
- assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1)
+ assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='float16', atol=1e-2)
def test_gemm_int32():
- assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='int32', atol=1)
+    assert_gemm(1024, 1024, 1024, 16, 16, 16, dtype='int32', atol=0)

Also applies to: 75-75
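As a sanity check on the int32 claim above, integer matmuls are bit-exact as long as accumulators don't overflow, so `atol=0` is a reasonable expectation. A small NumPy sketch (hypothetical sizes, not the Metal code path):

```python
import numpy as np

rng = np.random.default_rng(0)
# Small magnitudes keep every dot product well inside int32 range:
# |sum| <= 32 * 8 * 8 = 2048, far below 2**31.
a = rng.integers(-8, 8, size=(64, 32), dtype=np.int32)
b = rng.integers(-8, 8, size=(32, 48), dtype=np.int32)

c32 = a @ b                                    # int32 accumulation
c64 = a.astype(np.int64) @ b.astype(np.int64)  # wider reference

# Integer GEMM has no rounding error, so the results agree exactly.
print(np.array_equal(c32, c64.astype(np.int32)))  # True
```

Float16, by contrast, accumulates rounding error with K, which is why a small nonzero `atol` (or an rtol-based check) is the right shape for that test.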
🤖 Prompt for AI Agents
In testing/python/metal/test_metal_codegen.py around lines 71 and 75, the tests
use an overly large absolute tolerance (atol=1) for float16 and int32; change
the int32 assertions to use atol=0 (exact match) and tighten the float16
tolerance (e.g., atol=1e-2 or another small value appropriate for float16
precision) so the tests detect precision regressions while still allowing
expected float16 rounding error.
Currently the metal target only supports the `torch` execution backend.
* Reset
* Fix other CUDA issue
* fmt
* fmt
* fix cuda error
* fix
* fix
* fmt
* cleanup
* fix
* remove copyright
* trivial update
* readme update
* lint fix

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* [FFI] Structural equal and hash based on reflection (apache#18156) This PR adds initial support for structural equal and hash via the new reflection mechanism. It will help us to streamline the structural equality/hash with broader support and clean error reports via AccessPath. It also gives us the ability to unify all struct equal/hash registration into the extra meta-data in reflection registration.
* [Fix][Relax] Fix potential out-of-bounds access in `TupleRewriterNode` (apache#18120) * Root cause * Update
* [BugFix] Fix NCCL build with GlobalDef registration (apache#18158) This PR fixes a build failure in nccl.cc due to the recent switch of global function registration.
* [FFI][REFACTOR] Introduce TypeAttr in reflection (apache#18160) This PR introduces TypeAttr to reflection to bring extra optional attribute registration that can be used to extend behaviors such as structural equality. Also renames TypeExtraInfo to TypeMetadata for better clarity.
* [TIR] Fix trivial index map [] -> [0] (apache#18154) fix trivial index map [] -> [0] Co-authored-by: wrongtest <wrongtest@gmail.com>
* [Relax] Fix issue in fuse concat ops by pattern (apache#18163) * [Relax] Fix issue in fuse concat ops by pattern * fix lint
* [FFI][REFACTOR] Enable custom s_hash/equal (apache#18165) This PR enables custom shash equal via TypeAttr, and also enhances the Var comparison by checking content so we can precheck type signatures.
* [FFI][REFACTOR] Migrate StructuralEqual/Hash to new reflection (apache#18166) This PR migrates the StructuralEqual/Hash to the new reflection based approach. The original mechanisms are still kept around and we will phase them out in followup PRs. The new mechanism unifies the structural equal/hash registration with the normal reflection registration and also brings a cleaner implementation for mismatch detection.
* [Web] Fix incompatible part after FFI updates (apache#18168) fix: incompatible parts with the current core
* [REFACTOR][FFI] Phase out SEqualReduce/SHashReduce (apache#18172) This PR phases out the old SEqualReduce/SHashReduce mechanism in favor of the new reflection mechanism via ffi/reflection. It helps us to reduce the places we need to register the reflection related information. See the current IR examples for upgrading to the new mechanism.
* [FFI] Improve string equal/hash handling (apache#18176) This PR improves the string equal hash handling by improving some of the efficiencies.
* [FFI][REFACTOR] Isolate out extra API (apache#18177) This PR formalizes the extra API in FFI. The extra APIs are a minimal set of APIs that are not required in the core mechanism, but still helpful. Moves structural equal/hash to the extra API.
* [Misc][BugFix] Fix missing PadAttrs register in op_attrs.py (apache#18174) * [COMMUNITY] Add new key for release signing * [Misc][BugFix] Fix missing PadAttrs register in op_attrs.py
* [FFI] Fix SmallMapInit with duplicated keys (apache#18178) This PR fixes small map init when there are duplicated keys.
* [Fix] Fix the wrong check for tuple node in apache#18163 (apache#18170)
* [REFACTOR] Upgrade NestedMsg<T> to use new ffi::Any mechanism (apache#18181) This PR upgrades NestedMsg<T> to use the new ffi::Any mechanism, which will enable us to get better support and enable NestedMsg for POD types.
* [FFI][REFACTOR] Cleanup to align to latest ffi (apache#18183) This PR modernizes legacy use to align with the latest FFI. - Use Any to represent general Any instead of ObjectRef - Use Optional<T>.has_value() instead of defined
* [FFI][REFACTOR] Hide StringObj/BytesObj into details (apache#18184) This PR hides StringObj/BytesObj into details and brings implementations to directly focus on the String/Bytes.
This change will prepare us for future changes such as SmallStr support. Also moves more ObjectRef into Any in RPC.
* [FFI] Introduce small string/bytes (apache#18185)
* [Relax] Fix Relax Operator PReLU (apache#18179)
* [FFI] Lightweight json parser/writer (apache#18186) This PR adds a lightweight json parser/writer to the extra component.
* [FFI][EXTRA] Serialization To/From JSONGraph (apache#18187)
* [FFI][REFACTOR] Migrate the Save/Load JSON to the new reflection (apache#18188) This PR migrates the Save/Load JSON to the new reflection based mechanism. This is a breaking change that updates the JSON format to ffi/extra/serialization to handle the serialization; see the json graph schema comment in ffi/extra/serialization.h for the format, which roughly aligns with the old style. After this change, we no longer need node/reflection and the reflection vtable. We can also phase out TVM_REGISTER_NODE and TVM_REGISTER_OBJECT to have a single place that defines the reflection.
* [FFI][REFACTOR] Phase out getattr based attribute handling (apache#18189) This PR phases out getattr based attribute handling as it is slower and introduces an extra code path. This does mean that if an Object is not explicitly registered on the python side, we will no longer be able to access the field by name. Likely this is also desirable, as we would like to enable faster use that updates the python end and does not rely on this behavior.
* [FFI][REFACTOR] Refactor AccessPath to enable full tree repr (apache#18191) This PR refactors AccessPath so it can be used to represent a full tree with compact memory.
Also fixes a bug in the cython method export.
* [FFI] Phase out ObjectPath in favor of AccessPath (apache#18192) This PR phases out ObjectPath in favor of AccessPath.
* Bump cutlass_fpA_intB_gemm to latest commit (apache#18193)
* [REFACTOR] Update data type rewriter to enable recursive rewrite in Any (apache#18197) This PR updates the data type rewriter so recursive rewrite happens in Any instead of ObjectRef.
* [Relax][ONNX] Parse ONNX Upsample to Relax resize2d (apache#18180)
* [TIR] Fix host/device function check for build (apache#18199) This PR fixes a bug in deciding whether a function is a host or device function in TIR build. Previously the decision was made based on checking whether `"cpu"` is a substring of the target string. This check fails to work for the ROCm target, which usually comes with an `"mcpu"` attribute that also contains `"cpu"`. This PR fixes it by checking target kind. Targets with kind `"llvm"` or `"c"` will be treated as host functions.
* [FFI][REFACTOR] Move Downcast out of ffi for now (apache#18198) Downcast was added for backward compat reasons and it has duplicated features with Any.cast. This PR moves it out of ffi to node for now so the ffi part contains a minimal set of implementations.
* [FFI][REFACTOR] Update Map ABI to enable flexible smallMap switch (apache#18200) This PR updates the Map ABI to use the MSB in slots_ to indicate SmallMap. The change would open doors for future changes to the small map boundary switch.
* [Fix] Codegen fix for relax cutlass (apache#18190) * Codegen fix --------- Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
* [LLVM][CPPTEST] Small fixes for LLVM >= 20 (apache#18202) This PR updates cpptest to run on LLVM >= 20.
* [LLVM] Fixes up to the latest LLVM21 (apache#18204) This PR fixes TVM use with the latest LLVM version 21. - At this time LLVM21 is available as a release candidate.
- Double checks for backward compatibility down to LLVM10
* [FFI][REFATOR] Cleanup entry function to redirect (apache#18205) This PR updates the entry function mechanism to create a stub that redirects to the real function. This new behavior helps to simplify the runtime logic supporting the entry function. Also updates the name to `__tvm_ffi_main__`
* [CODEGEN][REFACTOR] tir.call_llvm_intrin to remove nargs (apache#18206) This PR refactors tir.call_llvm_intrin to omit the first nargs argument in the beginning. Previously the nargs was introduced when prefetch had a different number of signatures. The previous reason no longer stands as of now, and it is less intuitive to attach nargs for call_llvm_intrin, where nargs directly appears in the number of arguments. After the update, tir.call_llvm_intrin can directly pass in the arguments as they are.
* [FFI][Fix] Update datatype registry calls to the new paths (apache#18208)
* [TARGET] Add target for nvidia rtx 5060ti (apache#18211)
* [Bug] Fix core dump in InferLayoutRMSNorm and fix typo (apache#18210)
* [FFI] Make JSON Parser/Write fastmath safe (apache#18212) This PR adds fallbacks for nan and inf detection/creation under fastmath mode.
* [ONNX][FRONTEND][Fix] Update Resize to accept ShapeExpr (apache#18209)
* [Fix][ONNX] No precision widening for numpy binary operations (apache#18207)
* [FFI][REFACTOR] Establish ffi.Module in python (apache#18214) This PR refactors and establishes ffi.Module under the python tvm ffi api. Also moves the export_library method to executable so it aligns more with the compiled artifact.
* [FFI][REFACTOR] Establish Stream Context in ffi (apache#18216)
  This PR sets up the stream context in ffi and migrates the existing per-device-API stream context management to the ffi env API. The new API helps streamline stream-related integration for most libraries.
* [FFI] AutoDLPack compatible with torch stream context (apache#18217)
  This PR updates the autodlpack path to automatically update the env stream to be consistent with the torch stream context. The change helps make FFI functions compatible with stream-based execution. We leverage torch cpp_extension load_inline to create an efficient query function; the first load may take more time to build the jit module, and things should be fast after the torch jit module is cached.
* [FFI][REFACTOR] Cleanup API locations (apache#18218)
  This PR cleans up the env api and moves it to extra.
* [Fix] Resolve deadlock in PopenPoolExecutor and LocalBuilder (apache#18219)
  - Add explicit shutdown flag in PopenPoolExecutor
  - Replace del with explicit shutdown() calls in LocalBuilder
* [Thrust] Fix getting CUDA stream (apache#18220)
  This PR updates the `GetCUDAStream` in the CUDA thrust integration to the latest `TVMFFIEnvGetCurrentStream` interface.
* [NVSHMEM] Fix compatibility with CUDA code without nvshmem use (apache#18222)
  This PR fixes two bugs that caused normal TIR functions (ones that don't use any NVSHMEM API) to fail to compile and run in cases where `set(USE_NVSHMEM xxx)` is enabled.
  Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu>
* [FFI] Fix JSON parser/writer for the fast-math flag (apache#18221)
  This PR fixes the JSON parser and writer to support the fast-math flag. Prior to this PR, the following error occurred when compiling TVM with fast-math; this PR fixes the issue.
  ```
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc: In static member function ‘static bool tvm::ffi::json::JSONWriter::FastMathSafeIsNaN(double)’:
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc:69:22: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     69 |   uint64_t bits = *reinterpret_cast<const uint64_t*>(&x);
        |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc: In static member function ‘static bool tvm::ffi::json::JSONWriter::FastMathSafeIsInf(double)’:
  /home/ruihangl/Workspace/tvm/ffi/src/ffi/extra/json_writer.cc:84:22: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
     84 |   uint64_t bits = *reinterpret_cast<const uint64_t*>(&x);
        |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```
* [ROCm] Minor fixes for latest refactor (apache#18225)
  This PR fixes a few ROCm and hipBLAS build issues after recent refactors.
* [CI] Exit the build for AbortException (apache#18227)
* [FFI][REFACTOR] Establish tvm_ffi python module (apache#18226)
  This PR establishes tvm_ffi as a standalone python module. The ffi is structured as a minimal pip module that can be directly installed by path or URL; examples/get_started provides a minimal example. This is a major change: as we are decoupling tvm_ffi into a separate package, users need to install tvm_ffi separately. Thanks to its minimal dependencies, tvm_ffi can be easily installed even just from source via `pip install ./ffi`. This change enables future improvements for library plugins to have lightweight dependencies by working on top of tvm_ffi alone, while the main compiler toolchain and runtime can be layered on top.
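The strict-aliasing error above comes from dereferencing a type-punned pointer; the usual fix is a byte-level copy (std::memcpy in C++). The same bit-pattern test can be sketched in Python, as an illustration of the idea rather than the actual tvm-ffi code:

```python
import struct

def double_bits(x: float) -> int:
    # Copy the double's bytes into an integer — the safe analogue of
    # std::memcpy, with no type-punned pointer dereference.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def fastmath_safe_isnan(x: float) -> bool:
    # NaN: exponent bits all ones, mantissa nonzero.
    exponent = (double_bits(x) >> 52) & 0x7FF
    mantissa = double_bits(x) & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa != 0

def fastmath_safe_isinf(x: float) -> bool:
    # Inf: exponent bits all ones, mantissa zero.
    exponent = (double_bits(x) >> 52) & 0x7FF
    mantissa = double_bits(x) & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa == 0
```

Testing the raw bit pattern this way keeps working even when `-ffast-math` makes the compiler assume `x != x` is always false.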
* [FFI] Improve traceback setups
  This PR improves traceback-related setups
* [FFI] Robustify the pyproject setup (apache#18233)
  This PR robustifies the pyproject setup for compatibility with cibuildwheel
* [LLVM][Fix] Do not emit debuginfo on vscale or other unknown types (apache#18232)
* [FFI] Misc fixup for windows (apache#18234)
  This PR cleans up the ffi module to make it compatible with Windows.
* [FFI][BUGFIX] Fix type_traits on DataType after SmallStr update (apache#18237)
  This PR fixes the type_traits on DataType after the SmallStr update. We need to explicitly zero out the FFIAny data structure to allow fast comparison of FFIAny based on byte values.
* [CUTLASS] Fix CUTLASS kernel compilation (apache#18238)
  This PR fixes a few places in the current CUTLASS kernel AOT compilation.
* [Relax] ONNX frontend using relax softplus operator (apache#18242)
* [Relax] Add Python function support and BasePyModule for PyTorch integration (apache#18229)
  ### **Overview**
  This PR implements native Python function support in TVM Relax through the `@I.pyfunc` decorator and `BasePyModule`, which enable seamless integration between TVM's compilation pipeline and Python/PyTorch runtime environments. This enhancement allows users to write Python functions directly in TVMScript that can interoperate with Relax and TIR functions, providing enhanced debugging capabilities and leveraging existing PyTorch operator libraries.
  ### **Key Features**
  **TVMScript Parser Enhancement**
  - `@I.pyfunc` decorator: Marks Python functions for integration into IRModules
  - Dual storage format: Stores both the raw string representation (for TVMScript printing) and the captured PackedFunc (for runtime execution)
  - ExternFunc representation: Each Python function is represented as an ExternFunc node with attributes storing source code and runtime wrapper
  **Complete BasePyModule Implementation**
  - DLPack-based tensor conversion: Seamless conversion between PyTorch tensors and TVM NDArrays
  - Cross-function interoperability: Python functions can call Relax/TIR functions and vice versa
  - JIT compilation: Delays compilation until module instantiation for flexible late-stage modifications
  - Dynamic function registration: Supports runtime addition of Python functions
  ### Future Work
  - TVMScript printer for IRModules with Python functions: Print IRModules in proper format with high-level operator mapping from Relax ops to PyTorch ops, handling symbolic shapes
  - R.call_py_func primitive: Introduce a Relax primitive to invoke the corresponding PackedFunc of specified Python functions at runtime
* [Fix] Update FlashInfer JIT header lookup (apache#18244)
  This PR fixes the tvm/dlpack/dmlc header lookup in the FlashInfer kernel JIT compilation. Prior to this fix, the JIT compilation assumed the environment variable `TVM_SOURCE_DIR` is always defined, which is not always true. This PR fixes the behavior and considers multiple cases, including TVM source builds and pip-installed packages.
* [LLVM][MSWIN][CI] Fix LLVM module build with latest CI update (apache#18245)
* [FFI][CMAKE] Add missing download path for libbacktrace (apache#18246)
* [Build] Migrate Python packaging to pyproject.toml with scikit-build-core (apache#18239)
  This PR migrates the TVM Python packaging system from the setup.py flow to the modern, PEP 517/518 compliant pyproject.toml standard, which allows us to produce a single, Python-version-agnostic wheel.
  This change streamlines the process for both developers and users. For local development, you can now set up a fully-functional editable environment with a single command: `pip install -e .`. To create the distributable package for release, simply run `pip wheel -w dist .`, which will produce a universal wheel in the `dist/` folder. This ensures that end-users can reliably install TVM with a standard `pip install tvm`, regardless of their specific Python 3 version.
* [FFI][CMAKE] Revert cmake libbacktrace URL and update submodule (apache#18249)
  * Revert the URL out from cmake for libbacktrace
  * Switch git submodule to upstream HEAD instead
  As discussed in apache#18246 (comment), this reverts in favour of the git submodule approach. As found in the same discussion, the upstream [already](https://github.com/ianlancetaylor/libbacktrace/blob/793921876c981ce49759114d7bb89bb89b2d3a2d/macho.c#L1273-L1275) incorporates [the one patch](ianlancetaylor/libbacktrace@master...tlc-pack:libbacktrace:master) used, and macOS works fine.
* [Python] Update version.py to bump pyproject.toml automatically (apache#18248)
  This PR updates `version.py` so that every time this file is run, it also bumps the version number in `pyproject.toml` automatically.
* [Python] Complete Python packaging with scikit-build-core (apache#18251)
  Following apache#18239, this PR fixes a few issues we ran into while testing the packaging flow through scikit-build-core.
* upgrade cutlass v4.2.0 supporting cuda 13 (apache#18236)
* [FFI][ABI] ABI updates for future metadata and complex ordering (apache#18254)
  This PR updates the ABI to support a potential future need to get metadata from a dynamically loaded module. It orders the current static objects into simple objects that have a C ABI and more complex ones that may need C++. These changes make the ABI future-compatible before we freeze it.
* [FFI][DOCS] Wheel Packaging (apache#18256)
  This PR adds an example of wheel packaging. Also fixes various minor source-packaging nits.
* [FFI] fix two seemingly migration issue (apache#18258)
* [FFI][ABI] Introduce weak rc support (apache#18259)
  This PR adds weak ref counter support to the FFI ABI.
  - When the strong rc goes to zero, we call the destructor of the object but do not free the memory
  - When both the strong and weak rc go to zero, we perform the memory free operation
  The weak rc mechanism is useful when we want to break cyclic dependencies in objects, where the weak rc can keep the memory alive while the destructor is still called. As of now, because we deliberately avoid cycles in the codebase, we do not have a strong use-case for weak rc. However, given that weak rc is common practice in shared_ptr and Rust Rc, and is also used in torch's c10::intrusive_ptr, it is better to make sure the ABI is future-compatible with such use-cases before we freeze. This PR implements the weak rc as a u32 counter and the strong rc as a u64 counter, with the following design considerations.
  - Weak rc is very rarely used and u32 is sufficient.
  - Keeping weak rc in u32 keeps the object header size at 24 bytes, saving an extra 8 bytes (considering alignment)
  We also need to update the deleter to take flags that cover both weak and strong deletion events. The implementation optimizes the common case where both strong and weak go to 0 at the same time, calling the deleter once with both flags set.
* [FFI][DOCS] Add missing files in packaging example (apache#18261)
  This PR adds the missing files in the packaging example and also renames get_started to quick_start
* [BugFix][NNAPI] Use kind() instead of type_key() after FFI refactor (apache#18262)
  This commit updates nnapi_runtime.cc to override kind() instead of type_key(), aligning NNAPI with the new FFI interface.
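The strong/weak refcount lifecycle described above can be modeled with a toy sketch. This is a hypothetical Python model of the semantics, not the tvm-ffi ABI itself:

```python
class RcState:
    """Toy model: strong rc -> 0 runs the destructor; memory is freed
    only once both strong and weak rc reach 0."""

    def __init__(self):
        self.strong = 1       # u64 in the real ABI
        self.weak = 0         # u32 in the real ABI, keeping the header at 24 bytes
        self.destructed = False
        self.freed = False

    def dec_strong(self):
        self.strong -= 1
        if self.strong == 0:
            self.destructed = True   # call the destructor, keep the memory
            if self.weak == 0:
                self.freed = True    # common case: one deleter call, both flags set

    def dec_weak(self):
        self.weak -= 1
        if self.weak == 0 and self.strong == 0:
            self.freed = True        # last weak ref releases the memory
```

A weak reference held across the strong count reaching zero keeps the memory alive (so the weak ref can still be probed) even though the destructor has already run, which is what breaks reference cycles.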
  Behavior is consistent with other runtimes that were updated in commit b8eb80b.
* [FFI][DOCS] Initial docs scaffolding (apache#18263)
* [DOCS] Misc docs fix (apache#18264)
  This PR provides misc docs fixes: it updates the requirements of the ffi docs, removes stale webpages from the header, and updates the embedding script to allow a path.
* [Build] Complete TVM wheel building migration (apache#18252)
* [Relax] Building TVMScript printer for IRModules with Python functions (apache#18253)
  This PR implements a TVMScript printer to format IRModules containing `@I.pyfunc` decorated Python functions. Example:
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def python_func(self, x, y):
          x_tvm = self._convert_pytorch_to_tvm(x)
          y_tvm = self._convert_pytorch_to_tvm(y)
          result = self.call_tir(self.add_tir, [x_tvm, y_tvm], out_sinfo=R.Tensor((5,), "float32"))
          return self._convert_tvm_to_pytorch(result)

      @T.prim_func
      def add_tir(a: T.handle, b: T.handle, c: T.handle):
          A = T.match_buffer(a, (5,), "float32")
          B = T.match_buffer(b, (5,), "float32")
          C = T.match_buffer(c, (5,), "float32")
          for i in range(5):
              C[i] = A[i] + B[i]

  # Usage:
  print(MyModule.script())  # Print formatted TVMScript
  MyModule.show()           # Display formatted output
  ```
* [FFI] Update torch stream getter to use native torch c api (apache#18266)
  This PR updates the torch stream getter to use _cuda_getCurrentRawStream from the torch C API that is also used by dynamo, saving us from load_inline-ing a custom module.
* [FFI] Support Opaque PyObject (apache#18270)
  This PR adds support for Opaque PyObject.
  When a type in python is not natively supported by ffi, it will now be converted to an Opaque PyObject on the backend. Such opaque objects retain their lifecycle automatically and can still be used by registering python callbacks, stored in containers, and returned to the frontend.
* Round of grammar polishment
* [FFI] Support inline module (apache#18271)
  This PR adds initial support for load_inline in tvm_ffi
* [FFI] Construct NDArray.strides by default (apache#18272)
  This PR updates NDArray.strides to construct strides by default
* [FFI][ABI] Append symbol prefix for ffi exported functions (apache#18273)
  Previously we simply took the raw symbol for DSO libraries. This can cause symbol conflicts between functions that take the ffi calling convention and those that do not. This PR updates the convention to ask LLVM and library modules to always append a `__tvm_ffi_` prefix to function symbols; this way we no longer have conflicts in the TVM_FFI_EXPORT_DLL_TYPED macro
* [FFI] Update the interface of `ffi.load_inline` to match torch (apache#18274)
  This PR updates the interface of ffi.load_inline to match torch.utils.cpp_extensions.load_inline:
  - Rename cpp_source to cpp_sources, cuda_source to cuda_sources.
  - Unify the cpp_functions and cuda_functions into functions.
  - Add build_directory to allow the user to specify the build directory directly.
* [FFI][REFACTOR][ABI] Rename NDArray to Tensor (apache#18275)
  This PR updates NDArray => Tensor. Both tensor and ndarray are commonly used terms. Because the term Tensor is becoming more common in the context of ML, we do the rename to stay more aligned with torch.Tensor and DLTensor.
* [FFI] Add ffi::Tensor.strides() (apache#18276)
* [FFI][REFACTOR] Cleanup tvm_ffi python API and types (apache#18277)
  This PR cleans up the python API to make things more consistent with the existing Python array API and torch.
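The default strides constructed for a dense tensor follow the usual C-contiguous (row-major) convention. A minimal sketch of that computation, illustrating the convention rather than the tvm-ffi implementation:

```python
def default_strides(shape):
    # Row-major strides in elements: the last axis is contiguous
    # (stride 1), and each earlier axis strides over the product of
    # the extents of all axes after it.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(default_strides([2, 3, 4]))  # [12, 4, 1]
```

This matches what numpy reports (in elements rather than bytes) for a freshly allocated contiguous array.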
  Device update
  - device_id => index, to be consistent with torch
  - device_type => dlpack_device_type() returns int
  - added type property, same as torch.device
  API updates:
  - Move convenience methods like cpu() out into the tvm runtime to keep device minimal
  - tvm_ffi._init_api => tvm_ffi.init_ffi_api
  - tvm_ffi.register_func => tvm_ffi.register_global_func
* [FFI] Temp skip load_inline tests nonlinux (apache#18278)
  This PR temporarily skips load_inline tests on non-Linux platforms before we enhance and improve them for other platforms.
* [LLVM][METASCHEDULE] Add RISCV V-extension v1.0 kernels to metaschedule (apache#18243)
  - Enables high-performance kernels covering the majority of usual ML datatype inputs
  - It is currently compliant with RVV specs version v1.0 (does not work with older v0.7.1)
  - TIR kernels implemented here use the recently added VLA extension support
* [FFI][DOCS] Initial bringup of cpp docs (apache#18279)
  This PR brings up the initial version of the cpp api docs.
* [FFI][Bugfix] Fix bug of `ffi.cpp.load_inline` on Windows (apache#18281)
  This PR enables load_inline on the Windows platform.
* [FFI][REFACTOR] Cleanup namespace (apache#18280)
  This PR cleans up the namespace to ensure all ffi classes are accessed through the ffi:: namespace. This helps clean up the ffi package before isolation.
  * fix hexagon
* [FFI] Relax default alignment and contiguous requirement (apache#18282)
  This PR relaxes the default alignment and contiguous requirement in dlpack import. This allows the ffi to be useful in most settings. We also provide utilities for users to check these requirements themselves.
* [Fix][Metal] Fix type for device array in Metal API (apache#18283)
  This PR fixes a typo in the previous ffi namespace cleanup.
* [Relax] Add Relax to Python Function Converter (apache#18269)
  ### Overview
  This PR implements a Relax to Python Function Converter that transforms Relax functions into executable Python functions using PyTorch operations.
  This enables seamless conversion between TVM's Relax IR and Python/PyTorch environments, providing enhanced debugging capabilities and leveraging existing PyTorch operator libraries for testing and deployment purposes.
  ### Key Features
  - **High-level operator mapping**: Maps 60+ Relax operators to corresponding PyTorch APIs
  - **Special operation handling**: Supports `call_tir`, `call_dps_packed`, and Relax function calls with DLPack integration
  - **Symbolic shape support**: Handles symbolic shapes and dynamic tensor operations
  ### **Example**
  ```python
  from tvm.relax.relax_to_pyfunc_converter import RelaxToPyFuncConverter

  # Convert Relax functions to Python functions
  converter = RelaxToPyFuncConverter(ir_module)
  converted_ir_mod = converter.convert("my_function")

  # Execute converted function with PyTorch tensors
  result = converted_ir_mod.pyfuncs['my_function'](input_tensor)
  ```
* [FFI][REFACTOR] Introduce UnsafeInit and enhance ObjectRef null safety (apache#18284)
  This PR enhances the null-pointer and general type safety of ObjectRef types. Previously, ObjectRef relied on the constructor from ObjectPtr<Object> for casting and for initialization from nullptr. We introduce a tag, ffi::UnsafeInit, which explicitly states the intent that the initialization is unsafe and may initialize a non-nullable Ref to null. Such a tag should only be used in controlled scenarios. Now the general RefType(ObjectPtr<Object>) is removed. We still keep RefType(ObjectPtr<ContainerType>) for nullable objects, but remove the default definition from non-nullable types, knowing that users can always explicitly add it to the class implementation (ensuring null checking).
* [Hotfix] Fix the conflicts about ffi-related updated names (apache#18287)
  * Change registration of mock softmax function
  * Update check_asf_header.sh
  * Remove unnecessary blank line in check_asf_header.sh
* [FFI][Bugfix] Enable `load_inline` on macos (apache#18285)
  This PR fixes the bug, enabling `tvm_ffi.cpp.load_inline` on macOS. We need to link `libtvm_ffi.dylib` into the custom module.
* [Metal] Fix MetalModuleCreate (apache#18290)
  This PR fixes a type mismatch in MetalModuleCreate when initializing a MetalModule. The error did not show up until the recent ObjectRef null-safety change.
* [3rdparty] Bump cutlass_fpA_intB_gemm to fix SM90 build (apache#18291)
  This PR fixes an SM90 build issue when CUTLASS is enabled. The issue is that a source file included a CUTLASS header file that has been removed since CUTLASS 4. Simply removing the header fixes the build issue.
* [FFI][REFACTOR] Streamline Object Declare Macros (apache#18289)
* [Fix] Set DRefObj and CUDAIPCMemoryObj as mutable (apache#18294)
  This PR marks `DRefObj` and `CUDAIPCMemoryObj` as mutable object classes. The flags were missed during the previous macro refactor.
* [FFI][ABI] Introduce generic stream exchange protocol (apache#18295)
  This PR adds a __tvm_ffi_env_stream__ protocol for generic tensors to exchange the env stream with tvm ffi. Also renames TVMFFIEnvSetStream to TVMFFIEnvSetCurrentStream.
* [FFI] Temp skip windows tests (apache#18297)
* [Fix] Add libxml2 dependency to fix Windows CI build failure (apache#18296)
* [FFI] Fix system library symbol lookup (apache#18298)
* [Relax] Add symbolic shape support to BasePyModule for dynamic tensor operations (apache#18288)
  This PR adds symbolic shape support to `BasePyModule`, which enables dynamic tensor operations with runtime shape inference.
  This allows users to use Relax's symbolic shape functionality in Python function calls through BasePyModule, with dimensions automatically resolved at execution time based on input tensor shapes.
  ## Usage Example
  ```python
  import tvm
  from tvm.script import ir as I, relax as R
  from tvm.relax.base_py_module import BasePyModule
  import numpy as np

  @I.ir_module
  class VectorAddModule(BasePyModule):
      @R.function
      def add(x: R.Tensor(("n",), "float32"), y: R.Tensor(("n",), "float32")) -> R.Tensor(("n",), "float32"):
          return R.add(x, y)

  module = VectorAddModule(device=tvm.cpu(0), target="llvm")
  a = np.array([1.0, 2.0, 3.0], dtype="float32")
  b = np.array([4.0, 5.0, 6.0], dtype="float32")
  result = module.add(a, b)  # Result: [5.0, 7.0, 9.0]
  ```
* [CUDA] Support NVTX in CUDA 13 (apache#18300)
  This PR adds support for NVTX in CUDA 13. Starting with CUDA 13, the nvtx functions have moved to the library `libnvtx3interop.so`, and the previous nvToolsExt library no longer exists. To ensure compatibility with both CUDA 12 and 13, we add libnvtx3interop.so to the library lookup list.
* [Python] Fix runtime tensor import (apache#18299)
  This PR fixes a few places where the python import of the runtime tensor is incorrect. The error wasn't revealed in the previous NDArray->Tensor rename PR since these imports are not at the top level.
* [FFI][REFACTOR] Refactor python ffi call mechanism for perf (apache#18302)
  This PR refactors the python ffi call mechanism. Previously, argument setting could become fragile, as things were sensitive to the order of if checks. This PR refactors the call path to leverage a C++-based dispatcher where each dispatch functor can be registered from Cython.
* Clear ext_lib_dll_names for macOS platform (apache#18304)
  Removed external library DLL names for macOS.
  found during tile-ai/tilelang#799 cc @LeiWang1999
* [Relax] Fix RelaxToPyFuncConverter compatibility and improve fallback handling (apache#18301)
  This PR fixes multiple compatibility issues in `RelaxToPyFuncConverter` caused by recent TVM API changes and improves the robustness of fallback tensor handling.
* [FFI][ABI][REFACTOR] Enhance DLPack Exchange Speed and Behavior (apache#18306)
  This PR enhances DLPack exchange by introducing DLPackPyObjectExporter, DLPackPyObjectImporter and DLPackTensorAllocator. These three function pointers help us speed up import/export with DLPack and also streamline the rare (but still sometimes useful) allocation inside the FFI. They significantly speed up autodlpack import. They also enable us to query the allocator from the env and return ffi::Tensor back to the caller environment (experimental): when a function takes a torch.Tensor as an argument, returned Tensor values will be converted to torch.Tensor. Also renames SetCurrentStream => SetStream to align with the style of the CUDA API. Finally, we add an option to select whether we release the GIL. We release the GIL by default, like ctypes; however, for short-running functions it may be helpful to set `func.release_gil = False`.
* [FFI] Update `load_inline` interface (apache#18307)
* [FFI][ABI] Refactor the naming of DLPack speed converter (apache#18308)
  Update the name to avoid potential confusion
* [FFI][ABI] Better String and Nested Container handling (apache#18311)
  This PR improves the overall String/Bytes and nested container handling. It also fixes a bug in temp object recycling.
  - Introduce formal API for string/bytes creation
  - Updates the tuple/dict conversion to also preserve the torch stream
    - So if a function takes a list of torch.Tensor, the torch stream will be set up in context
  - Optimizes recursive argument conversion by moving most logic into c++
* [FFI][REFACTOR] Update TVM_FFI_STATIC_INIT_BLOCK to fn style (apache#18312)
  This PR updates TVM_FFI_STATIC_INIT_BLOCK to function style. Now the code is written as follows, which is cleaner in general and also helps error reporting locate the right place.
  ```
  TVM_FFI_STATIC_INIT_BLOCK() { RegisterStaffs(); }
  ```
* [REFACTOR][FFI] Split tvm-ffi into a separate repo (apache#18314)
  This PR updates the code so we split tvm-ffi into a separate repo
* [FlashInfer] Update include path and interface (apache#18317)
  This PR updates the include path for FlashInfer JIT compilation, and also updates the plan function interface for attention prefill computation, to align with a recent interface change in flashinfer-ai/flashinfer#1661.
* [3rdparty] Remove dlpack/libbacktrace from 3rdparty (apache#18318)
  This PR removes the TVM dependency on dlpack and libbacktrace, as tvm-ffi has been separated out to https://github.com/apache/tvm-ffi.
* [TVMScript] Support continue and break in tvmscript (apache#17804)
  * support continue and break in tvmscript
  * fix black format
  * fix pylint issue
  * Update tests/python/tvmscript/test_tvmscript_syntax_sugar.py
  * add printer/parser test, fix lint
  * Fit to latest ffi update
  * Skip i386 numpy-related test
  * Introduce AnnotateIrregularLoop before any lowering loop expansions.
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [Relax] Introduce R.call_py_func operator for calling Python functions from Relax IR (apache#18313)
  This PR allows calling Python functions directly from Relax IR, enabling integration between Relax computations and Python/PyTorch operations.
  ### Usage Example
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def pytorch_add(self, x, y):
          return x + y

      @R.function
      def compute(x: R.Tensor((5,), "float32"), y: R.Tensor((5,), "float32")) -> R.Tensor((5,), "float32"):
          result = R.call_py_func("pytorch_add", (x, y), out_sinfo=R.Tensor((5,), "float32"))
          return result
  ```
* [TIR][CUDA] Preserve float precision in codegen with hexfloat output (apache#18320)
  Previously, `float` constants in codegen were always emitted in **scientific decimal format**, e.g.:
  ```cpp
  bfloat16_t(3.487723e-05f);
  ```
  This could introduce slight **rounding differences** compared to the actual binary representation, since the constant is printed and then re-parsed in decimal. We now emit the value in **hexadecimal floating-point format** (`std::hexfloat`) to preserve the exact binary value, and additionally include the decimal form as a comment for readability:
  ```cpp
  bfloat16_t(0x1.2492492492492p-15f /*3.487723e-05*/)
  ```
* [BugFix] Fixing binding for bert (apache#18324)
  * Fixing binding for bert
  * Fixing names
* [TIR] Add support for conditional expressions in TVMScript (apache#18323)
  This PR adds support for conditional expressions in the TVMScript parser, which allows developers to use Python-style conditional expressions:
  ```python
  @T.prim_func
  def func(A: T.buffer((128, 128), "float32")):
      for i, j in T.grid(128, 128):
          A[i, j] = i if i < j else j

  @T.prim_func
  def expected(A: T.buffer((128, 128), "float32")):
      for i, j in T.grid(128, 128):
          A[i, j] = T.if_then_else(i < j, i, j)
  ```
* Fixing datatype error for gpt-2 (apache#18328)
* [CMake][Web] Install `web/` directory in cmake for Python package (apache#18327)
  This PR updates the CMakeLists to install the web subdirectory when building the Python package, so that people do not need to clone the TVM source code to build the web package.
* [Relax][Backend] Implement R.call_py_func operator for calling Python functions from compiled TVM (apache#18326)
  This PR implements the `R.call_py_func` operator that allows compiled TVM Relax modules to call Python functions at runtime. This enables integration between TVM's compiled code and Python through a robust VM backend implementation.
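The hexfloat change above relies on a property that Python exposes directly: the hexadecimal floating-point form round-trips to the exact binary value, while a short decimal rendering may not. A quick demonstration:

```python
x = 3.487723e-05

# The hexadecimal form encodes the exact bit pattern of the double,
# so parsing it back recovers the identical value.
hex_form = x.hex()
assert float.fromhex(hex_form) == x  # exact round-trip

# Re-parsing a truncated decimal rendering generally does not
# recover the original value.
decimal_form = "%.3e" % x            # '3.488e-05'
assert float(decimal_form) != x
```

This is the same reason `std::hexfloat` output is lossless in the C++ codegen: hex digits map one-to-one onto mantissa bits, so no decimal rounding step is involved.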
  #### Simple Usage with BasePyModule
  ```python
  @I.ir_module
  class MyModule(BasePyModule):
      @I.pyfunc
      def torch_relu(self, x):
          return torch.relu(x)

      @R.function
      def forward(x: R.Tensor((10,), "float32")) -> R.Tensor((10,), "float32"):
          return R.call_py_func("torch_relu", (x,), out_sinfo=R.Tensor((10,), "float32"))
  ```
  #### Direct VM Backend Usage (Manual)
  ```python
  # Manually register Python function with VM backend
  register_func = tvm.get_global_func("vm.builtin.register_py_func")
  register_func("my_func", my_python_function)

  # Use in Relax function (compiled to VM backend)
  @R.function
  def test(x: R.Tensor((5,), "float32")) -> R.Tensor((5,), "float32"):
      return R.call_py_func("my_func", (x,), out_sinfo=R.Tensor((5,), "float32"))

  # Manual cleanup (required for direct VM backend usage)
  clear_func = tvm.get_global_func("vm.builtin.clear_py_func_registry")
  clear_func()
  ```
* [flashinfer] Support directing JIT to FlashInfer GroupedGemm kernels (apache#18325)
  In tvm/python/tvm/relax/backend/cuda/flashinfer.py, added a `gen_grouped_gemm_module`. In tvm/tests/python/relax/test_group_gemm_flashinfer.py, added tests for different combinations of:
  - input and output types: ("float8_e4m3fn", "float8_e4m3fn", "bfloat16"), ("float8_e4m3fn", "float8_e4m3fn", "float16"),
  - scale granularity of m, n, k: (1, 128, 128),
  - scale major mode: "MN", "K"
  - mma_sm: 1, 2
  - different batch sizes and m_sizes
* [Relax][Frontend][ONNX] Error converting operator Expand: TVMError: broadcast_to expects the input tensor shape is broadcastable to the target shape (apache#18329)
* [Relax] Operator and RoPE support for Llama4 (apache#18336)
  Added Llama4 implementation, new RoPE implementation
* Fix conflict parameter name promote_dtye in FP8ComputeLegalize (apache#18334)
* [FFI][ABI] Bump version ffi to latest (apache#18332)
  This PR bumps the version of tvm-ffi to latest, which involves an ABI change.
* [Relax][PyTorch] Support MatrixMultiply op for ExportedProgram importer (apache#18343)
  This PR supports `mm.default` for the ExportedProgram importer. Resolves issue apache#18339.
* [Relax] Update BasePyModule with faster DLPack converter for tensor conversion (apache#18331)
  This PR enhances `BasePyModule` by integrating a faster DLPack converter for efficient tensor conversion between TVM and PyTorch, following apache#18306.
* [TIR] Support sequence comparisons in TVMScript (apache#18341)
  Implement proper parsing and evaluation of chained comparison operators (e.g., `0 < i < 128`) in TVMScript. The sequence comparisons are now correctly expanded to their logical equivalents (e.g., `(0 < i and i < 128)`).
  Changes:
  - Updated expression evaluator to handle sequence comparisons correctly
  - Added test case to verify sequence comparison functionality
* [FFI][ABI] Bump tvm-ffi version to reflect RC ABI Update (apache#18345)
  This PR bumps the tvm-ffi version. The latest version contains a change to the RC ABI that also needs a web runtime update.
* [Python] Add library lookup path for tvm installed as a package (apache#18348)
* [FFI][ABI] Bump tvm-ffi to latest (apache#18349)
  This PR bumps tvm-ffi to latest, which introduces ShapeView and minimizes the TensorObj ABI.
* [Relax][Frontend][Torch] Fix parsing error when input dimension of unbind is 1 (apache#18351)
* [Fix] Update ShapeView use in nccl.cc (apache#18352)
  This PR fixes the use of ShapeView in nccl.cc, which was using `Shape()->Product()`. This has been changed to `Shape().Product()` with the introduction of ShapeView.
* [Relax][PyTorch] Support lstm op for ExportedProgram importer (apache#18346)
  This PR supports `lstm.input` for the ExportedProgram importer.
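The chained-comparison expansion described in apache#18341 can be mimicked with Python's own `ast` module (requires Python 3.9+ for `ast.unparse`). This is a standalone sketch of the transformation, not the TVMScript parser's actual implementation:

```python
import ast

OP_TEXT = {ast.Lt: "<", ast.LtE: "<=", ast.Gt: ">", ast.GtE: ">=",
           ast.Eq: "==", ast.NotEq: "!="}

def expand_chained(expr: str) -> str:
    # Expand 'a < b < c' into '(a < b) and (b < c)': each adjacent
    # operand pair becomes one comparison, joined by 'and'.
    node = ast.parse(expr, mode="eval").body
    assert isinstance(node, ast.Compare)
    operands = [node.left, *node.comparators]
    pairs = [
        f"({ast.unparse(lhs)} {OP_TEXT[type(op)]} {ast.unparse(rhs)})"
        for lhs, op, rhs in zip(operands, node.ops, operands[1:])
    ]
    return " and ".join(pairs)

print(expand_chained("0 < i < 128"))  # (0 < i) and (i < 128)
```

Note that the shared operand (`i` above) appears in two comparisons after expansion, which is exactly why a naive left-to-right binary evaluation of `Compare` nodes gives the wrong result.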
  This links to issue apache#18340
* [Relax][ONNX] Support AllClassNMS Operator for ONNX Frontend (apache#18321)
  Following apache#18175, this PR supports the AllClassNMS Operator for the ONNX Frontend
* [FFI][ABI] Bump tvm-ffi to latest (apache#18354)
  This PR bumps the tvm-ffi module to latest
* [CUDA] Update FlashInfer JIT integration (apache#18353)
  Following the recent JIT refactor in FlashInfer that uses TVM FFI as the JIT interface, this PR updates the JIT integration of FlashInfer in TVM. Major changes:
  * we leverage FlashInfer's `JitSpec.build_and_load` to compile all the JIT-generated source files, and remove the compilation logic in TVM.
  * for efficient tensor buffer management and efficient pointer calculation, we enforced all `byte_offset` fields of auxiliary tensors in the KV cache to be zeros. The byte offset is now directly applied to the data pointers.
  * we also add a new parameter to FlashInfer JIT that controls whether to return a linked shared library, or a list of compiled object paths. For unit tests, returning a shared library is convenient and preferred, while for cases such as MLC model compilation, object files are needed to serialize the compiled model.
* rebase && update tvm-ffi * fix(kvcache): restore GPU support lost during rebase --------- Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> Co-authored-by: Phoslight <136557413+Phoslight@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: wrongtest <wrongtest0@gmail.com> Co-authored-by: wrongtest <wrongtest@gmail.com> Co-authored-by: chenxinli <39092231+cccxinli@users.noreply.github.com> Co-authored-by: Park Woorak <grf0503@gmail.com> Co-authored-by: ysh329 <ysh329@users.noreply.github.com> Co-authored-by: Yong Wu <yongcale@gmail.com> Co-authored-by: Chenfan <jcf94@outlook.com> Co-authored-by: Balint Cristian <cristian.balint@gmail.com> Co-authored-by: zyl_keep_moving <dream20151224@163.com> Co-authored-by: Wei Wang <w10493wang@163.com> Co-authored-by: Bohan Hou <bohanhou@andrew.cmu.edu> Co-authored-by: Marcel Dudek <43888122+MarcelDudek@users.noreply.github.com> Co-authored-by: Shushi Hong <820958424@qq.com> Co-authored-by: Johnny <johnnync13@gmail.com> Co-authored-by: Henry Hsieh <72457607+Henryshsieh@users.noreply.github.com> Co-authored-by: Yaoyao Ding <dingyaoyao.cs@gmail.com> Co-authored-by: Yaxing Cai <caiyaxing666@gmail.com> Co-authored-by: Yichen Yan <oraluben@outlook.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by: Thais Camacho <thaiscamachoo@gmail.com> Co-authored-by: Siyuan Feng <25500082+Hzfengsy@users.noreply.github.com> Co-authored-by: Anrui(Henry) Liu <98249030+neurusL@users.noreply.github.com> Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw> Co-authored-by: Pranav Venkatram <56809863+giterator@users.noreply.github.com> Co-authored-by: Qingchao Shen <qingchaoshen@outlook.com> Co-authored-by: Masahiro Hiramori <Hiramori.Masahiro@ct.MitsubishiElectric.co.jp> Co-authored-by: Ruxiao Yin <78540598+eaten-cake@users.noreply.github.com> Co-authored-by: cwx <you@example.com>
Add a Metal backend via TVM's Metal codegen and torch's `torch.mps.compile_shader`, and create a new `torch` execution backend, which only supports MPS for now. Some tasks are left for future work.

Performance numbers:
- PyTorch's MPS backend uses `matrixMultiplicationWithPrimaryTensor` by default, which is faster than the default implementation.
- Against PyTorch's native MPS mm implementation (enabled with `PYTORCH_MPS_PREFER_METAL=1`), we observe a significant speedup.

In general, tilelang is 2-3x faster than the torch native implementation and 2-3x slower than the MPSGraph implementation. The tilelang implementation could be further optimized by implementing `T.gemm` via `matrixMultiplicationWithPrimaryTensor`. Outliers in the table below appear to be errors and disappear after a rerun.

script: https://gist.github.com/oraluben/89e54b0b687a1fa483c38429442a6d1e
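The launch path described above (TVM emits Metal source, and the `torch` backend compiles it on first call) can be sketched roughly as follows. This is an illustrative assumption, not tilelang's actual adapter code: the kernel source and the `add_one` name are invented for the example, and `torch.mps.compile_shader` is a prototype API in recent PyTorch releases, so the launch only runs on a machine with an MPS device.

```python
import torch

# Hypothetical Metal source standing in for what TVM's Metal codegen emits.
METAL_SRC = """
#include <metal_stdlib>
using namespace metal;

kernel void add_one(device float *out [[buffer(0)]],
                    device const float *inp [[buffer(1)]],
                    uint idx [[thread_position_in_grid]]) {
    out[idx] = inp[idx] + 1.0f;
}
"""

def run():
    """Compile the shader and launch it on MPS tensors, if MPS is available."""
    if not torch.backends.mps.is_available():
        print("MPS not available; skipping launch")
        return None
    # In the real adapter the compiled library is cached after the first call.
    lib = torch.mps.compile_shader(METAL_SRC)
    inp = torch.arange(8, dtype=torch.float32, device="mps")
    out = torch.empty_like(inp)
    lib.add_one(out, inp)  # dispatched with one thread per element (assumption)
    return out.cpu()

if __name__ == "__main__":
    print(run())
```

On non-Apple hardware the sketch degrades to a no-op, which mirrors why the PR restricts the `torch` execution backend to MPS for now.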
TODOs:
- Support the `torch.mps` device and unify cuda with `torch.accelerator` (or should we use cuda by default and do this later?)

Future work:
- Waiting for pytorch/pytorch#163721 ([MPS] Extension segfault after #152210, affects >= 2.8)
- (performance) Dispatch to MPSGraph (e.g. mm) for supported ops for better performance
- Implement `T.gemm` via `matrixMultiplicationWithPrimaryTensor`
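The TODO about unifying cuda with `torch.accelerator` refers to PyTorch's device-agnostic accelerator API (available since roughly PyTorch 2.6). A hedged sketch of how a backend could select a device without hardcoding cuda or mps; tilelang's actual device selection may differ:

```python
import torch

def pick_device() -> torch.device:
    """Prefer whatever accelerator PyTorch detects (cuda, mps, xpu, ...),
    falling back to CPU. Illustrative sketch only."""
    # torch.accelerator abstracts over backend-specific availability checks.
    if hasattr(torch, "accelerator") and torch.accelerator.is_available():
        return torch.accelerator.current_accelerator()
    return torch.device("cpu")

if __name__ == "__main__":
    print(pick_device())
```

With such a helper, the same call site would pick CUDA on NVIDIA machines and MPS on Apple silicon, which is the unification the TODO asks about.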