Local LLM inference on AMD GPUs and Apple Silicon — no ROCm, no MLX, one binary.
35B parameter model running locally — Zig + Vulkan/Metal, no ROCm, no MLX
| Platform | GPU | Backend | Status |
|---|---|---|---|
| Linux | AMD RDNA4 (RX 9070, AI PRO R9700) | Vulkan | Primary — hand-tuned shaders |
| Linux | AMD RDNA3 (RX 7900 XTX, etc.) | Vulkan | Supported |
| macOS | Apple Silicon (M1, M2, M3, M4, M5) | Metal | Supported — native MSL shaders |
Works the same on Linux (AMD GPU) and macOS (Apple Silicon):
```bash
git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast

# On RDNA4 Linux, enable cooperative matrix (skip on macOS)
export RADV_PERFTEST=coop_matrix

# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check

# See which models fit this machine
./zig-out/bin/zinc model list

# Download a model
./zig-out/bin/zinc model pull qwen3-8b-q4k-m

# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id qwen3-8b-q4k-m --prompt "Hello" --chat

# Or open the chat UI in your browser
./zig-out/bin/zinc chat
```

The server exposes the built-in chat UI at `/` and an OpenAI-compatible API at `/v1`.
- Single-stream CLI inference on the validated models listed below
- OpenAI-compatible `/v1` API with streaming
- Built-in browser chat UI with thinking mode support
- Managed model workflow: `list`, `pull`, `use`, `active`, `rm`
- `zinc chat` — start the server and open the browser in one command
- AMD path: RDNA4-tuned Vulkan shaders (wave64, cooperative matrix, fused ops)
- Apple Silicon path: native Metal shaders (MSL, zero-copy mmap, simdgroup ops)
- Auto-detection: ZINC picks the right backend (Vulkan or Metal) at build time

Known limitations:

- Continuous batching and multi-tenant serving are still roadmap work
- The supported-model list is intentionally narrow
- Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)
Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:
- AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
- Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).
ZINC builds an inference engine tuned for the hardware you actually have.
Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.
One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.
Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.
The list below matches the current managed model catalog, not a broader wishlist.

- Qwen3.5 35B-A3B UD Q4_K_XL — supported on AMD RDNA4 32 GB and Apple Silicon
- Qwen3.6 35B-A3B UD Q4_K_XL — experimental on AMD RDNA4 32 GB and Apple Silicon
- OpenAI GPT-OSS 20B Q4_K_M — supported on Apple Silicon
- Qwen3 8B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
- Gemma 4 31B Q4_K_M — supported on AMD RDNA4 32 GB and Apple Silicon
- Gemma 4 12B (26B-A4B MoE) Q4_K_M — experimental on AMD RDNA4 32 GB and Apple Silicon
- Use `zinc model list --json` for machine-readable model metadata (see the one-liner after this list)
- Current throughput and latency numbers live on the public benchmarks page: zolotukhin.ai/zinc/benchmarks
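
The `--json` output is meant to be scripted against. A minimal sketch, assuming `jq` is installed (the field layout is whatever the current build emits):

```bash
# Pretty-print the managed-model catalog as JSON
./zig-out/bin/zinc model list --json | jq .
```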
Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, Q5_0, MXFP4, F16, F32
| Tool | Install |
|---|---|
| Zig 0.15.2+ | ziglang.org/download |
| Vulkan loader + tools | `apt install libvulkan-dev vulkan-tools` (Linux) or `brew install vulkan-loader vulkan-headers` (macOS) |
| `glslc` (Linux) | `apt install glslc` |
| Bun (tests and docs site) | `curl -fsSL https://bun.sh/install \| bash` |
Important: On Linux with RDNA4, newer glslc releases can cause a large performance regression in the compiled shaders. Use the version from the system package.
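
Before building, it is worth confirming which `glslc` the build will pick up. A quick sketch (the packaged version string varies by distro):

```bash
# Confirm the distro-packaged glslc is first on PATH, not a newer manual install
which glslc
glslc --version
```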

```bash
git clone https://github.com/zolotukhin/zinc.git
cd zinc

# Build the CLI and server
#   macOS: shaders are skipped
#   Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFast
```

The binary is placed in `zig-out/bin/zinc`. Compiled SPIR-V shaders go to `zig-out/share/zinc/shaders/`.

Use `ReleaseFast` for any performance measurement or server deployment. Plain `zig build` is not a fair throughput baseline.

Before your first prompt, run `--check`. The target state is a clean `READY [OK]` run with no warnings.

```bash
# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check

# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check

# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf

# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen35-35b-a3b-q4k-xl
```

`--check` verifies:
- host environment and RDNA4-specific shell hints
- compiled shader assets
- Vulkan device discovery and the selected GPU
- GGUF metadata when you pass `-m /path/to/model.gguf`
- managed-model compatibility when you pass `--model-id <id>`
- estimated single-GPU VRAM fit for the current runtime
If `--check` reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.
The README keeps the supported-model section concise and leaves the full managed-model workflow to the docs.
Use the website docs for model selection, cache management, and API details.
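
As a quick orientation, here is a sketch of the managed-model workflow built from the subcommands listed in the features above; exact flags may differ, so treat the docs as authoritative:

```bash
# Managed-model workflow sketch (subcommand names from the feature list above)
./zig-out/bin/zinc model list                  # see what fits this machine
./zig-out/bin/zinc model pull qwen3-8b-q4k-m   # download into the local cache
./zig-out/bin/zinc model use qwen3-8b-q4k-m    # set it as the active model
./zig-out/bin/zinc model active                # confirm which model is active
./zig-out/bin/zinc chat                        # start the server and open the chat UI
./zig-out/bin/zinc model rm qwen3-8b-q4k-m     # remove it from the cache when done
```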

Run a prompt directly against a GGUF file:

```bash
./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"
```

Start the server — no `--prompt` flag means server mode:

```bash
./zig-out/bin/zinc -m /path/to/model.gguf -p 8080
```

Then open http://localhost:8080/ in your browser for the built-in chat interface.

ZINC exposes an OpenAI-compatible API at `/v1`.
For the actual request examples and SDK usage, use the website docs instead of the README:
- Running ZINC for CLI, server mode, and first-run examples
- Serving HTTP API for `curl`, OpenAI SDK examples, endpoint behavior, and response shapes

The built-in chat UI is served at `/`, the API is under `/v1`, and the health endpoint is `/health`.
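
For a quick local smoke test against a server started on port 8080, here is a minimal `curl` sketch. The chat-completions route and request body assume the standard OpenAI shape; the Serving HTTP API docs are the authoritative reference:

```bash
# Health check
curl -s http://localhost:8080/health

# Chat completion (assumes the standard OpenAI-style /v1/chat/completions route;
# the model id is the managed catalog id used elsewhere in this README)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-8b-q4k-m",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": false
      }'
```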
For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).
Quick start:

```bash
zig build -Doptimize=ReleaseFast   # build
zig build test                     # run all tests
./zig-out/bin/zinc --check         # verify GPU/runtime setup
```

See also: CONTRIBUTING.md · Code of Conduct
The tables below are pulled directly from the latest published artifact at zolotukhin.ai/zinc/benchmarks. Latest refresh: 2026-05-02 (both targets). Numbers are median tok/s across the suite's runs on a fresh boot, with ZINC and llama.cpp on the same hardware, weights, and prompt. The ZINC % columns give ZINC throughput as a percentage of llama.cpp for the same phase (for example, 115.85 / 84.73 ≈ 137%).

AMD RDNA4:

| Model | ZINC prefill | llama.cpp prefill | ZINC % (prefill) | ZINC decode | llama.cpp decode | ZINC % (decode) |
|---|---|---|---|---|---|---|
| Qwen 3 8B (dense) | 115.85 | 84.73 | 137% | 60.52 | 109.59 | 55% |
| Qwen 3.5 35B A3B (MoE+SSM) | 74.56 | 168.13 | 44% | 103.72 | 96.76 | 107% |
| Qwen 3.6 35B A3B (MoE+SSM) | 72.20 | 179.97 | 40% | 99.40 | 104.48 | 95% |
| Gemma 4 26B A4B (MoE) | 56.03 | 334.08 | 17% | 102.82 | 106.81 | 96% |
| Gemma 4 31B (dense) | 44.35 | 135.78 | 33% | 44.53 | 32.17 | 138% |
| GPT-OSS 20B (MoE) | 85.64 | 267.35 | 32% | 83.41 | 181.72 | 46% |

Apple Silicon (M4 Max):

| Model | ZINC prefill | llama.cpp prefill | ZINC decode | llama.cpp decode | ZINC % (decode) |
|---|---|---|---|---|---|
| Qwen 3.5 35B A3B (MoE+SSM) | 29.5 | 124.42 | 46.36 | 77.70 | 60% |
| Qwen 3.6 35B A3B (MoE+SSM) | 25.2 | 130.32 | 38.13 | 79.88 | 48% |
| Gemma 4 12B (MoE) | 27.4 | 279.51 | 43.68 | 92.17 | 47% |
| Qwen 3 8B (dense) | 24.7 | 74.67 | 32.74 | 83.67 | 39% |
| GPT-OSS 20B (MoE) | 20.0 | 177.01 | 25.98 | 119.85 | 22% |
| Gemma 4 31B (dense) | 19.5 | 73.28 | 5.16 | 25.14 | 21% |
- Ahead of llama.cpp: Qwen 3 8B prefill on RDNA4 (1.37x), Qwen 3.5 35B-A3B decode on RDNA4 (1.07x), Gemma 4 31B dense decode on RDNA4 (1.38x).
- Within striking distance (≥90% of llama.cpp): Qwen 3.6 35B-A3B decode on RDNA4 (95%), Gemma 4 26B A4B decode on RDNA4 (96%).
- Active gap: Qwen 3.5/3.6 35B-A3B prefill on RDNA4 sits at ~40% of llama.cpp because the entire batched prefill path is gated off for any model with `n_experts > 0` or `ssm_d_inner > 0`. The wire-up that closes this is documented in the cycle-50 field report.
- In flight: Metal prefill is uniformly bottlenecked (20–30 tok/s) because the per-token Metal path doesn't amortize weight reads across prompt tokens. Metal decode is improving fastest on the larger A3B models (46.4 tok/s on Qwen 3.5 35B-A3B); the Gemma 4 31B decode floor at 5.16 tok/s is the active optimization target.

For local benchmark commands, harnesses, and methodology, see the public benchmarks page at zolotukhin.ai/zinc/benchmarks.
| Component | Status |
|---|---|
| Vulkan infrastructure | Done |
| GGUF parser + model loader | Done |
| GPU detection (RDNA3/4) | Done |
| Native BPE tokenizer (from GGUF) | Done |
| GLSL compute shaders (16) | Done |
| Compute graph + architecture builders | Done |
| Forward pass (decode loop) | Working — 103.72 tok/s on RDNA4 and 46.36 tok/s on Apple M4 Max for Qwen 3.5 35B-A3B |
| Forward pass (prefill loop) | Working — 70+ tok/s on RDNA4 long-context for Qwen 3.5/3.6 35B-A3B; Metal prefill in flight |
| GPU SSM shaders + cmd batching | Done — RDNA decode is 99+ tok/s on Qwen 3.5/3.6 35B and 83 tok/s on GPT-OSS 20B |
| HTTP server + OpenAI API | Done — Qwen 35B-A3B raw API ~100 tok/s on RDNA4 and ~46 tok/s on Apple M4 Max |
| Continuous batching | Phase 4 |
| TurboQuant KV compression | Phase 5 |
Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.
The next push is closing the prefill gap to llama.cpp on hybrid MoE-plus-SSM models:

1. Wire `mul_mm_q4k` into SSM proj prefill — the tiled Q4_K GEMM is in the tree but only routes the language-model head, where N=1 wastes the BN tile. The SSM proj fires 4 DMMVs per layer per token; batching them across the prompt is the deferred cycle-40 refactor.
2. Port the `gated_delta_net.cu` block-resident state pattern — today every prompt token re-reads and re-writes the full 2 MB SSM state per layer. Loading state once per workgroup and walking all tokens inside the kernel collapses 18 GB of state DRAM traffic per prefill to 4 MB.
3. Open `canUseBatchedPrefillRdna` for MoE+SSM hybrids — the entire batched prefill body (`flash_attn_batched`, `rope_batched`, `dmmv_q4k_batch_kpar`) is gated off when `n_experts > 0` or `ssm_d_inner > 0`. Once items 1 and 2 land, dropping the gate activates Br-row attention batching on the same workload.
4. Land the cycle-50 micro-restructure pattern on MoE inner loops — wider threads-per-row plus halved per-thread register slabs lifted `ssm_delta_net` by 2.7%. The same shape change is untried on `dmmv_q4k_moe_kpar` and `dmmv_q4k_moe_fused_down_acc`.
5. Ship batched Metal prefill across the catalog — the Gemma path landed; Qwen 3.5/3.6 and GPT-OSS still route through the per-token Metal path behind the 20–30 tok/s prefill numbers above.
The full plan and the 50-cycle field report are in the cycle-50 blog post.
MIT
