CUDA GEMM optimization tutorial and mini inference engine
From naive matrix multiplication to ~85% cuBLAS-class performance on the reference benchmark
English · 简体中文 · Online Docs · Quick Start
Mini-Inference Engine is a compact CUDA/C++17 project for learning high-performance GEMM optimization in a realistic inference-engine setting. It keeps the scope intentionally small: matrix multiplication kernels, runtime utilities, benchmarks, tests, and bilingual documentation all live in one traceable codebase.
Core areas:
| Area | What to inspect |
|---|---|
| GEMM kernels | src/naive_matmul.cu through src/vectorized_gemm.cu show the optimization path. |
| Runtime components | include/tensor.h, include/inference_engine.h, include/memory_pool.h, and include/stream_manager.h. |
| Benchmarks | benchmarks/benchmark.cpp, benchmarks/detailed_benchmark.cu, and benchmarks/mnist_demo.cpp. |
| Specs | openspec/specs/ defines requirements, architecture, API, data, and testing expectations. |
| Documentation | docs/en/ and docs/zh/ provide the tutorial, architecture, API, and tuning guides. |
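As a reference point for that optimization path, here is a minimal sketch of the kind of naive kernel `src/naive_matmul.cu` begins with (the actual signature and launch configuration in the repository may differ):

```cuda
// Naive GEMM: C = A * B for row-major M x K and K x N matrices.
// One thread computes one output element; every operand is re-read
// from global memory on each iteration, which is exactly what the
// later tiled and vectorized kernels optimize away.
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```

Later kernels in the series typically attack the bottlenecks visible here, chiefly the redundant global-memory traffic and the low arithmetic intensity per load.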
The headline performance number is hardware-specific. The project uses a conservative reference claim: the best optimized kernel reaches about 85% of cuBLAS-class throughput on the documented RTX 3080 1024×1024 benchmark.
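To sanity-check a claim like this on your own hardware, one common approach is to time cuBLAS with CUDA events and convert the elapsed time to throughput; below is a minimal standalone sketch (not the repository's benchmark harness, which may warm up and average many runs):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Time one cuBLAS SGEMM on the 1024x1024x1024 problem. A GEMM of
// size M*N*K performs 2*M*N*K floating-point operations, so
// GFLOP/s = 2*M*N*K / (elapsed_ms * 1e6).
int main() {
    const int N = 1024;
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // A real benchmark would warm up first and average many launches.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cuBLAS SGEMM: %.2f GFLOP/s\n", 2.0 * N * N * N / (ms * 1e6));

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Running your optimized kernel through the same timing path and dividing the two GFLOP/s figures gives the percentage-of-cuBLAS number quoted above.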
Requirements: CUDA Toolkit 11.0+, CMake 3.18+, a C++17 compiler, and an NVIDIA GPU with compute capability 7.0+.
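If you are unsure whether your GPU meets the compute-capability requirement, you can query it with a small standalone program (a sketch, not part of this repository):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Print the compute capability of each visible CUDA device;
// this project requires major.minor >= 7.0.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```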
```bash
git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine
cmake --preset default
cmake --build --preset default
ctest --preset default --output-on-failure
```

To benchmark, build and run the release preset:

```bash
cmake --preset release
cmake --build --preset release
./build-release/benchmark
```

GPU tests skip when no CUDA device is available, but building still requires a CUDA toolkit because the library is compiled as a CUDA project.
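The skip behavior can be implemented by probing for a device when a test runs; here is a minimal GoogleTest-style sketch, assuming the repository's tests use a similar helper (the actual fixture may differ):

```cpp
#include <cuda_runtime.h>
#include <gtest/gtest.h>

// Returns true if at least one CUDA device is visible.
static bool hasCudaDevice() {
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}

TEST(GpuSuite, KernelSmokeTest) {
    if (!hasCudaDevice()) {
        GTEST_SKIP() << "No CUDA device available";
    }
    // ... launch kernels against a real device here ...
}
```

Skipping rather than failing keeps CPU-only CI runners green while still exercising the GPU path wherever a device exists.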
| Topic | English | 中文 |
|---|---|---|
| Quick Start | docs/en/QUICK_START.md | docs/zh/QUICK_START.md |
| Architecture | docs/en/ARCHITECTURE.md | docs/zh/ARCHITECTURE.md |
| GEMM Optimization | docs/en/GEMM_OPTIMIZATION.md | docs/zh/GEMM_OPTIMIZATION.md |
| Performance Tuning | docs/en/PERFORMANCE_TUNING.md | docs/zh/PERFORMANCE_TUNING.md |
| API Reference | docs/en/API_REFERENCE.md | docs/zh/API_REFERENCE.md |
| Development Guide | docs/en/CONTRIBUTING.md | docs/zh/CONTRIBUTING.md |
- Source of truth: `openspec/specs/**`.
- Build system: explicit source lists in `CMakeLists.txt`; do not use recursive globbing for source files (see the sketch after this list).
- Formatting: `.clang-format` with Google-based 4-space style.
- Tests: `tests_host` covers utilities that do not require a GPU device; `tests_gpu` covers CUDA runtime/kernel behavior. Configuring and compiling the project still requires a CUDA Toolkit.
- Branching: keep `master` as the only long-lived branch; use short-lived branches/worktrees for changes and delete them after merge.
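For illustration, the explicit-source-list rule might look like this in practice (a hedged sketch; the target name and full file list are assumptions, not copied from the repository's `CMakeLists.txt`):

```cmake
# Sketch only: target name and exact file list are illustrative.
add_library(mini_inference_engine STATIC
    src/naive_matmul.cu
    src/vectorized_gemm.cu
    # ... remaining kernels and runtime sources listed explicitly ...
)
# No file(GLOB_RECURSE ...) here: new sources must be added by hand,
# which keeps the build graph explicit and reviewable.
```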
See AGENTS.md for the full project-specific AI and engineering workflow.