Skip to content

14NGiestas/mfi

Repository files navigation

MFI

Modern Fortran interfaces to BLAS and LAPACK

MFI provides generic, type-agnostic wrappers around BLAS and LAPACK routines. Instead of writing type-specific calls with dozens of arguments, you write one call that works for real32, real64, complex(real32), and complex(real64).

Example: $C = A \cdot B$

program main
    use mfi_blas, only: mfi_gemm
    implicit none
    real :: A(4,4), B(4,4), C(4,4)
    ! ... fill A and B ...
    call mfi_gemm(A, B, C)   ! That's it. No leading dims, no m/n/k, no alpha/beta.
end program

Quick Start

Recommended: Nix Flake (zero config)

git clone https://github.com/14NGiestas/mfi.git
cd mfi
nix develop          # cpu-only shell with gfortran, fpm, fypp, BLAS, LAPACK
nix develop .#gpu-modern   # with CUDA 12.3
nix develop .#gpu-legacy   # with CUDA 11.8
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Requires Nix with flakes enabled.

Manual setup

Tool Minimum version
fpm ≥ 0.13.0
fypp any
Fortran compiler gfortran 12+ (recommended)
pip install fypp

Install BLAS and LAPACK from your package manager:

Distro Package
Arch openblas-lapack-static (AUR)
Ubuntu/Debian libblas-dev liblapack-dev
Fedora openblas-devel lapack-devel

Build & Test

git clone https://github.com/14NGiestas/mfi.git
cd mfi
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Using MFI as a Dependency

Add to your project's fpm.toml:

# CPU-only (stable)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm" }

That's all — fpm handles the rest. No make needed in your own project.


GPU Acceleration with cuBLAS

MFI can transparently dispatch BLAS calls to cuBLAS when compiled with the cublas feature. The same mfi_gemm, mfi_gemv, etc. calls run on the GPU without code changes.

Try it in your browser:

Open In Colab

Local build with cuBLAS

make
fpm build --profile cublas
fpm test --profile cublas

Runtime CPU / GPU switching

MFI uses lazy initialization — no setup code is needed. When compiled with the cublas feature, GPU dispatching is controlled entirely by the MFI_USE_CUBLAS environment variable:

# CPU (default)
./build/app/app

# GPU
MFI_USE_CUBLAS=1 ./build/app/app

The same call mfi_gemm(A, B, C) runs on CPU or GPU without any code changes.

For OpenMP-parallel programs, also set OMP_NUM_THREADS to pre-allocate per-thread cuBLAS handles:

MFI_USE_CUBLAS=1 OMP_NUM_THREADS=8 ./build/app/app

Manual CPU/GPU switching (advanced)

If you need fine-grained control within a single program (e.g., run most computations on GPU but force a specific call to CPU), use mfi_force_gpu / mfi_force_cpu:

call mfi_gemm(A, B, C)       ! CPU (default)

call mfi_force_gpu
call mfi_gemm(D, E, F)       ! GPU
call mfi_force_cpu

call mfi_gemm(G, H, I)       ! CPU again

Note: When compiled without the cublas feature, mfi_force_gpu and mfi_force_cpu are no-op stubs — your code compiles and runs normally on CPU without any #ifdef changes. Simply recompile with --profile cublas to activate GPU acceleration.

Clean shutdown (optional)

Call mfi_cublas_finalize() at program end to release GPU resources. The OS cleans up on exit anyway.


Troubleshooting

Problem Solution
CUBLAS_STATUS_NOT_INITIALIZED cuBLAS handle not created. Set MFI_USE_CUBLAS=1 or call mfi_force_gpu before the first BLAS call.
cuda_runtime.h not found CUDA Toolkit is not installed or not in your include path. See gpu_test.ipynb for a working Colab setup.
i?amin symbols missing Your BLAS provider lacks extensions. Use the default profile (without MFI_LINK_EXTERNAL) or switch to OpenBLAS.
Tests fail on CPU build Known pre-existing failures: cunmrq, sorg2r, sorgr2, cungr2, cung2r, sormrq, heevx (segfault).

Interface Levels

MFI exposes four interface levels for BLAS, from bare-metal to fully modern:

Level Example Arguments
Raw F77 call cgemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N) 13
Improved F77 call f77_gemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N) 13 (no c/d/s/z prefix)
MFI typed call mfi_sgemm(A, B, C) 3 (type-specific)
MFI generic call mfi_gemm(A, B, C) 3 (type-agnostic)

For full API documentation, see the generated reference.


Supported Routines

BLAS

Level 1

Click to expand
Status Name Description
👍 asum Sum of vector magnitudes
👍 axpy Scalar-vector product
👍 copy Copy vector
👍 dot Dot product
👍 dotc Dot product conjugated
👍 dotu Dot product unconjugated
f77 sdsdot Extended precision inner product
f77 dsdot Extended precision inner product with double result
👍 nrm2 Vector 2-norm (Euclidean norm)
👍 rot Plane rotation
👍 rotg Generate Givens rotation
👍 rotm Modified Givens rotation
👍 rotmg Generate modified Givens rotation
👍 scal Vector-scalar product
👍 swap Vector-vector swap

Level 1 — Extensions

Click to expand
Status Name Description
👍 iamax Index of maximum absolute value element
👍 iamin Index of minimum absolute value element
👍 lamch Machine precision parameters

Level 2

Click to expand
Status Name Description
👍 gbmv Matrix-vector product (general band)
👍 gemv Matrix-vector product (general)
👍 ger Rank-1 update (general)
👍 gerc Rank-1 update (general, conjugated)
👍 geru Rank-1 update (general, unconjugated)
👍 hbmv Matrix-vector product (Hermitian band)
👍 hemv Matrix-vector product (Hermitian)
👍 her Rank-1 update (Hermitian)
👍 her2 Rank-2 update (Hermitian)
👍 hpmv Matrix-vector product (Hermitian packed)
👍 hpr Rank-1 update (Hermitian packed)
👍 hpr2 Rank-2 update (Hermitian packed)
👍 sbmv Matrix-vector product (symmetric band)
👍 spmv Matrix-vector product (symmetric packed)
👍 spr Rank-1 update (symmetric packed)
👍 spr2 Rank-2 update (symmetric packed)
👍 symv Matrix-vector product (symmetric)
👍 syr Rank-1 update (symmetric)
👍 syr2 Rank-2 update (symmetric)
👍 tbmv Matrix-vector product (triangular band)
👍 tbsv Solve (triangular band)
👍 tpmv Matrix-vector product (triangular packed)
👍 tpsv Solve (triangular packed)
👍 trmv Matrix-vector product (triangular)
👍 trsv Solve (triangular)

Level 3

Click to expand
Status GPU Name Description
👍 gemm General matrix-matrix product
👍 hemm Hermitian × general matrix product
👍 herk Hermitian rank-k update
👍 her2k Hermitian rank-2k update
👍 symm Symmetric × general matrix product
👍 syrk Symmetric rank-k update
👍 syr2k Symmetric rank-2k update
👍 trmm Triangular × general matrix product
👍 trsm Solve with triangular matrix

LAPACK

LAPACK coverage is growing — routines are implemented as needed.

Factorization and Solve

Click to expand
Status Name Description
👍 geqrf QR factorization
👍 gerqf RQ factorization
👍 getrf LU factorization
👍 getri Matrix inverse (from LU)
👍 getrs Solve with LU-factored matrix
👍 hetrf Bunch-Kaufman factorization (Hermitian)
👍 pocon Condition number estimate (Cholesky)
👍 potrf Cholesky factorization
👍 potri Matrix inverse (from Cholesky)
👍 potrs Solve with Cholesky-factored matrix
👍 sytrf Bunch-Kaufman factorization (symmetric)
👍 trtrs Solve with triangular matrix

Orthogonal / Unitary Factors

Click to expand
Status Name Description
👍 orgqr Generate Q from QR (real)
👍 orgrq Generate Q from RQ (real)
👍 ormqr Multiply by Q from QR (real)
f77 ormrq Multiply by Q from RQ (real)
👍 org2r Generate Q from QR2 (real)
👍 orm2r Multiply by Q from QR2 (real)
👍 orgr2 Generate Q from RQ2 (real)
👍 ormr2 Multiply by Q from RQ2 (real)
👍 ungqr Generate Q from QR (complex)
👍 ungrq Generate Q from RQ (complex)
👍 unmqr Multiply by Q from QR (complex)
f77 unmrq Multiply by Q from RQ (complex)
👍 ung2r Generate Q from QR2 (complex)
👍 unm2r Multiply by Q from QR2 (complex)
👍 ungr2 Generate Q from RQ2 (complex)
👍 unmr2 Multiply by Q from RQ2 (complex)

Eigenvalues and SVD

Click to expand
Status Name Description
👍 gesvd Singular value decomposition
👍 heevd Hermitian eigenvalues (divide & conquer)
👍 hegvd Generalized Hermitian eigenproblem (divide & conquer)
👍 heevr Hermitian eigenvalues (relatively robust)
f77 heevx Hermitian eigenvalues (expert)

Least Squares

Click to expand
Status Name Description
f77 gels Least squares (QR/LQ)
f77 gelst Least squares (QR/LQ, T matrix)
f77 gelss Least squares (SVD, QR iteration)
f77 gelsd Least squares (SVD, divide & conquer)
f77 gelsy Least squares (complete orthogonal)
f77 getsls Least squares (tall-skinny QR/LQ)
f77 gglse Equality-constrained least squares
f77 ggglm Gauss-Markov linear model

Auxiliary

Name Types Description
mfi_lartg s, d, c, z Generate plane rotation

Continuous Integration

CI uses Nix flakes with magic-nix-cache-action for fast, reproducible builds.

Event Behavior
Push to main Full test matrix + deploy to mfi-fpm
Push to impl/cublas Full test matrix + deploy to mfi-cublas
PR to main Full test matrix
Manual dispatch Full test matrix