GateControl: Efficient and Flexible Controllable Generation for Linear-Attention Diffusion Models
Accepted to CVPR 2026
This repository contains the official PyTorch implementation of GateControl, an efficient and flexible controllable generation framework tailored for linear-attention diffusion backbones (such as SANA).
Beyond providing a lightweight tool for on-device deployment, the core contribution of this work is a deeper insight. Prior work commonly holds that naïve additive fusion of conditional features breaks down on non-spatially aligned tasks (such as subject-driven generation). Our findings challenge this assumption: with our proposed token-wise gated modulation, simple additive fusion remains highly robust for subject-driven control while also dramatically accelerating convergence on spatial tasks.
Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, in this paper we explore controllable diffusion models built upon linear-attention architectures, which offer superior scalability and efficiency even on edge devices. Yet our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models.
To address these limitations, we propose a novel controllable diffusion framework tailored for linear-attention backbones like SANA. The core of our method is a unified gated conditioning module operating in a dual-path pipeline, which effectively integrates multi-type conditional inputs, covering both spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance among linear-attention models, surpassing existing methods in fidelity and controllability.
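For intuition, here is a minimal PyTorch sketch of token-wise gated additive fusion. It illustrates the idea only and is not the repository's implementation; the module and variable names are our own:

```python
import torch
import torch.nn as nn

class TokenWiseGate(nn.Module):
    """Minimal sketch (not the repository code): condition features are added
    to the hidden states, scaled by a learned token-wise gate."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict one scalar gate per token from the condition feature.
        self.to_gate = nn.Linear(dim, 1)
        # Zero-init so the gate starts at tanh(0) = 0 and the module is an
        # identity mapping at the beginning of training.
        nn.init.zeros_(self.to_gate.weight)
        nn.init.zeros_(self.to_gate.bias)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden, cond: (batch, num_tokens, dim)
        gate = torch.tanh(self.to_gate(cond))  # (batch, num_tokens, 1)
        return hidden + gate * cond            # gated additive fusion
```

Because the fusion is purely additive, the same module applies whether or not the condition tokens are spatially aligned with the image tokens.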
- Unified Control Mechanism: Challenges the common assumption that additive fusion breaks down on non-spatial tasks. Our token-wise gated modulation unifies robust control across both spatially aligned (Canny, Coloring, Deblurring, Depth, HED) and spatially unaligned (subject-driven) conditions.
- Dramatic Convergence Acceleration: Radically accelerates optimization on spatial tasks, e.g., reaching convergence in ~1k steps instead of 10k under strict apples-to-apples comparisons.
- Extreme Efficiency (~0.09M Params): The design yields an exceptionally lightweight control module requiring only ~0.09M additional parameters, incurring negligible computational overhead (see the parameter-count sketch after this list).
- Tailored for Linear Attention (e.g., SANA): By operating as a learnable gate, the module effectively mitigates the information compression inherent in linear-attention interactions, making it well suited to robust on-device generation. Notably, the insight is not a backbone-specific engineering fix: it yields consistent benefits even under standard softmax attention.
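To make the parameter footprint concrete, here is a quick count reusing the `TokenWiseGate` sketch above. The width and depth below are illustrative placeholders, not SANA's actual configuration:

```python
# Illustrative width/depth only; not SANA's actual configuration.
dim, num_blocks = 2240, 40
gates = [TokenWiseGate(dim) for _ in range(num_blocks)]
total = sum(p.numel() for g in gates for p in g.parameters())
print(f"{total / 1e6:.2f}M trainable parameters")  # ~0.09M at these sizes
```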
To train GateControl, adjust the parameters in `train_scripts/train_sana_gatecontrol.sh`:
```bash
export HF_HOME='./cache'
export XDG_CACHE_HOME='./cache'

CONDITION_TYPE="COLORING"  # Choices: SUBJECT, CANNY, COLORING, DEBLURRING, DEPTH, HED
RESOLUTION=512             # e.g., 512 or 1024
MODEL_NAME="/path/to/SANA_ckpt"
DATASET_PATH="/path/to/dataset"

accelerate launch --config_file ./accelerate_config.json train_sana_gatecontrol.py ...
```

Use our decoupled inference module to test image generation locally:
```python
import torch
from generate.pipeline import sana_pipeline_gatecontrol
from model.sana_gatecontrol import SanaTransformer2DModelGateControl

# Implement your conditioned mappings via the provided pipeline abstraction.
```
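The pipeline's full signature is not shown above, so the following end-to-end call is a hypothetical sketch; argument names such as `gatecontrol_path` and `condition_image` are illustrative and may differ from the actual API:

```python
import torch
from PIL import Image

from generate.pipeline import sana_pipeline_gatecontrol

# Hypothetical sketch: every argument name below is illustrative and may
# differ from the actual pipeline signature.
pipe = sana_pipeline_gatecontrol(
    model_path="/path/to/SANA_ckpt",
    gatecontrol_path="/path/to/gatecontrol_ckpt",
    condition_type="COLORING",
    device="cuda",
    dtype=torch.float16,
)
image = pipe(
    prompt="a red vintage car on a coastal road",
    condition_image=Image.open("grayscale_input.png"),
    num_inference_steps=20,
    guidance_scale=4.5,
)
image.save("output.png")
```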
If you find our work useful in your research or project, please consider citing:

```bibtex
@article{liu2026gated,
  title={Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers},
  author={Liu, Yuhe and Tan, Zhenxiong and Hu, Yujia and Liu, Songhua and Wang, Xinchao},
  journal={arXiv preprint arXiv:2603.27666},
  year={2026}
}
```