Add Qwen3.5 MoE lite awq #4561
Conversation
Pull request overview
This PR extends lmdeploy.lite AWQ quantization to better support Qwen3/Qwen3.5 MoE variants by adding a no-calibration (“data-free”) workflow, new module-skip options, and utilities to convert fused MoE experts into an unfused ModuleList form for module-wise quantization.
Changes:
- Add `--no-calib-ds-req` (skip calibration) and `--mod-skip-quant` (skip specific modules) to the lite AWQ CLI/API (see the sketch after this list).
- Add MoE expert conversion utilities and new MLP expert module definitions (Qwen/Mixtral) for unfusing experts.
- Extend calibration/AWQ layer-type mappings for Qwen3 MoE and Qwen3.5 model families.
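For concreteness, here is a minimal sketch of how the new data-free path might be invoked through the lite API, assuming the Python keyword arguments mirror the CLI flags (`calib_ds_req` from `--no-calib-ds-req`, `mod_skip_quant` from `--mod-skip-quant`); the exact parameter names, defaults, and checkpoint are illustrative and may differ from the merged code:

```python
# Hypothetical usage sketch; parameter names are inferred from the CLI flags
# added in this PR and may not match the final API exactly.
from lmdeploy.lite.apis.auto_awq import auto_awq

auto_awq('Qwen/Qwen3-30B-A3B',          # placeholder MoE checkpoint
         work_dir='./qwen3-moe-w4a16',
         calib_ds_req=False,            # data-free: skip calibration entirely
         mod_skip_quant=['lm_head'])    # example module pattern left unquantized
```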
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| lmdeploy/lite/utils/convert_moe_params.py | New utility to convert fused MoE expert weights into ModuleList experts. |
| lmdeploy/lite/utils/batch_split.py | Update input splitting logic for position_embeddings to handle Qwen3.5 shapes. |
| lmdeploy/lite/utils/__init__.py | Export convert_moe_parameters from lite utils. |
| lmdeploy/lite/quantization/awq.py | Add Qwen3 MoE norm mapping; refine skip-module logic; allow extra skip patterns. |
| lmdeploy/lite/mlp_moe_modules/qwen.py | Add Qwen MoE expert MLP module for unfused expert representation. |
| lmdeploy/lite/mlp_moe_modules/mixtral.py | Add Mixtral MoE expert MLP module for unfused expert representation. |
| lmdeploy/lite/mlp_moe_modules/base.py | Introduce a registry for MoE expert conversion modules. |
| lmdeploy/lite/mlp_moe_modules/__init__.py | Import/register MoE module implementations. |
| lmdeploy/lite/apis/calibrate.py | Extend model-type maps for Qwen3 MoE / Qwen3.5 (layer/norm/head mapping). |
| lmdeploy/lite/apis/auto_awq.py | Add data-free quant path; MoE conversion invocation; pass skip patterns into quantization. |
| lmdeploy/cli/lite.py | Add CLI flags for no-calibration quantization and module-skip patterns. |
```python
from mmengine import Registry

CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])
```
CONVERT_MOE_MODELS is initialized with locations=['lmdeploy.lite.mlp_moe_modules.base'], but the actual registrations are in lmdeploy/lite/mlp_moe_modules/qwen.py and mixtral.py. Since base.py doesn't import those modules, the registry will stay empty unless callers import lmdeploy.lite.mlp_moe_modules elsewhere, causing CONVERT_MOE_MODELS.get(...) to return None and MoE conversion to silently never run. Consider changing locations to ['lmdeploy.lite.mlp_moe_modules'] (package) or importing the concrete modules in base.py/package init so the registrations are guaranteed to execute.
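As a tiny standalone illustration of the failure mode (not the PR's code): an mmengine registration only takes effect once the module containing the `@register_module` decorator is actually imported, which is why `locations` (or the package `__init__`) must point at the files that do the registering.

```python
# Sketch only: the registry "sees" a class at import time of the registering
# module, not when the registry itself is created.
from mmengine import Registry
import torch.nn as nn

CONVERT_MOE_MODELS = Registry('mlp moe module')  # locations omitted in this sketch

@CONVERT_MOE_MODELS.register_module(name='QwenMoeMLP')  # runs when this module is imported
class QwenMoeMLP(nn.Module):
    pass

assert CONVERT_MOE_MODELS.get('QwenMoeMLP') is QwenMoeMLP
```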
Suggested change:

```diff
-CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])
+CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules'])
```
```python
    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
    model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
    vl_model = None
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    return vl_model, model, tokenizer, work_dir
```
The no-calibration path (load_model) always uses load_hf_from_pretrained, which loads via AutoModelForCausalLM and always returns vl_model=None. This bypasses the VLM loading logic used in calibrate() (load_vl_model, language_model/llm extraction, etc.), so --no-calib-ds-req is likely to break for VLM/conditional-generation architectures such as Qwen3_5* (and will also skip save_vl_model). Suggest reusing get_task(...) + the same VLM/LLM loading branch as calibrate() (but skipping dataset calibration), rather than a separate loader.
Suggested change:

```diff
-    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
-    model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
-    vl_model = None
-    work_dir = Path(work_dir)
-    work_dir.mkdir(parents=True, exist_ok=True)
-    return vl_model, model, tokenizer, work_dir
+    work_dir = Path(work_dir)
+    work_dir.mkdir(parents=True, exist_ok=True)
+    vl_model = None
+    tokenizer = None
+    llm_model = None
+    # Reuse the task-aware VLM loading path when available so the no-calib
+    # flow can handle conditional-generation / multimodal architectures.
+    try:
+        from lmdeploy.archs import get_task
+        from lmdeploy.vl.model.builder import load_vl_model
+        _, pipeline_class = get_task(model)
+        is_vl_task = pipeline_class is not None and hasattr(pipeline_class, 'is_vl') and pipeline_class.is_vl
+        if is_vl_task:
+            vl_model = load_vl_model(model, backend='huggingface')
+            if hasattr(vl_model, 'language_model'):
+                llm_model = vl_model.language_model
+            elif hasattr(vl_model, 'llm'):
+                llm_model = vl_model.llm
+            else:
+                raise AttributeError('Cannot find language model in loaded VLM.')
+            if hasattr(vl_model, 'tokenizer') and vl_model.tokenizer is not None:
+                tokenizer = vl_model.tokenizer
+            elif hasattr(vl_model, 'processor') and hasattr(vl_model.processor, 'tokenizer'):
+                tokenizer = vl_model.processor.tokenizer
+    except Exception:
+        # Fall back to the original text-only loading path if task-aware VLM
+        # loading is unavailable or the model is not a VLM.
+        vl_model = None
+        llm_model = None
+        tokenizer = None
+    if tokenizer is None:
+        tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
+    if llm_model is None:
+        llm_model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
+    return vl_model, llm_model, tokenizer, work_dir
```
```python
    else:
        vl_model, model, tokenizer, work_dir = load_model(model, dtype, work_dir)

    layer_type = LAYER_TYPE_MAP[type(model).__name__]
```
When calib_ds_req=False, this code bypasses calibrate()'s supported-model validation and then does layer_type = LAYER_TYPE_MAP[type(model).__name__], which will raise a raw KeyError for unsupported/renamed model classes. Consider adding the same explicit check and user-facing RuntimeError message that calibrate() uses (or reusing calibrate()'s model-type validation) so failures are actionable.
Suggested change:

```diff
-    layer_type = LAYER_TYPE_MAP[type(model).__name__]
+    model_type = type(model).__name__
+    if model_type not in LAYER_TYPE_MAP:
+        supported_model_types = ', '.join(sorted(LAYER_TYPE_MAP.keys()))
+        raise RuntimeError(
+            f'Unsupported model type: {model_type}. '
+            f'Supported model types are: {supported_model_types}.')
+    layer_type = LAYER_TYPE_MAP[model_type]
```
```python
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
    'Qwen3_5ForCausalLM': 'Qwen3_5DecoderLayer',
    'Qwen3_5MoeForCausalLM': 'Qwen3_5MoeDecoderLayer',
```
These new entries use Qwen3_5ForCausalLM / Qwen3_5MoeForCausalLM, but elsewhere in the repo Qwen3.5 is represented as Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration (e.g. lmdeploy/vl/model/qwen3_5.py, turbomind/supported_models.py). With the current keys, type(model).__name__ will likely never match and calibration/quantization will fail. Please align the map keys with the actual architecture class names (and update the corresponding NORM_TYPE_MAP / HEAD_NAME_MAP entries too).
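If the class names above are right, the fix would look roughly like the following (a sketch only: a stand-in dict is shown here, the real `LAYER_TYPE_MAP` lives in lmdeploy/lite/apis/calibrate.py, and the decoder-layer names on the right are simply carried over from the PR's current entries):

```python
# Hypothetical corrected keys using the ConditionalGeneration class names
# mentioned above; NORM_TYPE_MAP and HEAD_NAME_MAP would need the same change.
LAYER_TYPE_MAP = {
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
    'Qwen3_5ForConditionalGeneration': 'Qwen3_5DecoderLayer',
    'Qwen3_5MoeForConditionalGeneration': 'Qwen3_5MoeDecoderLayer',
}
```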
```python
    'Qwen3MoeDecoderLayer': {
        'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
        'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
    },
```
NORM_FCS_MAP adds support for Qwen3MoeDecoderLayer, but FC_FCS_MAP has no corresponding Qwen3MoeDecoderLayer entry. In auto_awq with calib_ds_req=True (default), it does fc2fcs = FC_FCS_MAP[layer_type], so Qwen3-MoE will currently hit a KeyError during smoothing/AWQ. Similarly, the newly added Qwen3.5 layer types in calibrate.py need entries in both maps to work with calibration-based AWQ.
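For reference, one hypothetical shape of the missing entry, modeled on the existing dense-Qwen mapping (stand-in dict shown; the real map lives in lmdeploy/lite/quantization/awq.py, and whether and how the unfused expert projections should appear would need to be verified against the converted module names):

```python
# Sketch of a possible FC_FCS_MAP entry for Qwen3MoeDecoderLayer.
FC_FCS_MAP = {
    'Qwen3MoeDecoderLayer': {
        'self_attn.v_proj': ['self_attn.o_proj'],
        # Expert projections would go here once experts are unfused into an
        # nn.ModuleList, e.g. 'mlp.experts.{i}.up_proj': ['mlp.experts.{i}.down_proj'].
    },
}
```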
```python
        return False
```
config_contains_keyword claims to recursively search config keys or string values, but the implementation only recurses into dict values and ignores lists/tuples/strings entirely. For HF configs, to_dict() commonly contains nested lists/dicts, so this can incorrectly return False and prevent MoE detection/conversion. Update search() to handle dict, list/tuple, and str (and optionally other primitive types via str(obj)), consistent with the docstring.
Suggested change:

```diff
-        return False
+            return False
+        if isinstance(obj, (list, tuple)):
+            for item in obj:
+                if search(item):
+                    return True
+            return False
+        if isinstance(obj, str):
+            return keyword in obj.lower()
+        if obj is None:
+            return False
+        return keyword in str(obj).lower()
```
```python
                  search_scale=search_scale,
                  dtype=dtype,
                  batch_size=batch_size)
    input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)
```
input_stats is loaded here but never used (it gets reloaded inside the later if calib_ds_req: block). This is unnecessary IO and can noticeably slow down quantization for large stats files; consider removing this load or using the already-loaded input_stats later.
Suggested change:

```diff
-    input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)
```
```python
    is_moe = (
        'moe' in model.config.model_type.lower() or
        config_contains_keyword(model.config, 'experts')
    )
    for l_name, layer in layers.items():
        if is_moe:
            convert_moe_parameters(model_path, layer)
        name2fc = collect_target_modules(layer, nn.Linear, prefix=l_name)
```
convert_moe_parameters(model_path, layer) is called once per decoder layer, but convert_moe_parameters recomputes the registered model name from model_path each time (which calls get_model_arch(...) and reads config). This becomes O(num_layers) config parsing overhead. Consider computing model_name once in auto_awq (or once in convert_moe_parameters via caching) and passing it down so conversion stays cheap.
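One way to keep the per-layer loop cheap, sketched under the assumption that the expensive part is re-reading the checkpoint config for every layer (the helper name below is illustrative, not the PR's):

```python
from functools import lru_cache

# Illustrative caching shim: the architecture name is resolved from the
# checkpoint config once per model_path, no matter how many decoder layers
# the conversion is applied to afterwards.
@lru_cache(maxsize=None)
def _cached_model_arch(model_path: str) -> str:
    from transformers import AutoConfig
    cfg = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    return cfg.architectures[0]
```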
```python
def match_builtin_skkiped_pattern(name: str, pattern: str):
    if pattern ==  'lora':
        return pattern in name
    return name == pattern or name.endswith(f'.{pattern}') or f'.{pattern}.' in name
```
Typo in helper name: match_builtin_skkiped_pattern has an extra 'k' ("skkiped") and also contains a double space in if pattern == 'lora':. Renaming to match_builtin_skipped_pattern (and updating call sites) would improve readability and avoid propagating the typo into future usages.
```python
    parser.add_argument('--no-calib-ds-req',
                        dest='calib_ds_req',
                        action='store_false',
                        default=True,
                        help='Require calibration dataset before quantizing weights. '
                        'Default to True. Set to False to skip calibration and directly quantize weights')
```
It's unnecessary to define another option.
We can use `calib_samples=0` to indicate data-free quantization.
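A small sketch of what that would look like on the CLI side (illustrative only: it simply reuses the existing `--calib-samples` option and treats `0` as the data-free case):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--calib-samples', type=int, default=128)
args = parser.parse_args(['--calib-samples', '0'])

# 0 samples == no calibration dataset required, i.e. data-free quantization
data_free = args.calib_samples == 0
print(data_free)  # True
```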
```python
                        default=True,
                        help='Require calibration dataset before quantizing weights. '
                        'Default to True. Set to False to skip calibration and directly quantize weights')
    parser.add_argument('--mod-skip-quant',
```
`--mod-skip-quant` is awkward, and the explicit `dest` is redundant.
```python
parser.add_argument('--exclude-modules',
                    nargs='+',
                    metavar='PATTERN',
                    default=None,
                    help='One or more module name patterns (glob-style) to exclude from quantization. '
                    'Example: --exclude-modules "*.lm_head" "transformer.layers.*.ffn"')
```

```python
        save_vl_model(vl_model, model_path, work_dir)
    else:
        model.save_pretrained(work_dir, safe_serialization=True)
    # model.save_pretrained(work_dir, safe_serialization=True)
```
You may remove this unused, commented-out code.
```python
def convert_experts(experts_mod: nn.Module, moemlp_cls) -> nn.ModuleList:
    """Convert fused MoE expert weights into a ModuleList of MLP experts
    without copying."""
    num_experts, intermediate_size_2, hidden_size = experts_mod.gate_up_proj.shape
```
Is it specifically for Qwen3-MoE?
```diff
@@ -0,0 +1,3 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from .mixtral import MixtralMoeMLP  # noqa: F401
```
"mlp_moe_modules" -> "moe_mlp_modules"
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some items, don't worry: just make the pull request and seek help from the maintainers.
Motivation
Add lite AWQ quantization for the Qwen3-MoE, Qwen3.5-dense, and Qwen3.5-MoE models. This PR only supports data-free AWQ quantization for these models.
Modification
- `lmdeploy/lmdeploy/lite/apis/auto_awq.py`: add data-free (no-calibration) quantization.
- `lmdeploy/lmdeploy/lite/mlp_moe_modules`: add modules that convert fused MoE experts to the ModuleList format when transformers ≥ 5.0 (see the sketch below).
- `lmdeploy/lmdeploy/lite/utils/convert_moe_params.py`: serves the same purpose as `mlp_moe_modules`.
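As referenced above, a rough end-to-end sketch of the conversion step, assembled from the review context (the import location and call signature of `convert_moe_parameters`, the checkpoint name, and the `model.model.layers` attribute path are assumptions and may differ):

```python
from transformers import AutoModelForCausalLM
from lmdeploy.lite.utils import convert_moe_parameters  # exported per the file summary above

model_path = 'Qwen/Qwen3-30B-A3B'  # placeholder fused-expert MoE checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

# Replace each decoder layer's fused expert tensors with an nn.ModuleList of
# per-expert MLPs so that module-wise AWQ can quantize each expert's Linear
# layers individually.
for layer in model.model.layers:
    convert_moe_parameters(model_path, layer)
```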
Use cases (Optional)
- Qwen3-MoE
- Qwen3.5-dense
- Qwen3.5-MoE
Checklist