Add Qwen3.5 MoE lite awq #4561
Conversation
Pull request overview
This PR extends lmdeploy.lite AWQ quantization to better support Qwen3/Qwen3.5 MoE variants by adding a no-calibration (“data-free”) workflow, new module-skip options, and utilities to convert fused MoE experts into an unfused ModuleList form for module-wise quantization.
Changes:
- Add `--no-calib-ds-req` (skip calibration) and `--mod-skip-quant` (skip specific modules) to the lite AWQ CLI/API (see the sketch after this list).
- Add MoE expert conversion utilities and new MLP expert module definitions (Qwen/Mixtral) for unfusing experts.
- Extend calibration/AWQ layer-type mappings for Qwen3 MoE and Qwen3.5 model families.
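For concreteness, here is a minimal sketch of how the new data-free path might be invoked through the lite API, assuming the Python keyword arguments mirror the CLI flags (`calib_ds_req` from `--no-calib-ds-req`, `mod_skip_quant` from `--mod-skip-quant`); the exact parameter names, defaults, and checkpoint are illustrative and may differ from the merged code:

```python
# Hypothetical usage sketch; parameter names are inferred from the CLI flags
# added in this PR and may not match the final API exactly.
from lmdeploy.lite.apis.auto_awq import auto_awq

auto_awq('Qwen/Qwen3-30B-A3B',          # placeholder MoE checkpoint
         work_dir='./qwen3-moe-w4a16',
         calib_ds_req=False,            # data-free: skip calibration entirely
         mod_skip_quant=['lm_head'])    # example module pattern left unquantized
```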
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| lmdeploy/lite/utils/convert_moe_params.py | New utility to convert fused MoE expert weights into ModuleList experts. |
| lmdeploy/lite/utils/batch_split.py | Update input splitting logic for position_embeddings to handle Qwen3.5 shapes. |
| lmdeploy/lite/utils/__init__.py | Export convert_moe_parameters from lite utils. |
| lmdeploy/lite/quantization/awq.py | Add Qwen3 MoE norm mapping; refine skip-module logic; allow extra skip patterns. |
| lmdeploy/lite/mlp_moe_modules/qwen.py | Add Qwen MoE expert MLP module for unfused expert representation. |
| lmdeploy/lite/mlp_moe_modules/mixtral.py | Add Mixtral MoE expert MLP module for unfused expert representation. |
| lmdeploy/lite/mlp_moe_modules/base.py | Introduce a registry for MoE expert conversion modules. |
| lmdeploy/lite/mlp_moe_modules/__init__.py | Import/register MoE module implementations. |
| lmdeploy/lite/apis/calibrate.py | Extend model-type maps for Qwen3 MoE / Qwen3.5 (layer/norm/head mapping). |
| lmdeploy/lite/apis/auto_awq.py | Add data-free quant path; MoE conversion invocation; pass skip patterns into quantization. |
| lmdeploy/cli/lite.py | Add CLI flags for no-calibration quantization and module-skip patterns. |
```python
from mmengine import Registry

CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])
```
CONVERT_MOE_MODELS is initialized with locations=['lmdeploy.lite.mlp_moe_modules.base'], but the actual registrations are in lmdeploy/lite/mlp_moe_modules/qwen.py and mixtral.py. Since base.py doesn't import those modules, the registry will stay empty unless callers import lmdeploy.lite.mlp_moe_modules elsewhere, causing CONVERT_MOE_MODELS.get(...) to return None and MoE conversion to silently never run. Consider changing locations to ['lmdeploy.lite.mlp_moe_modules'] (package) or importing the concrete modules in base.py/package init so the registrations are guaranteed to execute.
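As a tiny standalone illustration of the failure mode (not the PR's code): an mmengine registration only takes effect once the module containing the `@register_module` decorator is actually imported, which is why `locations` (or the package `__init__`) must point at the files that do the registering.

```python
# Sketch only: the registry "sees" a class at import time of the registering
# module, not when the registry itself is created.
from mmengine import Registry
import torch.nn as nn

CONVERT_MOE_MODELS = Registry('mlp moe module')  # locations omitted in this sketch

@CONVERT_MOE_MODELS.register_module(name='QwenMoeMLP')  # runs when this module is imported
class QwenMoeMLP(nn.Module):
    pass

assert CONVERT_MOE_MODELS.get('QwenMoeMLP') is QwenMoeMLP
```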
Suggested change:

```diff
-CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])
+CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules'])
```
```python
    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
    model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
    vl_model = None
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    return vl_model, model, tokenizer, work_dir
```
The no-calibration path (load_model) always uses load_hf_from_pretrained, which loads via AutoModelForCausalLM and always returns vl_model=None. This bypasses the VLM loading logic used in calibrate() (load_vl_model, language_model/llm extraction, etc.), so --no-calib-ds-req is likely to break for VLM/conditional-generation architectures such as Qwen3_5* (and will also skip save_vl_model). Suggest reusing get_task(...) + the same VLM/LLM loading branch as calibrate() (but skipping dataset calibration), rather than a separate loader.
Suggested change:

```diff
-    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
-    model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
-    vl_model = None
-    work_dir = Path(work_dir)
-    work_dir.mkdir(parents=True, exist_ok=True)
-    return vl_model, model, tokenizer, work_dir
+    work_dir = Path(work_dir)
+    work_dir.mkdir(parents=True, exist_ok=True)
+    vl_model = None
+    tokenizer = None
+    llm_model = None
+    # Reuse the task-aware VLM loading path when available so the no-calib
+    # flow can handle conditional-generation / multimodal architectures.
+    try:
+        from lmdeploy.archs import get_task
+        from lmdeploy.vl.model.builder import load_vl_model
+        _, pipeline_class = get_task(model)
+        is_vl_task = pipeline_class is not None and hasattr(pipeline_class, 'is_vl') and pipeline_class.is_vl
+        if is_vl_task:
+            vl_model = load_vl_model(model, backend='huggingface')
+            if hasattr(vl_model, 'language_model'):
+                llm_model = vl_model.language_model
+            elif hasattr(vl_model, 'llm'):
+                llm_model = vl_model.llm
+            else:
+                raise AttributeError('Cannot find language model in loaded VLM.')
+            if hasattr(vl_model, 'tokenizer') and vl_model.tokenizer is not None:
+                tokenizer = vl_model.tokenizer
+            elif hasattr(vl_model, 'processor') and hasattr(vl_model.processor, 'tokenizer'):
+                tokenizer = vl_model.processor.tokenizer
+    except Exception:
+        # Fall back to the original text-only loading path if task-aware VLM
+        # loading is unavailable or the model is not a VLM.
+        vl_model = None
+        llm_model = None
+        tokenizer = None
+    if tokenizer is None:
+        tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
+    if llm_model is None:
+        llm_model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
+    return vl_model, llm_model, tokenizer, work_dir
```
```python
    else:
        vl_model, model, tokenizer, work_dir = load_model(model, dtype, work_dir)

    layer_type = LAYER_TYPE_MAP[type(model).__name__]
```
When calib_ds_req=False, this code bypasses calibrate()'s supported-model validation and then does layer_type = LAYER_TYPE_MAP[type(model).__name__], which will raise a raw KeyError for unsupported/renamed model classes. Consider adding the same explicit check and user-facing RuntimeError message that calibrate() uses (or reusing calibrate()'s model-type validation) so failures are actionable.
Suggested change:

```diff
-    layer_type = LAYER_TYPE_MAP[type(model).__name__]
+    model_type = type(model).__name__
+    if model_type not in LAYER_TYPE_MAP:
+        supported_model_types = ', '.join(sorted(LAYER_TYPE_MAP.keys()))
+        raise RuntimeError(
+            f'Unsupported model type: {model_type}. '
+            f'Supported model types are: {supported_model_types}.')
+    layer_type = LAYER_TYPE_MAP[model_type]
```
```python
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
    'Qwen3_5ForCausalLM': 'Qwen3_5DecoderLayer',
    'Qwen3_5MoeForCausalLM': 'Qwen3_5MoeDecoderLayer',
```
These new entries use Qwen3_5ForCausalLM / Qwen3_5MoeForCausalLM, but elsewhere in the repo Qwen3.5 is represented as Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration (e.g. lmdeploy/vl/model/qwen3_5.py, turbomind/supported_models.py). With the current keys, type(model).__name__ will likely never match and calibration/quantization will fail. Please align the map keys with the actual architecture class names (and update the corresponding NORM_TYPE_MAP / HEAD_NAME_MAP entries too).
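If the class names above are right, the fix would look roughly like the following (a sketch only: a stand-in dict is shown here, the real `LAYER_TYPE_MAP` lives in lmdeploy/lite/apis/calibrate.py, and the decoder-layer names on the right are simply carried over from the PR's current entries):

```python
# Hypothetical corrected keys using the ConditionalGeneration class names
# mentioned above; NORM_TYPE_MAP and HEAD_NAME_MAP would need the same change.
LAYER_TYPE_MAP = {
    'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
    'Qwen3_5ForConditionalGeneration': 'Qwen3_5DecoderLayer',
    'Qwen3_5MoeForConditionalGeneration': 'Qwen3_5MoeDecoderLayer',
}
```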
```python
    'Qwen3MoeDecoderLayer': {
        'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
        'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
    },
```
NORM_FCS_MAP adds support for Qwen3MoeDecoderLayer, but FC_FCS_MAP has no corresponding Qwen3MoeDecoderLayer entry. In auto_awq with calib_ds_req=True (default), it does fc2fcs = FC_FCS_MAP[layer_type], so Qwen3-MoE will currently hit a KeyError during smoothing/AWQ. Similarly, the newly added Qwen3.5 layer types in calibrate.py need entries in both maps to work with calibration-based AWQ.
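For reference, one hypothetical shape of the missing entry, modeled on the existing dense-Qwen mapping (stand-in dict shown; the real map lives in lmdeploy/lite/quantization/awq.py, and whether and how the unfused expert projections should appear would need to be verified against the converted module names):

```python
# Sketch of a possible FC_FCS_MAP entry for Qwen3MoeDecoderLayer.
FC_FCS_MAP = {
    'Qwen3MoeDecoderLayer': {
        'self_attn.v_proj': ['self_attn.o_proj'],
        # Expert projections would go here once experts are unfused into an
        # nn.ModuleList, e.g. 'mlp.experts.{i}.up_proj': ['mlp.experts.{i}.down_proj'].
    },
}
```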
```python
        return False
```
config_contains_keyword claims to recursively search config keys or string values, but the implementation only recurses into dict values and ignores lists/tuples/strings entirely. For HF configs, to_dict() commonly contains nested lists/dicts, so this can incorrectly return False and prevent MoE detection/conversion. Update search() to handle dict, list/tuple, and str (and optionally other primitive types via str(obj)), consistent with the docstring.
Suggested change:

```diff
-        return False
+            return False
+        if isinstance(obj, (list, tuple)):
+            for item in obj:
+                if search(item):
+                    return True
+            return False
+        if isinstance(obj, str):
+            return keyword in obj.lower()
+        if obj is None:
+            return False
+        return keyword in str(obj).lower()
```
```python
                  search_scale=search_scale,
                  dtype=dtype,
                  batch_size=batch_size)
    input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)
```
input_stats is loaded here but never used (it gets reloaded inside the later if calib_ds_req: block). This is unnecessary IO and can noticeably slow down quantization for large stats files; consider removing this load or using the already-loaded input_stats later.
Suggested change:

```diff
-    input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)
```
```python
    is_moe = (
        'moe' in model.config.model_type.lower() or
        config_contains_keyword(model.config, 'experts')
    )
    for l_name, layer in layers.items():
        if is_moe:
            convert_moe_parameters(model_path, layer)
        name2fc = collect_target_modules(layer, nn.Linear, prefix=l_name)
```
convert_moe_parameters(model_path, layer) is called once per decoder layer, but convert_moe_parameters recomputes the registered model name from model_path each time (which calls get_model_arch(...) and reads config). This becomes O(num_layers) config parsing overhead. Consider computing model_name once in auto_awq (or once in convert_moe_parameters via caching) and passing it down so conversion stays cheap.
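One way to keep the per-layer loop cheap, sketched under the assumption that the expensive part is re-reading the checkpoint config for every layer (the helper name below is illustrative, not the PR's):

```python
from functools import lru_cache

# Illustrative caching shim: the architecture name is resolved from the
# checkpoint config once per model_path, no matter how many decoder layers
# the conversion is applied to afterwards.
@lru_cache(maxsize=None)
def _cached_model_arch(model_path: str) -> str:
    from transformers import AutoConfig
    cfg = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    return cfg.architectures[0]
```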
```python
def match_builtin_skkiped_pattern(name: str, pattern: str):
    if pattern ==  'lora':
        return pattern in name
    return name == pattern or name.endswith(f'.{pattern}') or f'.{pattern}.' in name
```
Typo in helper name: match_builtin_skkiped_pattern has an extra 'k' ("skkiped") and also contains a double space in if pattern == 'lora':. Renaming to match_builtin_skipped_pattern (and updating call sites) would improve readability and avoid propagating the typo into future usages.
```python
    parser.add_argument('--no-calib-ds-req',
                        dest='calib_ds_req',
                        action='store_false',
                        default=True,
                        help='Require calibration dataset before quantizing weights. '
                        'Default to True. Set to False to skip calibration and directly quantize weights')
```
It's unnecessary to define another option.
We can use `calib_samples=0` to indicate data-free quantization.
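A small sketch of what that would look like on the CLI side (illustrative only: it simply reuses the existing `--calib-samples` option and treats `0` as the data-free case):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--calib-samples', type=int, default=128)
args = parser.parse_args(['--calib-samples', '0'])

# 0 samples == no calibration dataset required, i.e. data-free quantization
data_free = args.calib_samples == 0
print(data_free)  # True
```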
```python
                        default=True,
                        help='Require calibration dataset before quantizing weights. '
                        'Default to True. Set to False to skip calibration and directly quantize weights')
    parser.add_argument('--mod-skip-quant',
```
`--mod-skip-quant` is awkward, and the explicit `dest` is redundant.
```python
parser.add_argument('--exclude-modules',
                    nargs='+',
                    metavar='PATTERN',
                    default=None,
                    help='One or more module name patterns (glob-style) to exclude from quantization. '
                    'Example: --exclude-modules "*.lm_head" "transformer.layers.*.ffn"')
```

```python
        save_vl_model(vl_model, model_path, work_dir)
    else:
        model.save_pretrained(work_dir, safe_serialization=True)
    # model.save_pretrained(work_dir, safe_serialization=True)
```
You may remove this unused, commented-out code.
```python
def convert_experts(experts_mod: nn.Module, moemlp_cls) -> nn.ModuleList:
    """Convert fused MoE expert weights into a ModuleList of MLP experts
    without copying."""
    num_experts, intermediate_size_2, hidden_size = experts_mod.gate_up_proj.shape
```
Is it specifically for Qwen3-MoE?
```diff
@@ -0,0 +1,3 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+from .mixtral import MixtralMoeMLP  # noqa: F401
```
"mlp_moe_modules" -> "moe_mlp_modules"
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some items, don't worry: just make the pull request and seek help from the maintainers.
Motivation
Add lite AWQ quantization for the Qwen3-MoE, Qwen3.5-dense, and Qwen3.5-MoE models. This PR only supports data-free AWQ quantization for these models.
Modification
- `lmdeploy/lmdeploy/lite/apis/auto_awq.py`: add data-free (no-calibration) quantization.
- `lmdeploy/lmdeploy/lite/mlp_moe_modules`: add modules that convert fused MoE experts to the ModuleList format when transformers ≥ 5.0 (see the sketch below).
- `lmdeploy/lmdeploy/lite/utils/convert_moe_params.py`: serves the same purpose as `mlp_moe_modules`.
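As referenced above, a rough end-to-end sketch of the conversion step, assembled from the review context (the import location and call signature of `convert_moe_parameters`, the checkpoint name, and the `model.model.layers` attribute path are assumptions and may differ):

```python
from transformers import AutoModelForCausalLM
from lmdeploy.lite.utils import convert_moe_parameters  # exported per the file summary above

model_path = 'Qwen/Qwen3-30B-A3B'  # placeholder fused-expert MoE checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

# Replace each decoder layer's fused expert tensors with an nn.ModuleList of
# per-expert MLPs so that module-wise AWQ can quantize each expert's Linear
# layers individually.
for layer in model.model.layers:
    convert_moe_parameters(model_path, layer)
```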
Use cases (Optional)
- Qwen3-MoE
- Qwen3.5-dense
- Qwen3.5-MoE
Checklist