Add Qwen3.5 Moe lite awq #4561

Open
43758726 wants to merge 1 commit into InternLM:main from 43758726:InternS2_preview_lite_awq

Conversation

@43758726
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Add lite AWQ quantization for the qwen3-moe, qwen3.5-dense, and qwen3.5-moe models. This PR only supports data-free AWQ quantization for these models.

Modification

lmdeploy/lmdeploy/lite/apis/auto_awq.py: Add data-free (no calibration) quantization.
lmdeploy/lmdeploy/lite/mlp_moe_modules: Convert fused MoE modules to ModuleList format when transformers ≥ 5.0.
lmdeploy/lmdeploy/lite/utils/convert_moe_params.py: Same as mlp_moe_modules.

Use cases (Optional)

Qwen3-moe

lmdeploy lite auto_awq 'model_path/repo' --no-calib-ds-req

Qwen3.5-dense

lmdeploy lite auto_awq 'model_path/repo' --no-calib-ds-req --mod-skip-quant visual linear_attn self_attn model.layers.0. mtp

Qwen3.5-moe

lmdeploy lite auto_awq 'model_path/repo' --no-calib-ds-req --mod-skip-quant visual linear_attn self_attn model.layers.0. mtp shared_expert
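For context: with --no-calib-ds-req no activation statistics are collected, so the AWQ path reduces to plain weight-only, group-wise quantization. A minimal pure-Python sketch of that core step (hypothetical helper names, not the lmdeploy implementation):

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric n-bit quantization of one weight group.

    Returns (q, scale, zero_point) with w ~= (q - zero_point) * scale.
    """
    qmax = (1 << n_bits) - 1                  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0     # avoid div-by-zero for flat groups
    zero_point = round(-w_min / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate float weights from the quantized group."""
    return [(v - zero_point) * scale for v in q]
```

The per-weight reconstruction error is bounded by roughly one quantization step (the scale), which is why a data-free path is viable when activation-aware scaling is unavailable.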

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings April 28, 2026 16:12
Contributor

Copilot AI left a comment


Pull request overview

This PR extends lmdeploy.lite AWQ quantization to better support Qwen3/Qwen3.5 MoE variants by adding a no-calibration (“data-free”) workflow, new module-skip options, and utilities to convert fused MoE experts into an unfused ModuleList form for module-wise quantization.

Changes:

  • Add --no-calib-ds-req (skip calibration) and --mod-skip-quant (skip specific modules) to the lite AWQ CLI/API.
  • Add MoE expert conversion utilities and new MLP expert module definitions (Qwen/Mixtral) for unfusing experts.
  • Extend calibration/AWQ layer-type mappings for Qwen3 MoE and Qwen3.5 model families.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
lmdeploy/lite/utils/convert_moe_params.py New utility to convert fused MoE expert weights into ModuleList experts.
lmdeploy/lite/utils/batch_split.py Update input splitting logic for position_embeddings to handle Qwen3.5 shapes.
lmdeploy/lite/utils/__init__.py Export convert_moe_parameters from lite utils.
lmdeploy/lite/quantization/awq.py Add Qwen3 MoE norm mapping; refine skip-module logic; allow extra skip patterns.
lmdeploy/lite/mlp_moe_modules/qwen.py Add Qwen MoE expert MLP module for unfused expert representation.
lmdeploy/lite/mlp_moe_modules/mixtral.py Add Mixtral MoE expert MLP module for unfused expert representation.
lmdeploy/lite/mlp_moe_modules/base.py Introduce a registry for MoE expert conversion modules.
lmdeploy/lite/mlp_moe_modules/__init__.py Import/register MoE module implementations.
lmdeploy/lite/apis/calibrate.py Extend model-type maps for Qwen3 MoE / Qwen3.5 (layer/norm/head mapping).
lmdeploy/lite/apis/auto_awq.py Add data-free quant path; MoE conversion invocation; pass skip patterns into quantization.
lmdeploy/cli/lite.py Add CLI flags for no-calibration quantization and module-skip patterns.



from mmengine import Registry

CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])

Copilot AI Apr 28, 2026


CONVERT_MOE_MODELS is initialized with locations=['lmdeploy.lite.mlp_moe_modules.base'], but the actual registrations are in lmdeploy/lite/mlp_moe_modules/qwen.py and mixtral.py. Since base.py doesn't import those modules, the registry will stay empty unless callers import lmdeploy.lite.mlp_moe_modules elsewhere, causing CONVERT_MOE_MODELS.get(...) to return None and MoE conversion to silently never run. Consider changing locations to ['lmdeploy.lite.mlp_moe_modules'] (package) or importing the concrete modules in base.py/package init so the registrations are guaranteed to execute.

Suggested change
CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules.base'])
CONVERT_MOE_MODELS = Registry('mlp moe module', locations=['lmdeploy.lite.mlp_moe_modules'])

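To make this concrete, here is a pure-Python sketch of a location-aware registry (a simplified, hypothetical stand-in for mmengine's Registry, not its real API): get() lazily imports only the configured locations, so registrations living outside those modules never execute.

```python
import importlib


class LazyRegistry:
    """Simplified, hypothetical stand-in for a location-aware registry."""

    def __init__(self, name, locations=None):
        self.name = name
        self.locations = list(locations or [])
        self._modules = {}
        self._imported = False

    def register_module(self, name=None):
        def decorator(cls):
            # Registration runs as a side effect of importing the defining module.
            self._modules[name or cls.__name__] = cls
            return cls
        return decorator

    def get(self, key):
        if not self._imported:
            # Only these locations are imported; modules outside them never
            # get a chance to run their @register_module decorators.
            for loc in self.locations:
                importlib.import_module(loc)
            self._imported = True
        return self._modules.get(key)  # None when nothing registered the key


DEMO = LazyRegistry('mlp moe module')  # no locations: register explicitly below


@DEMO.register_module(name='QwenMoeMLP')
class QwenMoeMLPStub:
    pass
```

If qwen.py and mixtral.py were registered this way but sat outside `locations`, `get(...)` would silently return None, which is exactly the failure mode described above.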
Comment on lines +23 to +30
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
vl_model = None
work_dir = Path(work_dir)
work_dir.mkdir(parents=True, exist_ok=True)
return vl_model, model, tokenizer, work_dir



Copilot AI Apr 28, 2026


The no-calibration path (load_model) always uses load_hf_from_pretrained, which loads via AutoModelForCausalLM and always returns vl_model=None. This bypasses the VLM loading logic used in calibrate() (load_vl_model, language_model/llm extraction, etc.), so --no-calib-ds-req is likely to break for VLM/conditional-generation architectures such as Qwen3_5* (and will also skip save_vl_model). Suggest reusing get_task(...) + the same VLM/LLM loading branch as calibrate() (but skipping dataset calibration), rather than a separate loader.

Suggested change
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
vl_model = None
work_dir = Path(work_dir)
work_dir.mkdir(parents=True, exist_ok=True)
return vl_model, model, tokenizer, work_dir
work_dir = Path(work_dir)
work_dir.mkdir(parents=True, exist_ok=True)
vl_model = None
tokenizer = None
llm_model = None
# Reuse the task-aware VLM loading path when available so the no-calib
# flow can handle conditional-generation / multimodal architectures.
try:
    from lmdeploy.archs import get_task
    from lmdeploy.vl.model.builder import load_vl_model
    _, pipeline_class = get_task(model)
    is_vl_task = pipeline_class is not None and hasattr(pipeline_class, 'is_vl') and pipeline_class.is_vl
    if is_vl_task:
        vl_model = load_vl_model(model, backend='huggingface')
        if hasattr(vl_model, 'language_model'):
            llm_model = vl_model.language_model
        elif hasattr(vl_model, 'llm'):
            llm_model = vl_model.llm
        else:
            raise AttributeError('Cannot find language model in loaded VLM.')
        if hasattr(vl_model, 'tokenizer') and vl_model.tokenizer is not None:
            tokenizer = vl_model.tokenizer
        elif hasattr(vl_model, 'processor') and hasattr(vl_model.processor, 'tokenizer'):
            tokenizer = vl_model.processor.tokenizer
except Exception:
    # Fall back to the original text-only loading path if task-aware VLM
    # loading is unavailable or the model is not a VLM.
    vl_model = None
    llm_model = None
    tokenizer = None
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
if llm_model is None:
    llm_model = load_hf_from_pretrained(model, dtype=dtype, trust_remote_code=True)
return vl_model, llm_model, tokenizer, work_dir

else:
vl_model, model, tokenizer, work_dir = load_model(model, dtype, work_dir)

layer_type = LAYER_TYPE_MAP[type(model).__name__]

Copilot AI Apr 28, 2026


When calib_ds_req=False, this code bypasses calibrate()'s supported-model validation and then does layer_type = LAYER_TYPE_MAP[type(model).__name__], which will raise a raw KeyError for unsupported/renamed model classes. Consider adding the same explicit check and user-facing RuntimeError message that calibrate() uses (or reusing calibrate()'s model-type validation) so failures are actionable.

Suggested change
layer_type = LAYER_TYPE_MAP[type(model).__name__]
model_type = type(model).__name__
if model_type not in LAYER_TYPE_MAP:
    supported_model_types = ', '.join(sorted(LAYER_TYPE_MAP.keys()))
    raise RuntimeError(
        f'Unsupported model type: {model_type}. '
        f'Supported model types are: {supported_model_types}.')
layer_type = LAYER_TYPE_MAP[model_type]

Comment on lines +21 to +23
'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
'Qwen3_5ForCausalLM': 'Qwen3_5DecoderLayer',
'Qwen3_5MoeForCausalLM': 'Qwen3_5MoeDecoderLayer',

Copilot AI Apr 28, 2026


These new entries use Qwen3_5ForCausalLM / Qwen3_5MoeForCausalLM, but elsewhere in the repo Qwen3.5 is represented as Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration (e.g. lmdeploy/vl/model/qwen3_5.py, turbomind/supported_models.py). With the current keys, type(model).__name__ will likely never match and calibration/quantization will fail. Please align the map keys with the actual architecture class names (and update the corresponding NORM_TYPE_MAP / HEAD_NAME_MAP entries too).

Comment on lines +35 to +38
'Qwen3MoeDecoderLayer': {
'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
},

Copilot AI Apr 28, 2026


NORM_FCS_MAP adds support for Qwen3MoeDecoderLayer, but FC_FCS_MAP has no corresponding Qwen3MoeDecoderLayer entry. In auto_awq with calib_ds_req=True (default), it does fc2fcs = FC_FCS_MAP[layer_type], so Qwen3-MoE will currently hit a KeyError during smoothing/AWQ. Similarly, the newly added Qwen3.5 layer types in calibrate.py need entries in both maps to work with calibration-based AWQ.

Comment on lines +47 to +49

return False


Copilot AI Apr 28, 2026


config_contains_keyword claims to recursively search config keys or string values, but the implementation only recurses into dict values and ignores lists/tuples/strings entirely. For HF configs, to_dict() commonly contains nested lists/dicts, so this can incorrectly return False and prevent MoE detection/conversion. Update search() to handle dict, list/tuple, and str (and optionally other primitive types via str(obj)), consistent with the docstring.

Suggested change
return False
return False
if isinstance(obj, (list, tuple)):
    for item in obj:
        if search(item):
            return True
    return False
if isinstance(obj, str):
    return keyword in obj.lower()
if obj is None:
    return False
return keyword in str(obj).lower()

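A self-contained version of the behaviour the suggestion describes (function name taken from the PR; the real implementation operates on a HF config's to_dict() output):

```python
def config_contains_keyword(config, keyword):
    """Recursively search a nested config for `keyword` in dict keys or
    string values, descending into dicts, lists, and tuples."""
    keyword = keyword.lower()

    def search(obj):
        if isinstance(obj, dict):
            # Match on keys as well as recursing into values.
            for k, v in obj.items():
                if keyword in str(k).lower() or search(v):
                    return True
            return False
        if isinstance(obj, (list, tuple)):
            return any(search(item) for item in obj)
        if isinstance(obj, str):
            return keyword in obj.lower()
        return False  # numbers, None, etc. cannot contain the keyword

    return search(config)
```

With only the dict branch (as in the original patch), a config like `{'architectures': ['Qwen3MoeForCausalLM']}` would never match 'moe', which is the silent-miss the review warns about.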
search_scale=search_scale,
dtype=dtype,
batch_size=batch_size)
input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)

Copilot AI Apr 28, 2026


input_stats is loaded here but never used (it gets reloaded inside the later if calib_ds_req: block). This is unnecessary IO and can noticeably slow down quantization for large stats files; consider removing this load or using the already-loaded input_stats later.

Suggested change
input_stats = torch.load(osp.join(work_dir, 'inputs_stats.pth'), weights_only=True)

Comment on lines +145 to 152
is_moe = (
    'moe' in model.config.model_type.lower() or
    config_contains_keyword(model.config, 'experts')
)
for l_name, layer in layers.items():
    if is_moe:
        convert_moe_parameters(model_path, layer)
    name2fc = collect_target_modules(layer, nn.Linear, prefix=l_name)

Copilot AI Apr 28, 2026


convert_moe_parameters(model_path, layer) is called once per decoder layer, but convert_moe_parameters recomputes the registered model name from model_path each time (which calls get_model_arch(...) and reads config). This becomes O(num_layers) config parsing overhead. Consider computing model_name once in auto_awq (or once in convert_moe_parameters via caching) and passing it down so conversion stays cheap.

Comment on lines +130 to +133
def match_builtin_skkiped_pattern(name: str, pattern: str):
    if pattern ==  'lora':
        return pattern in name
    return name == pattern or name.endswith(f'.{pattern}') or f'.{pattern}.' in name

Copilot AI Apr 28, 2026


Typo in helper name: match_builtin_skkiped_pattern has an extra 'k' ("skkiped") and also contains a double space in if pattern == 'lora':. Renaming to match_builtin_skipped_pattern (and updating call sites) would improve readability and avoid propagating the typo into future usages.

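With the typo fixed, the matcher can be exercised standalone; note that non-'lora' patterns must match a full dotted path component, not an arbitrary substring:

```python
def match_builtin_skipped_pattern(name: str, pattern: str) -> bool:
    """Decide whether module `name` matches a skip pattern (sketch of the
    helper above, with the 'skkiped' typo corrected).

    'lora' matches as a plain substring; any other pattern must match the
    whole name, a trailing '.pattern' suffix, or a '.pattern.' component.
    """
    if pattern == 'lora':
        return pattern in name
    return name == pattern or name.endswith(f'.{pattern}') or f'.{pattern}.' in name
```

So 'self_attn' skips `model.layers.0.self_attn.q_proj`, but 'attn' does not, because 'attn' is only a fragment of the `self_attn` component.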
Comment thread lmdeploy/cli/lite.py
Comment on lines +42 to +47
parser.add_argument('--no-calib-ds-req',
dest='calib_ds_req',
action='store_false',
default=True,
help='Require calibration dataset before quantizing weights. '
'Default to True. Set to False to skip calibration and directly quantize weights')
Collaborator


It's unnecessary to define another option.
We can use calib_samples=0 to indicate data-free quantization.

Comment thread lmdeploy/cli/lite.py
default=True,
help='Require calibration dataset before quantizing weights. '
'Default to True. Set to False to skip calibration and directly quantize weights')
parser.add_argument('--mod-skip-quant',
Collaborator


--mod-skip-quant is awkward, and dest is redundant.

parser.add_argument('--exclude-modules',
                    nargs='+',
                    metavar='PATTERN',
                    default=None,
                    help='One or more module name patterns (glob-style) to exclude from quantization. '
                         'Example: --exclude-modules "*.lm_head" "transformer.layers.*.ffn"')

    save_vl_model(vl_model, model_path, work_dir)
else:
    model.save_pretrained(work_dir, safe_serialization=True)
    # model.save_pretrained(work_dir, safe_serialization=True)
Collaborator


You may remove the unused commented-out code.

def convert_experts(experts_mod: nn.Module, moemlp_cls) -> nn.ModuleList:
    """Convert fused MoE expert weights into a ModuleList of MLP experts
    without copying."""
    num_experts, intermediate_size_2, hidden_size = experts_mod.gate_up_proj.shape
Collaborator


Is it specifically for Qwen3-MoE?
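For reference, the no-copy unfusing idea can be sketched as follows, assuming the layout implied by the shape above (gate_up_proj: [num_experts, 2 * intermediate_size, hidden_size]; a down_proj of [num_experts, hidden_size, intermediate_size] is also assumed, as is the helper name):

```python
import torch
from torch import nn


def unfuse_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """Split fused expert weights into per-expert Linear modules whose
    weights are views of the fused tensors, so no weight memory is copied."""
    num_experts, inter2, hidden = gate_up_proj.shape
    inter = inter2 // 2
    experts = nn.ModuleList()
    for i in range(num_experts):
        mlp = nn.Module()
        mlp.gate_proj = nn.Linear(hidden, inter, bias=False)
        mlp.up_proj = nn.Linear(hidden, inter, bias=False)
        mlp.down_proj = nn.Linear(inter, hidden, bias=False)
        # Slices of a detached tensor share storage with the original,
        # so each Linear's weight is a view, not a copy.
        mlp.gate_proj.weight = nn.Parameter(gate_up_proj.detach()[i, :inter], requires_grad=False)
        mlp.up_proj.weight = nn.Parameter(gate_up_proj.detach()[i, inter:], requires_grad=False)
        mlp.down_proj.weight = nn.Parameter(down_proj.detach()[i], requires_grad=False)
        experts.append(mlp)
    return experts
```

Whether the 2 * intermediate_size axis interleaves gate/up or stacks them contiguously (as assumed here) is model-specific, which is exactly why pinning down whether this layout is Qwen3-MoE-only matters.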

@@ -0,0 +1,3 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .mixtral import MixtralMoeMLP # noqa: F401
Collaborator


"mlp_moe_modules" -> "moe_mlp_modules"
