vllm.model_executor.layers.quantization.utils.marlin_utils
marlin_moe_intermediate_size
Given the Marlin-packed weight matrices w1_packed and w2_packed, return the MoE intermediate size N.
Source code in vllm/model_executor/layers/quantization/utils/marlin_utils.py
moe_packed_to_marlin_zero_points
moe_packed_to_marlin_zero_points(
q_zp_packed: Tensor,
size_k: int,
size_n: int,
num_bits: int,
is_a_8bit: bool = False,
)
Convert compressed-tensors packed zero points to Marlin format.
Unlike AWQ, compressed-tensors uses standard bit packing without interleaving, so we just unpack and apply the Marlin permutation directly.
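As a rough illustration of what "standard bit packing without interleaving" can mean, the sketch below packs and unpacks num_bits-wide values in 32-bit words in plain sequential order. It is not the vLLM implementation: the helper names pack_standard and unpack_standard are hypothetical, the lowest-bits-first ordering is an assumption, plain Python lists stand in for tensors, and the Marlin permutation step is omitted.

```python
def pack_standard(values, num_bits):
    """Pack unsigned ints into 32-bit words in sequential order.

    Assumption for illustration: element i of each group occupies
    bits [num_bits * i, num_bits * (i + 1)) of the word.
    """
    pack_factor = 32 // num_bits
    assert len(values) % pack_factor == 0, "length must fill whole words"
    mask = (1 << num_bits) - 1
    words = []
    for start in range(0, len(values), pack_factor):
        word = 0
        for i, value in enumerate(values[start:start + pack_factor]):
            word |= (value & mask) << (num_bits * i)
        words.append(word)
    return words


def unpack_standard(words, num_bits):
    """Inverse of pack_standard: read the fields back in the same order."""
    pack_factor = 32 // num_bits
    mask = (1 << num_bits) - 1
    return [
        (word >> (num_bits * i)) & mask
        for word in words
        for i in range(pack_factor)
    ]


# Eight 4-bit zero points fit in one 32-bit word.
zero_points = [3, 7, 0, 15, 8, 1, 4, 9]
packed = pack_standard(zero_points, num_bits=4)
unpacked = unpack_standard(packed, num_bits=4)
```

Because the values come back in their original order, no de-interleaving pass is needed before a layout permutation is applied; an interleaved scheme would need an extra reordering step between the unpack and the permutation.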