[Feature] support pool #3827
Conversation
Thanks for your contribution!
Pull Request Overview
This PR implements pooling model support for FastDeploy by introducing configurable model runners and conversion mechanisms. The implementation enables embedding and pooling tasks while maintaining compatibility with existing text generation models.
Key changes:
- Introduces new runner types ("pooling", "generate") and conversion options ("embed", "none") with automatic detection
- Implements model registry refactoring with lazy loading and better architecture support
- Adds comprehensive pooling infrastructure including poolers, metadata, and output handling
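The lazy-loading idea behind the registry refactor can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual `ModelRegistry` API; class and method names here are assumptions:

```python
import importlib


class LazyModelRegistry:
    """Maps architecture names to model classes, importing modules only on first use."""

    def __init__(self):
        self._arch_to_loc = {}  # architecture -> (module_path, class_name)
        self._cache = {}        # architecture -> resolved class

    def register(self, architecture, module_path, class_name):
        # Registration only records strings; no import happens yet.
        self._arch_to_loc[architecture] = (module_path, class_name)

    def resolve(self, architecture):
        # Defer the (potentially heavy) module import until the model is requested.
        if architecture not in self._cache:
            module_path, class_name = self._arch_to_loc[architecture]
            module = importlib.import_module(module_path)
            self._cache[architecture] = getattr(module, class_name)
        return self._cache[architecture]
```

Registering every architecture up front while importing nothing keeps startup cheap and lets unsupported backends fail only when actually requested.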
Reviewed Changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/config.py | Core configuration for runner types, convert options, and pooler configuration |
| fastdeploy/model_executor/models/registry.py | New model registry with lazy loading and pooling model detection |
| fastdeploy/model_executor/models/adapters.py | Model conversion utilities for embedding and pooling models |
| fastdeploy/model_executor/layers/pooler.py | Pooling layer implementations with different pooling strategies |
| fastdeploy/transformer_utils/config.py | Configuration utilities for sentence transformers and pooling models |
| fastdeploy/worker/worker_process.py | Command line argument parsing for new pooling options |
| fastdeploy/model_executor/model_loader/default_loader_v1.py | Model loading with conversion support |
fastdeploy/model_executor/utils.py (Outdated)

```python
def is_pin_memory_available() -> bool:
    pass
```
Function is_pin_memory_available is incomplete, with just a pass statement. It will always return None instead of a boolean, which could cause issues wherever this function is used.
Suggested change (replace the `pass` body):

```python
# Pin memory is available if PaddlePaddle is compiled with CUDA support
return paddle.is_compiled_with_cuda()
```
```python
for loaded_weight_name, loaded_weight in weights_iterator:
    if "rotary_emb.inv_freq" in loaded_weight_name:
        continue
```
The hardcoded string check for 'rotary_emb.inv_freq' should be moved to a constant or configuration to improve maintainability and avoid magic strings scattered throughout the code.
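As a hedged sketch of that suggestion (the constant and helper names below are assumptions, not code from the PR):

```python
# Weight names matching this key are skipped during loading; keeping the
# string in one named constant avoids scattering the magic string.
ROTARY_INV_FREQ_KEY = "rotary_emb.inv_freq"


def should_skip_weight(name: str) -> bool:
    """Return True for checkpoint weights that the loader should ignore."""
    return ROTARY_INV_FREQ_KEY in name
```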
```python
try:
    loaded_weight = loaded_weight.reshape(linear.weight.shape)
except:
    continue
```
Using bare except: is not recommended as it catches all exceptions including system exits and keyboard interrupts. Use specific exception types like except (ValueError, RuntimeError): or at minimum except Exception:.
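A minimal sketch of the narrowed-exception pattern, using a toy tensor class so the example is self-contained (FakeTensor and try_reshape are illustrative, not FastDeploy code):

```python
class FakeTensor:
    """Stand-in for a real tensor, just enough to demonstrate reshape failure."""

    def __init__(self, numel):
        self.numel = numel

    def reshape(self, shape):
        expected = 1
        for d in shape:
            expected *= d
        if expected != self.numel:
            raise ValueError(f"cannot reshape {self.numel} elements into {shape}")
        return self


def try_reshape(tensor, shape):
    # Catch only the failures a reshape can raise; a bare `except:` would
    # also swallow KeyboardInterrupt and SystemExit.
    try:
        return tensor.reshape(shape)
    except (ValueError, RuntimeError):
        return None
```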
```python
if linear.bias.shape != loaded_bias.shape:
    try:
        loaded_bias = loaded_bias.reshape(linear.bias.shape)
    except:
```
Using bare except: is not recommended as it catches all exceptions including system exits and keyboard interrupts. Use specific exception types like except (ValueError, RuntimeError): or at minimum except Exception:.
Suggested change: replace `except:` with `except Exception:`.
```python
assert not pooling_cursor.is_partial_prefill(), "partial prefill not supported with MEAN pooling"

if hidden_states.place.is_gpu_place():
    prompt_lens = pooling_cursor.prompt_lens_cpu.cuda()
```
The .cuda() method call appears to be PyTorch-specific but this is a PaddlePaddle codebase. This should use PaddlePaddle's device placement methods like .gpu() or .to(device='gpu').
Suggested change:

```python
prompt_lens = pooling_cursor.prompt_lens_cpu.to(device='gpu')
```
```python
cumsum = paddle.zeros([n_seq + 1], dtype="int64", place=paddle.CPUPlace())
paddle.cumsum(num_scheduled_tokens, axis=0, out=cumsum[1:])
if device == "gpu":
    cumsum_device = cumsum.cuda()
```
The .cuda() method call appears to be PyTorch-specific but this is a PaddlePaddle codebase. This should use PaddlePaddle's device placement methods like .gpu() or .to(device='gpu').
Suggested change:

```python
cumsum_device = cumsum.to(device='gpu')
```
YuanRisheng left a comment:

This PR needs unit tests. A lot of new code is being added, and without unit tests the FD codebase's coverage will drop noticeably.
Force-pushed from 6c81eec to db0a4bf.
fastdeploy/model_executor/utils.py (Outdated)
```python
return {out_name: value for name, value in values.items() if (out_name := self._map_name(name)) is not None}


class AutoWeightsLoader:
```
Is this really necessary? What are the benefits? Please explain clearly.
fastdeploy/model_executor/utils.py (Outdated)
```python
@dataclass
class WeightsMapper:
```
Is this really necessary? What are the benefits? Please explain clearly.
This makes it convenient to rewrite names like model.layers.0.self_attn.o_proj.weight into layers.0.self_attn.o_proj.weight. In its current uses there are also substring replacements and suffix replacements. It could also be done with `weights = ((name[6:], data) for name, data in weights if name.startswith("model."))`, which works, but the WeightsMapper approach is more elegant.
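The prefix-rewriting behavior described in this reply can be sketched as follows. This is a simplified stand-in, not the PR's actual WeightsMapper:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleWeightsMapper:
    # checkpoint-name prefix -> model-name prefix
    orig_to_new_prefix: dict = field(default_factory=dict)

    def map_name(self, name: str) -> str:
        # Rewrite the first matching prefix; leave other names untouched.
        for old, new in self.orig_to_new_prefix.items():
            if name.startswith(old):
                return new + name[len(old):]
        return name
```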
```diff
-# 4. Execute spec decode
-logits = self.model.compute_logits(hidden_states)
+logits = None
+if hasattr(self.model, "is_pooling_model") and self.model.is_pooling_model:
+    pass
+else:
+    # 4. Execute spec decode
+    logits = self.model.compute_logits(hidden_states)
```
Wouldn't it be better to put this inside compute_logits? Inside compute_logits you have access to self.
Every model has its own compute_logits, so this would have to be written in each one. This is a temporary workaround; it will be removed in the next PR.
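The reviewer's idea of folding the pooling check into compute_logits could look like this base-class sketch (class names here are illustrative, not FastDeploy's actual hierarchy):

```python
class ModelBase:
    is_pooling_model = False  # subclasses override

    def compute_logits(self, hidden_states):
        # Pooling models produce no logits; short-circuiting here means
        # callers no longer need the hasattr(...) check.
        if self.is_pooling_model:
            return None
        return self._compute_logits_impl(hidden_states)

    def _compute_logits_impl(self, hidden_states):
        raise NotImplementedError


class DemoGenerativeModel(ModelBase):
    def _compute_logits_impl(self, hidden_states):
        return [h * 2 for h in hidden_states]  # toy stand-in for an lm_head


class DemoPoolingModel(ModelBase):
    is_pooling_model = True
```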
fastdeploy/worker/worker_process.py (Outdated)
```python
def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]:

    def _parse_type(val: str) -> T:
        try:
            return return_type(val)
        except ValueError as e:
            raise argparse.ArgumentTypeError(f"Value {val} cannot be converted to {return_type}.") from e

    return _parse_type


def optional_type(return_type: Callable[[str], T]) -> Callable[[str], Optional[T]]:

    def _optional_type(val: str) -> Optional[T]:
        if val == "" or val == "None":
            return None
        return parse_type(return_type)(val)
```
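As a self-contained usage sketch, helpers of this shape plug into argparse as `type=` callables; the flag name below is illustrative, not necessarily one the PR adds:

```python
import argparse
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]:
    def _parse_type(val: str) -> T:
        try:
            return return_type(val)
        except ValueError as e:
            raise argparse.ArgumentTypeError(f"Value {val} cannot be converted to {return_type}.") from e
    return _parse_type


def optional_type(return_type: Callable[[str], T]) -> Callable[[str], Optional[T]]:
    def _optional_type(val: str) -> Optional[T]:
        # "" and "None" on the command line both mean "no value"
        if val == "" or val == "None":
            return None
        return parse_type(return_type)(val)
    return _optional_type


parser = argparse.ArgumentParser()
parser.add_argument("--max-num-seqs", type=optional_type(int), default=None)
args = parser.parse_args(["--max-num-seqs", "None"])
```

Wrapping the conversion in ArgumentTypeError makes argparse print a clean usage error instead of a traceback when the value is malformed.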
```python
self.runner = "auto"
self.convert = "auto"
self.pooler_config: Optional["PoolerConfig"] = field(init=False)
self.override_pooler_config: Optional[Union[dict, "PoolerConfig"]] = None
self.revision = None
```
Why not add a separate PoolerConfig and move runner/convert etc. into it?
DispatchPooler is created in pooler.py; the ResolvedPoolingConfig inside it needs pooler_config in its from_config.
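For context, a separate PoolerConfig as discussed here might look like the dataclass below; the field names are guesses at what such a config could hold, not the PR's actual fields:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolerConfig:
    # Hypothetical fields, for illustration only.
    pooling_type: str = "LAST"        # e.g. "MEAN", "LAST", "CLS"
    normalize: Optional[bool] = None  # L2-normalize embeddings if True
```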
```python
def registry(self):
    from fastdeploy.model_executor.models.model_base import ModelRegistry

    return ModelRegistry()
```
Why does model_config return a ModelRegistry?
The methods I wrote are not class methods; the old ones are still class methods.
fastdeploy/model_executor/utils.py (Outdated)

```python
from fastdeploy.utils import get_logger

logger = get_logger("utils", "utils.log")
```
fastdeploy/model_executor/utils.py (Outdated)

```python
def is_pin_memory_available() -> bool:
    pass
```
fastdeploy/worker/worker_process.py (Outdated)

```python
T = TypeVar("T")
```
```python
@ModelRegistry.register_model_class(
    architecture="Qwen2_5_VLForConditionalGeneration",
    module_path="qwen2_5_vl",
```
I recall the module_path written before was qwen2_5_vl.qwen2_5_vl — is it no longer necessary to specify the directory?
Force-pushed from 5f0ecb6 to 27ec018.
Pooling model support spans four modules in total: ModelConfig, Model Loader, and Model Runner, the latter split into model warmup and model execution.
This PR completes ModelConfig and ModelLoader; currently only runner set to pooling with convert set to embed is supported.
Completed in this PR:
1. The service can be started with runner set to pooling; convert can also be passed as embed, and when it is omitted the convert type is inferred from the model files.
2. Detection of whether a model is generative or a pooling model; if it is generative and runner is set to pooling, model conversion is required.
3. Model conversion: a generative model can be converted into a pooling model by deleting the ParallelLMHead weights, replacing the architectures suffix with ForEmbedding, and adding a DispatchPooler layer; pooling_type determines which pooling layer is used.
4. Qwen3-Embedding-0.6B loads successfully on a single card; multi-card still fails with an error.
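The architectures rewrite in step 3 can be sketched as below; the function name and the list of generative suffixes are assumptions for illustration, not the PR's actual code:

```python
def convert_architectures(architectures):
    """Rewrite generative architecture names to their embedding variants,
    e.g. Qwen3ForCausalLM -> Qwen3ForEmbedding."""
    generative_suffixes = ("ForCausalLM", "ForConditionalGeneration")
    converted = []
    for arch in architectures:
        for suffix in generative_suffixes:
            if arch.endswith(suffix):
                arch = arch[: -len(suffix)] + "ForEmbedding"
                break
        converted.append(arch)
    return converted
```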
To do:
1. Model Runner:
   1. model warmup phase
   2. model execution phase
To resolve:
Loading Qwen3-Embedding-0.6B with tp>1 raises an error.
Launching a pooling task:
Qwen3-0.6B generative model conversion process: internally, convert is resolved to embed or to reward/score (the latter two are not yet supported). The generative model is converted into an embedding model by deleting the lm_head weights, changing the architectures suffix to ForEmbedding, and adding a DispatchPooler layer; after conversion, self.model is as shown in the image below.
