[Feature] support pooling model dummy_run#4345
Jiang-Jia-Jun merged 22 commits into PaddlePaddle:develop from
from fastdeploy.engine.pooling_params import PoolingParams
from fastdeploy.engine.tasks import PoolingTask
Is it reasonable for a lower layer to import things from engine?
This follows what vLLM does, where the equivalent lives in vllm/tasks, so I put it under engine.
class FdModel(Protocol[T_co]):
    """The interface required for all models in FastDeploy."""
Which classes will inherit from FdModel, and how does it relate to ModelForCasualLM?
Only FdModelForPooling inherits from it, and it is unrelated to ModelForCasualLM. ModelForCasualLM has compute_logits, which pooling models do not compute.
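The relationship described in this reply can be sketched with `typing.Protocol`. This is only an illustration: `forward`, the `pooler` attribute on the protocol, and the two toy classes are assumptions made for the example, not the actual FastDeploy definitions.

```python
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class FdModel(Protocol[T_co]):
    """Base interface (sketch): anything that offers a forward pass."""

    def forward(self, input_ids: list) -> T_co: ...


@runtime_checkable
class FdModelForPooling(FdModel[T_co], Protocol[T_co]):
    """Pooling interface (sketch): adds a pooler, has no compute_logits."""

    pooler: object  # hypothetical member, assumed for illustration


class ToyEmbeddingModel:
    """Hypothetical pooling model: exposes a pooler."""

    def __init__(self) -> None:
        self.pooler = lambda hidden: sum(hidden) / len(hidden)

    def forward(self, input_ids: list) -> list:
        return [float(i) for i in input_ids]


class ToyCausalLM:
    """Hypothetical generative model: exposes compute_logits, no pooler."""

    def forward(self, input_ids: list) -> list:
        return [float(i) for i in input_ids]

    def compute_logits(self, hidden: list) -> list:
        return hidden


embedder = ToyEmbeddingModel()
generator = ToyCausalLM()
is_pooling = isinstance(embedder, FdModelForPooling)
generator_is_pooling = isinstance(generator, FdModelForPooling)
```

Because protocols are structural, neither toy class inherits from anything; the generative model simply fails the pooling check because it lacks a `pooler`.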
    [num_reqs, req_num_tokens],
    dtype="int32",
)
model = cast(FdModelForPooling, self.get_model())
Same as above: what is the relationship between FdModelForPooling and ModelForCasualLM, and is the cast really necessary?
This sets default values such as pooling_type when the user has not set them, so the cast is needed.
to_update = model.pooler.get_pooling_updates(task)
to_update.apply(dummy_pooling_params)
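A minimal sketch of the pattern in this thread, under stated assumptions: `typing.cast` does nothing at runtime and only tells the type checker that the object exposes `pooler`, after which defaults are filled in for fields the user left unset. Every `Toy*` class and the `SupportsPooling` protocol are hypothetical stand-ins, and the fill-only-if-unset behaviour of `apply` is an assumption based on the reply above, not the confirmed FastDeploy implementation.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, cast


@dataclass
class ToyPoolingParams:
    # None means "the user did not set this field".
    pooling_type: Optional[str] = None
    normalize: Optional[bool] = None


@dataclass
class ToyParamsUpdate:
    pooling_type: Optional[str] = None
    normalize: Optional[bool] = None

    def apply(self, params: ToyPoolingParams) -> None:
        # Assumed behaviour: only fill fields the user left unset.
        if params.pooling_type is None:
            params.pooling_type = self.pooling_type
        if params.normalize is None:
            params.normalize = self.normalize


class ToyPooler:
    def get_pooling_updates(self, task: str) -> ToyParamsUpdate:
        # Hypothetical defaults chosen per task.
        return ToyParamsUpdate(pooling_type="LAST", normalize=(task == "embed"))


class SupportsPooling(Protocol):
    pooler: ToyPooler


class ToyModel:
    def __init__(self) -> None:
        self.pooler = ToyPooler()


def get_model() -> object:
    # Returns a loosely typed object, like a generic model getter would.
    return ToyModel()


raw = get_model()
model = cast(SupportsPooling, raw)  # no runtime effect; same object
params = ToyPoolingParams(normalize=False)  # user set only `normalize`
model.pooler.get_pooling_updates("embed").apply(params)
```

After this runs, `params.pooling_type` holds the default while the user-provided `normalize=False` is preserved. Without the cast the code behaves identically at runtime, but a static type checker would reject the `.pooler` access on a plain `object`.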
cumsum = paddle.zeros([n_seq + 1], dtype="int64")
if cumsum.place.is_gpu_place():
    cumsum = cumsum.cpu()
Why not create the zeros tensor on the CPU directly?
self.attn_backends.append(attn_backend)

def _dummy_pooler_run_task(
Why is this split out into a separate _dummy_pooler_run_task instead of being implemented directly in _dummy_pooler_run?
self.speculative_decoding = self.speculative_method is not None
self.enable_logprob = fd_config.model_config.enable_logprob
self.enable_early_stop = self.fd_config.early_stop_config.enable_early_stop
self.is_pooling_model = self.fd_config.model_config.runner_type == "pooling"
Can one of self.is_pooling_model and is_pooling_model be removed? Is it necessary to keep both?
Removed is_pooling_model and kept self.is_pooling_model.
7d53ef8 to 7bae906
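Adds dummy_pooler_run support for pooling models, refactors the existing warm-up stage for generative models into dummy_sampler_run, and fixes a bug when loading Qwen3-Embedding-0.6B on a single GPU.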
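The split described in this summary might look roughly like the following. This is a hedged sketch only: `ToyRunner`, `dummy_run`, the method bodies, and the string return values are hypothetical, and only the `is_pooling_model` flag and the pooler/sampler naming come from the PR.

```python
class ToyRunner:
    """Hypothetical stand-in for a model runner with a warm-up step."""

    def __init__(self, runner_type: str) -> None:
        # Mirrors the flag in the diff: derived once from runner_type.
        self.is_pooling_model = runner_type == "pooling"

    def _dummy_pooler_run(self, hidden_states: list) -> str:
        # A pooling model would pool the hidden states here.
        return f"pooler:{len(hidden_states)}"

    def _dummy_sampler_run(self, hidden_states: list) -> str:
        # A generative model would compute logits and sample here.
        return f"sampler:{len(hidden_states)}"

    def dummy_run(self, num_tokens: int) -> str:
        # The dummy forward pass is shared; only the final stage differs.
        hidden_states = [0.0] * num_tokens
        if self.is_pooling_model:
            return self._dummy_pooler_run(hidden_states)
        return self._dummy_sampler_run(hidden_states)


pooling_result = ToyRunner("pooling").dummy_run(4)
generate_result = ToyRunner("generate").dummy_run(4)
```

Keeping a single flag on the runner means the warm-up path is selected in one place, which matches the review outcome of retaining only `self.is_pooling_model`.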