[Feature] support pool #3827
Conversation
Thanks for your contribution!
Pull Request Overview
This PR implements pooling model support for FastDeploy by introducing configurable model runners and conversion mechanisms. The implementation enables embedding and pooling tasks while maintaining compatibility with existing text generation models.
Key changes:
- Introduces new runner types ("pooling", "generate") and conversion options ("embed", "none") with automatic detection
- Implements model registry refactoring with lazy loading and better architecture support
- Adds comprehensive pooling infrastructure including poolers, metadata, and output handling
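The lazy-loading idea behind the registry refactor can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual `ModelRegistry` API; class and method names here are assumptions:

```python
import importlib


class LazyModelRegistry:
    """Maps architecture names to model classes, importing modules only on first use."""

    def __init__(self):
        self._arch_to_loc = {}  # architecture -> (module_path, class_name)
        self._cache = {}        # architecture -> resolved class

    def register(self, architecture, module_path, class_name):
        # Registration only records strings; no import happens yet.
        self._arch_to_loc[architecture] = (module_path, class_name)

    def resolve(self, architecture):
        # Defer the (potentially heavy) module import until the model is requested.
        if architecture not in self._cache:
            module_path, class_name = self._arch_to_loc[architecture]
            module = importlib.import_module(module_path)
            self._cache[architecture] = getattr(module, class_name)
        return self._cache[architecture]
```

Registering every architecture up front while importing nothing keeps startup cheap and lets unsupported backends fail only when actually requested.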
Reviewed Changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/config.py | Core configuration for runner types, convert options, and pooler configuration |
| fastdeploy/model_executor/models/registry.py | New model registry with lazy loading and pooling model detection |
| fastdeploy/model_executor/models/adapters.py | Model conversion utilities for embedding and pooling models |
| fastdeploy/model_executor/layers/pooler.py | Pooling layer implementations with different pooling strategies |
| fastdeploy/transformer_utils/config.py | Configuration utilities for sentence transformers and pooling models |
| fastdeploy/worker/worker_process.py | Command line argument parsing for new pooling options |
| fastdeploy/model_executor/model_loader/default_loader_v1.py | Model loading with conversion support |
fastdeploy/model_executor/utils.py (Outdated)

```python
def is_pin_memory_available() -> bool:
    pass
```
Function is_pin_memory_available is incomplete, with just a pass statement. It will always return None instead of a boolean, which could cause issues wherever this function is used.
Suggested change (replace the `pass` body):

```python
# Pin memory is available if PaddlePaddle is compiled with CUDA support
return paddle.is_compiled_with_cuda()
```
```python
for loaded_weight_name, loaded_weight in weights_iterator:
    if "rotary_emb.inv_freq" in loaded_weight_name:
        continue
```
The hardcoded string check for 'rotary_emb.inv_freq' should be moved to a constant or configuration to improve maintainability and avoid magic strings scattered throughout the code.
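As a hedged sketch of that suggestion (the constant and helper names below are assumptions, not code from the PR):

```python
# Weight names matching this key are skipped during loading; keeping the
# string in one named constant avoids scattering the magic string.
ROTARY_INV_FREQ_KEY = "rotary_emb.inv_freq"


def should_skip_weight(name: str) -> bool:
    """Return True for checkpoint weights that the loader should ignore."""
    return ROTARY_INV_FREQ_KEY in name
```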
```python
try:
    loaded_weight = loaded_weight.reshape(linear.weight.shape)
except:
    continue
```
Using bare except: is not recommended as it catches all exceptions including system exits and keyboard interrupts. Use specific exception types like except (ValueError, RuntimeError): or at minimum except Exception:.
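A minimal sketch of the narrowed-exception pattern, using a toy tensor class so the example is self-contained (FakeTensor and try_reshape are illustrative, not FastDeploy code):

```python
class FakeTensor:
    """Stand-in for a real tensor, just enough to demonstrate reshape failure."""

    def __init__(self, numel):
        self.numel = numel

    def reshape(self, shape):
        expected = 1
        for d in shape:
            expected *= d
        if expected != self.numel:
            raise ValueError(f"cannot reshape {self.numel} elements into {shape}")
        return self


def try_reshape(tensor, shape):
    # Catch only the failures a reshape can raise; a bare `except:` would
    # also swallow KeyboardInterrupt and SystemExit.
    try:
        return tensor.reshape(shape)
    except (ValueError, RuntimeError):
        return None
```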
```python
if linear.bias.shape != loaded_bias.shape:
    try:
        loaded_bias = loaded_bias.reshape(linear.bias.shape)
    except:
```
Using bare except: is not recommended as it catches all exceptions including system exits and keyboard interrupts. Use specific exception types like except (ValueError, RuntimeError): or at minimum except Exception:.
Suggested change: replace `except:` with `except Exception:`.
```python
assert not pooling_cursor.is_partial_prefill(), "partial prefill not supported with MEAN pooling"

if hidden_states.place.is_gpu_place():
    prompt_lens = pooling_cursor.prompt_lens_cpu.cuda()
```
The .cuda() method call appears to be PyTorch-specific but this is a PaddlePaddle codebase. This should use PaddlePaddle's device placement methods like .gpu() or .to(device='gpu').
Suggested change:

```python
prompt_lens = pooling_cursor.prompt_lens_cpu.to(device='gpu')
```
```python
cumsum = paddle.zeros([n_seq + 1], dtype="int64", place=paddle.CPUPlace())
paddle.cumsum(num_scheduled_tokens, axis=0, out=cumsum[1:])
if device == "gpu":
    cumsum_device = cumsum.cuda()
```
The .cuda() method call appears to be PyTorch-specific but this is a PaddlePaddle codebase. This should use PaddlePaddle's device placement methods like .gpu() or .to(device='gpu').
Suggested change:

```python
cumsum_device = cumsum.to(device='gpu')
```
YuanRisheng left a comment:

This PR needs unit tests. A lot of new code is being added, and without unit tests the FD codebase's coverage will drop noticeably.
Force-pushed from 6c81eec to db0a4bf.
fastdeploy/model_executor/utils.py (Outdated)
```python
return {out_name: value for name, value in values.items() if (out_name := self._map_name(name)) is not None}


class AutoWeightsLoader:
```
Is this really necessary? What are the benefits? Please explain clearly.
fastdeploy/model_executor/utils.py (Outdated)
```python
@dataclass
class WeightsMapper:
```
Is this really necessary? What are the benefits? Please explain clearly.
This makes it convenient to rewrite names like model.layers.0.self_attn.o_proj.weight into layers.0.self_attn.o_proj.weight. In its current uses there are also substring replacements and suffix replacements. It could also be done with `weights = ((name[6:], data) for name, data in weights if name.startswith("model."))`, which works, but the WeightsMapper approach is more elegant.
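The prefix-rewriting behavior described in this reply can be sketched as follows. This is a simplified stand-in, not the PR's actual WeightsMapper:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleWeightsMapper:
    # checkpoint-name prefix -> model-name prefix
    orig_to_new_prefix: dict = field(default_factory=dict)

    def map_name(self, name: str) -> str:
        # Rewrite the first matching prefix; leave other names untouched.
        for old, new in self.orig_to_new_prefix.items():
            if name.startswith(old):
                return new + name[len(old):]
        return name
```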
```diff
-# 4. Execute spec decode
-logits = self.model.compute_logits(hidden_states)
+logits = None
+if hasattr(self.model, "is_pooling_model") and self.model.is_pooling_model:
+    pass
+else:
+    # 4. Execute spec decode
+    logits = self.model.compute_logits(hidden_states)
```
Wouldn't it be better to put this inside compute_logits? Inside compute_logits you have access to self.
Every model has its own compute_logits, so this would have to be written in each one. This is a temporary workaround; it will be removed in the next PR.
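The reviewer's idea of folding the pooling check into compute_logits could look like this base-class sketch (class names here are illustrative, not FastDeploy's actual hierarchy):

```python
class ModelBase:
    is_pooling_model = False  # subclasses override

    def compute_logits(self, hidden_states):
        # Pooling models produce no logits; short-circuiting here means
        # callers no longer need the hasattr(...) check.
        if self.is_pooling_model:
            return None
        return self._compute_logits_impl(hidden_states)

    def _compute_logits_impl(self, hidden_states):
        raise NotImplementedError


class DemoGenerativeModel(ModelBase):
    def _compute_logits_impl(self, hidden_states):
        return [h * 2 for h in hidden_states]  # toy stand-in for an lm_head


class DemoPoolingModel(ModelBase):
    is_pooling_model = True
```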
fastdeploy/worker/worker_process.py (Outdated)
```python
def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]:

    def _parse_type(val: str) -> T:
        try:
            return return_type(val)
        except ValueError as e:
            raise argparse.ArgumentTypeError(f"Value {val} cannot be converted to {return_type}.") from e

    return _parse_type


def optional_type(return_type: Callable[[str], T]) -> Callable[[str], Optional[T]]:

    def _optional_type(val: str) -> Optional[T]:
        if val == "" or val == "None":
            return None
        return parse_type(return_type)(val)
```
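As a self-contained usage sketch, helpers of this shape plug into argparse as `type=` callables; the flag name below is illustrative, not necessarily one the PR adds:

```python
import argparse
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]:
    def _parse_type(val: str) -> T:
        try:
            return return_type(val)
        except ValueError as e:
            raise argparse.ArgumentTypeError(f"Value {val} cannot be converted to {return_type}.") from e
    return _parse_type


def optional_type(return_type: Callable[[str], T]) -> Callable[[str], Optional[T]]:
    def _optional_type(val: str) -> Optional[T]:
        # "" and "None" on the command line both mean "no value"
        if val == "" or val == "None":
            return None
        return parse_type(return_type)(val)
    return _optional_type


parser = argparse.ArgumentParser()
parser.add_argument("--max-num-seqs", type=optional_type(int), default=None)
args = parser.parse_args(["--max-num-seqs", "None"])
```

Wrapping the conversion in ArgumentTypeError makes argparse print a clean usage error instead of a traceback when the value is malformed.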
```python
self.runner = "auto"
self.convert = "auto"
self.pooler_config: Optional["PoolerConfig"] = field(init=False)
self.override_pooler_config: Optional[Union[dict, "PoolerConfig"]] = None
self.revision = None
```
Why not add a separate PoolerConfig and move runner/convert etc. into it?
DispatchPooler is created in pooler.py; the ResolvedPoolingConfig inside it needs pooler_config in its from_config.
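For context, a separate PoolerConfig as discussed here might look like the dataclass below; the field names are guesses at what such a config could hold, not the PR's actual fields:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolerConfig:
    # Hypothetical fields, for illustration only.
    pooling_type: str = "LAST"        # e.g. "MEAN", "LAST", "CLS"
    normalize: Optional[bool] = None  # L2-normalize embeddings if True
```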
```python
def registry(self):
    from fastdeploy.model_executor.models.model_base import ModelRegistry

    return ModelRegistry()
```
Why does model_config return a ModelRegistry?
The methods I wrote are not class methods; the old ones are still class methods.
fastdeploy/model_executor/utils.py (Outdated)

```python
from fastdeploy.utils import get_logger

logger = get_logger("utils", "utils.log")
```
fastdeploy/model_executor/utils.py (Outdated)

```python
def is_pin_memory_available() -> bool:
    pass
```
fastdeploy/worker/worker_process.py (Outdated)

```python
T = TypeVar("T")
```
```python
@ModelRegistry.register_model_class(
    architecture="Qwen2_5_VLForConditionalGeneration",
    module_path="qwen2_5_vl",
```
I recall the module_path written before was qwen2_5_vl.qwen2_5_vl — is it no longer necessary to specify the directory?
Force-pushed from 5f0ecb6 to 27ec018.
Pooling model support spans four modules in total: ModelConfig, Model Loader, and Model Runner, the latter split into model warmup and model execution.
This PR completes ModelConfig and ModelLoader; currently only runner set to pooling with convert set to embed is supported.
Completed in this PR:
1. The service can be started with runner set to pooling; convert can also be passed as embed, and when it is omitted the convert type is inferred from the model files.
2. Detection of whether a model is generative or a pooling model; if it is generative and runner is set to pooling, model conversion is required.
3. Model conversion: a generative model can be converted into a pooling model by deleting the ParallelLMHead weights, replacing the architectures suffix with ForEmbedding, and adding a DispatchPooler layer; pooling_type determines which pooling layer is used.
4. Qwen3-Embedding-0.6B loads successfully on a single card; multi-card still fails with an error.
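The architectures rewrite in step 3 can be sketched as below; the function name and the list of generative suffixes are assumptions for illustration, not the PR's actual code:

```python
def convert_architectures(architectures):
    """Rewrite generative architecture names to their embedding variants,
    e.g. Qwen3ForCausalLM -> Qwen3ForEmbedding."""
    generative_suffixes = ("ForCausalLM", "ForConditionalGeneration")
    converted = []
    for arch in architectures:
        for suffix in generative_suffixes:
            if arch.endswith(suffix):
                arch = arch[: -len(suffix)] + "ForEmbedding"
                break
        converted.append(arch)
    return converted
```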
To do:
1. Model Runner:
   1. model warmup phase
   2. model execution phase
To resolve:
Loading Qwen3-Embedding-0.6B with tp>1 raises an error.
Launching a pooling task:
Qwen3-0.6B generative model conversion process: internally, convert is resolved to embed or to reward/score (the latter two are not yet supported). The generative model is converted into an embedding model by deleting the lm_head weights, changing the architectures suffix to ForEmbedding, and adding a DispatchPooler layer; after conversion, self.model is as shown in the image below.
