[Intel HPU] Support intel hpu platform #4161
Jiang-Jia-Jun merged 13 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
| try:
|     # assert len(paddle.static.cuda_places()) > 0
|     return True
| except Exception as e:
This check doesn't seem to work.
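The review point is valid: with the assertion commented out, the try block unconditionally returns True, so the except branch can never fire. A minimal sketch of a check that actually probes the device (the helper name and the probe are assumptions, not FastDeploy's actual code):

```python
def is_intel_hpu_available() -> bool:
    """Probe whether the intel_hpu custom device is usable.

    Illustrative sketch only: not FastDeploy's actual check.
    """
    try:
        import paddle  # assumed installed with the custom-device plugin
        # Report True only if the custom device type is actually registered,
        # instead of unconditionally returning True inside the try block.
        return "intel_hpu" in paddle.device.get_all_custom_device_type()
    except Exception:
        return False
```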
| # PACKAGE = "fastdeploy.model_executor.ops.intel_hpu"
| PACKAGE = "paddlenlp_ops"
|
| import_custom_ops(PACKAGE, "paddlenlp_ops", globals())
Shouldn't this be fastdeploy.model_executor.ops.intel_hpu instead of paddlenlp_ops?
Is this because of the naming convention of the ops implementation in custom device?
Yes, the real custom ops come from PaddleCustomDevice; we just rename them in FastDeploy.
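The renaming described above (importing the PaddleCustomDevice ops package and re-exporting its symbols under a FastDeploy namespace) can be sketched roughly like this; the signature and behavior are assumptions, not FastDeploy's actual import_custom_ops:

```python
import importlib


def import_custom_ops(package: str, fallback: str, dest: dict) -> bool:
    """Re-export a custom-ops package's public symbols into dest.

    Illustrative sketch of the renaming described above; FastDeploy's
    real helper may differ.
    """
    for name in (package, fallback):
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        # Copy public symbols so callers can use them under the new namespace.
        dest.update({k: v for k, v in vars(mod).items() if not k.startswith("_")})
        return True
    return False
```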
| @@ -0,0 +1,21 @@
| # Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
| #
| raise NotImplementedError
|
| class AttentionBackend_HPU(AttentionBackend):
Would it be better to move this class to fastdeploy/model_executor/layers/attention/hpu_attn_backend.py?
fastdeploy/engine/args_utils.py
| "--enable-tensor-or-expert-parallel",
| action='store_true',
| default=EngineArgs.enable_tensor_or_expert_parallel,
| help="Enable tensor parallelism for non-MoE and expert parallelism for MoE.")
Could we enable TP + EP by setting --enable-expert-parallel and --tensor-parallel-size without adding a new argument?
Currently EP is bound to DP, so we can't enable TP + EP with the existing parameters:
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/config.py#L316-L318
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/model_executor/layers/moe/moe.py#L132-L134
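The constraint described above can be sketched as a small validation function; the name and exact checks are illustrative assumptions based on this discussion, not FastDeploy's actual validation code:

```python
def check_moe_parallel_config(tp_size: int, ep_size: int, dp_size: int,
                              enable_expert_parallel: bool) -> None:
    """Sketch of the constraints referenced in the linked code
    (illustrative only, not FastDeploy's implementation)."""
    if enable_expert_parallel:
        # In current FastDeploy, EP is bound to DP: EP size equals DP size.
        if ep_size != dp_size:
            raise ValueError("EP size must equal DP size")
        # The MoE layer forbids enabling TP and EP at the same time.
        if tp_size > 1 and ep_size > 1:
            raise ValueError("TP and EP cannot both be enabled for MoE")
```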
fastdeploy/worker/worker_process.py
| parallel_config.engine_worker_queue_port = parallel_config.engine_worker_queue_port[
|     parallel_config.local_data_parallel_id
| ]
All CI jobs fail at this line: TypeError: 'int' object is not subscriptable. We need to solve this first and then see if there are any other problems.
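One way to avoid that TypeError is to only index the port when the config actually holds a per-DP-rank list. This is an illustrative sketch of such a guard; the helper name is an assumption, not FastDeploy's actual fix:

```python
def select_engine_worker_queue_port(port, local_dp_id: int) -> int:
    """Pick this rank's queue port whether the config holds a single int
    or a per-DP-rank list. Sketch of a guard against the
    "'int' object is not subscriptable" failure; names are assumptions."""
    if isinstance(port, (list, tuple)):
        return port[local_dp_id]
    return port  # a single shared port: nothing to index
```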
| @@ -0,0 +1,314 @@
| """
There is a backends folder under the layers directory that holds the per-device layer implementations; please move the attention and MoE implementations into that folder.
As requested, they have been moved to the backends directory.
| elif current_platform.is_intel_hpu():
|     self.forward = self.forward_intel_hpu
The name forward_cuda may no longer be a great fit, but it should be reusable here since the logic is the same.
Changed to reuse forward_cuda.
| elif current_platform.is_intel_hpu():
|     self.forward = self.forward_intel_hpu
How does this differ from the other hardware platforms? Why does it need dedicated logic instead of being abstracted into a few ops and then calling forward_cuda?
We currently use a fused implementation because it performs better on our platform; we will consider splitting it up later, provided performance is not affected.
| from fastdeploy.platforms import current_platform
|
| def reload_ep_checkpoint(model_path: str, fd_config: FDConfig, state_dict: dict, return_numpy: bool = False):
Why was the model-loading code modified here? Is it because a non-official model is being used?
The model is not modified; it is still the official model. The change is only to support model loading in TP+EP mode.
In the TP+EP mode we support, the dense part uses TP while the MoE part uses neither TP nor DP, only EP (EP size = TP size). So when TP is configured, the MoE weights are by default also sliced in TP fashion at load time. What reload_ep_checkpoint does is first delete those TP-sliced MoE weights, then re-partition each complete weight along the expert dimension across the different cards.
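The expert-dimension repartitioning described above (EP size = TP size, each rank owning a disjoint subset of experts) can be sketched as follows; the function name and contiguous assignment are illustrative assumptions, not FastDeploy's reload_ep_checkpoint implementation:

```python
def expert_ids_for_rank(num_experts: int, ep_size: int, rank: int) -> list:
    """Assign each rank a contiguous slice of experts along the expert
    dimension. Illustrative sketch of the partitioning scheme described
    above, not FastDeploy's actual code."""
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    per_rank = num_experts // ep_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))
```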
fastdeploy/config.py
| self.expert_parallel_size = 1  # EP degree
| self.data_parallel_size = 1  # DP degree
| self.enable_expert_parallel = False
| self.enable_tensor_or_expert_parallel = False
Can't this be determined by combining existing fields such as enable_expert_parallel, expert_parallel_size, and tensor_parallel_size? Must we add a new field to the user-facing interface?
Currently in FD, EP is bound to DP (EP size equals DP size), and the MoE layer forbids enabling TP and EP at the same time, so adding a parameter is the best option for supporting TP + EP:
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/model_executor/layers/moe/moe.py#L132-L134
Is the purpose of this parameter to enable TP + EP parallelism simultaneously for the MoE part?
The dense part uses TP; the MoE part uses EP (EP size = TP size).
fastdeploy/engine/args_utils.py
| cache_cfg = CacheConfig(all_dict)
| load_cfg = LoadConfig(all_dict)
| parallel_cfg = ParallelConfig(all_dict)
| cache_cfg.enc_dec_block_num = self.static_decode_blocks
It could be better to set this value as in https://github.com/PaddlePaddle/FastDeploy/blob/release/2.2/fastdeploy/config.py#L899 to avoid impact on other hardware.
It's not only for a specific platform; it may be a bug: the static_decode_blocks parameter in EngineArgs can't be passed to cache_cfg even on GPUs, because cache_cfg has no static_decode_blocks field, only enc_dec_block_num.
It does seem that static_decode_blocks is not being passed to cache_cfg. Could you move the per-platform enc_dec_block_num setting into this file? Since this line runs after the cache_cfg initialization, the default value of 2 may cause errors, e.g. on Iluvatar.
After rebasing onto the latest code, we can use FD_ENC_DEC_BLOCK_NUM to solve this problem; I have removed this line.
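Reading the value from the FD_ENC_DEC_BLOCK_NUM environment variable mentioned above could look roughly like this; the helper function itself is an illustrative sketch (the default of 2 mirrors this discussion), not FastDeploy's actual code:

```python
import os


def get_enc_dec_block_num(default: int = 2) -> int:
    """Read enc_dec_block_num from the FD_ENC_DEC_BLOCK_NUM environment
    variable, falling back to the default. Illustrative sketch only."""
    return int(os.getenv("FD_ENC_DEC_BLOCK_NUM", default))
```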
fastdeploy/worker/worker_process.py
| else:
|     num_experts = model_config.moe_num_experts
|
| num_experts_per_rank = num_experts // parallel_config.tensor_parallel_size
The current FD logic is that if EP is enabled, experts are partitioned by dp_size together with --enable-expert-parallel. Similarly, we can partition experts by tp_size together with enable_tensor_or_expert_parallel to support the TP+EP mode (dense part uses TP, MoE part uses EP with EP size = TP size).
The name enable_tensor_or_expert_parallel does not feel very clear. Could this dense-TP / MoE-EP partitioning follow the naming used by open-source frameworks such as vLLM/SGLang? As it stands it is rather confusing.
The naming is indeed problematic. We have removed the related code from this PR for now and will submit a new PR once it is refined.
@zoooo0820 @carryyu @YuanRisheng @gzy19990617, we have removed the TP+EP mode for now and will merge it separately after it is refined.
| @dataclass
| class ForwardMeta_HPU:
Could the naming be kept consistent with the other hardware backends above, i.e. HPUForwardMeta?
Renamed to HPUForwardMeta.
FastDeploy has completed adaptation of the ERNIE 4.5 model on the Intel HPU platform.
Dependencies:
Gaudi software: 1.22.0
PaddlePaddle: 3.1.1
PaddleCustomDevice: latest develop branch
Support for more models and further performance optimizations will continue to be updated.