Motivation
Hi, SGLang folks! This is Mingfei from the Intel PyTorch team; our team helps optimize PyTorch performance on CPU, and I am also the PyTorch module maintainer for CPU performance. We would like to contribute to SGLang for CPU enabling and performance optimization.
Targets
Our primary target is to optimize SGLang performance on Intel Xeon Scalable Processors (x86 server CPUs).
- Optimization will focus on Xeon with Intel® Advanced Matrix Extensions (AMX) support, including Sapphire Rapids (4th gen), Emerald Rapids (5th gen), and Granite Rapids (6th gen).
- Native implementations or fallbacks will be provided for CPUs with other ISAs so that they remain functional.
- Provide good performance per dollar.
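The tiered plan above (AMX first, AVX-512 next, a native fallback otherwise) implies a runtime ISA dispatch. A minimal sketch of what that selection could look like, assuming Linux `/proc/cpuinfo` flag names; the function names here are illustrative, not SGLang's actual API:

```python
# Hypothetical dispatch sketch: choose the best kernel tier from CPU feature
# flags. Tiers mirror the plan above: AMX > AVX-512 > generic fallback.

def select_isa_level(cpu_flags):
    """Return the best supported kernel tier for a set of CPU feature flags."""
    flags = set(cpu_flags)
    if {"amx_tile", "amx_bf16"} <= flags:
        return "amx"      # Sapphire/Emerald/Granite Rapids path
    if "avx512f" in flags:
        return "avx512"   # AVX-512 intrinsics path
    return "generic"      # native fallback for other ISAs

def read_cpu_flags(path="/proc/cpuinfo"):
    """Parse the 'flags' line from /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []
```

For example, `select_isa_level(["avx512f", "avx2"])` selects the AVX-512 path, while an older CPU with only AVX2 falls back to the generic implementation.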
Limitations
- Kernels are written with AVX-512 and AMX-BF16 intrinsics, which require GCC 11 or above.
- BFloat16 and Float16 will both be enabled on CPU, but we are only focusing on BFloat16 performance optimization at the current stage; Float16 optimization will be added later on.
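For intuition on the BFloat16 focus: BF16 is simply FP32 with the low 16 mantissa bits dropped, which is why it maps so directly onto AMX-BF16 tiles. A small sketch of the standard round-to-nearest-even FP32-to-BF16 conversion (pure Python, for illustration only; real kernels do this in hardware):

```python
import struct

def fp32_to_bf16(x):
    """Round a float to BFloat16 (round-to-nearest-even) and return the
    value it represents, showing the precision BF16 kernels operate at."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # BF16 keeps the top 16 bits of the FP32 encoding; add a rounding bias
    # so ties go to even, then truncate.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bf16 = (bits + rounding_bias) >> 16
    # Re-expand to FP32 by zero-filling the dropped mantissa bits.
    (y,) = struct.unpack("<f", struct.pack("<I", bf16 << 16))
    return y
```

For example, `fp32_to_bf16(3.14159265)` yields `3.140625`: only 8 mantissa bits survive, but the FP32 exponent range is preserved.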
Schedule for 25Q1
We will focus on the DeepSeek series at the moment, to align with our internal development requirements, and extend the model coverage later on.
Generic enabling and optimizations for SGLang
- CPU device enabling. We intend to enable the CPU device with the torch native backend first, and then gradually replace the performance-critical components with C++ intrinsics kernels. Enable CPU device on SGLang #2806
- fused kernels for `rms_norm`, `silu_and_mul`, sampling, and so on.
- radix attention kernels for extend and decode.
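To make the fused-kernel targets concrete, here are minimal pure-Python reference implementations of two of them; the C++ versions would fuse these element-wise steps into a single pass over the data. These are illustrative references, not SGLang's actual kernels:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight.
    A fused kernel computes the reduction and the scaling in one pass."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def silu_and_mul(x):
    """Reference fused SiLU-and-mul: split the input in half along the last
    dim, apply SiLU to the first half, multiply by the second half."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai / (1.0 + math.exp(-ai)) * bi for ai, bi in zip(a, b)]
```

Fusing matters on CPU because both ops are memory-bound: doing the normalization (or activation) and the multiply in one traversal halves the memory traffic compared with separate kernel launches.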
DeepSeek performance optimizations
(We are currently mapping the work from DeepSeek Multi-head Latent Attention (MLA) Throughput Optimizations.)
- MLA decoding kernel optimization with head blocking.
- DeepSeekMoE (FusedMoE)
- fp8 kv cache (experimental)
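As background for the FusedMoE item: DeepSeek-style MoE layers route each token to a small subset of experts via a gating network. A minimal sketch of one common softmax-then-top-k routing pattern with renormalization (illustrative only; the exact DeepSeek gating and its fused C++ implementation differ in detail):

```python
import math

def topk_gating(logits, k=2):
    """Pick the top-k experts for one token from gating logits and
    renormalize their softmax probabilities to sum to 1."""
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in topk)
    return [(i, probs[i] / norm) for i in topk]
```

A fused MoE kernel folds this routing together with the expert GEMMs, so that each token's activations are gathered, dispatched, and combined without materializing intermediate buffers per expert.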
Tensor Parallel
- Map TP across the multiple sockets (NUMA nodes) of a single CPU node
- EPMoE
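The NUMA mapping above treats each socket as one TP rank, so weight matrices are sharded per socket and each rank touches only memory local to its node. A small sketch of the shard-size computation such a mapping might use (hypothetical helper, not SGLang's actual code):

```python
def shard_dim(dim, num_sockets):
    """Split a weight dimension across sockets (one TP rank per NUMA node),
    giving earlier ranks the remainder so shard sizes differ by at most 1."""
    base, rem = divmod(dim, num_sockets)
    return [base + (1 if r < rem else 0) for r in range(num_sockets)]
```

For example, a hidden dimension of 7168 on a 2-socket Xeon splits into two 3584-wide shards, each allocated on its own NUMA node so the per-socket GEMMs never cross the inter-socket link except for the final all-reduce.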
We hope to help more customers build a better user experience when deploying SGLang on CPU devices. Any feedback is welcome, thanks!