Closed
Here is the development roadmap for 2025 Q1 and Q2. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q4 roadmap can be found in #1487.
DeepSeek R1 optimization
Performance
- Support speculative decoding
- EAGLE optimization (add small vocab table for EAGLE's draft model, #3822)
- Reference-based (refactor EAGLE 2, #3269)
- Align with the speed of grok
- P/D Disaggregation
- Bump internal codes
- Mooncake Integration
Parallelism
- Support sequence parallelism (add initial support for sequence parallelism, #1436). Related paper
- Support pipeline parallelism.
- Optimize expert parallelism + data parallelism for DeepSeek models.
- Optimize expert parallelism for Qwen models.
- Overlap communication in tensor parallelism. @ZhuohaoL @fzyzcjy
Hardware Optimizations
- AMD optimizations. @HaiShaw @yiakwy-xpu-ml-framework-team
- Intel XPU optimization. @shanyu-sys
Model Coverage
- Multi-modal models
- merge all the PRs from @mickqian @yizhang2077
- support streaming models @yiranyyu
- VLM optimization @Lyken17
- Embed models
- encoder models @yichuan520030910320
New features
- Performance optimizations for multi-LoRA serving @Fridge003
- RLHF support with veRL team @zhaochenyang20
- GRPO of trl @jhinpan
- Optimize function calling and constrained decoding @minleminzui
Quantization
- unsloth model support @guapisolo @XueyingJia
Server API
- Support taking embeddings directly as inputs (Generation Inputs: input_embeds, #745)
- Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend.
Observability
- Open-to-use Grafana / Prometheus @PopSoda2002 @ziliangpeng
Others