feat: Refactor DeepSeekV3 function call #5908

Merged
merged 3 commits into main from chang/deepseek-v3-tool-call on May 2, 2025

Conversation

@CatherineSue (Collaborator) commented Apr 30, 2025

Motivation

Instead of continually editing the prompt engineering for DeepSeek-V3 in adapter.py, this PR provides a dedicated chat template.

Implements: #5224 (comment)

Modifications

  • Add an adapted tool call template for DeepSeek-V3 (credits to @finger92's original improved chat template); a rough before/after sketch follows this list
  • Remove prompt engineering for the DeepSeek-V3 tool call in adapter.py
  • Update docs/reference/deepseek.md
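
For context, a rough before/after of the two approaches. This is an illustrative sketch only: legacy_build_system_prompt and its instruction string are hypothetical stand-ins, not the actual adapter.py code; render_prompt assumes a Hugging Face tokenizer.

# Before (hypothetical sketch of the removed approach): adapter.py spliced a
# hand-written tool instruction into the system prompt at request time.
def legacy_build_system_prompt(system_prompt: str, tools_json: str) -> str:
    return (
        f"{system_prompt}\n\n"
        "You may call the following tools:\n"
        f"{tools_json}"
    )

# After: the tool-call formatting lives in the Jinja chat template, so the
# serving code only applies the template and does no model-specific string
# surgery in Python.
def render_prompt(tokenizer, messages, tools, template: str) -> str:
    return tokenizer.apply_chat_template(
        messages,
        tools=tools,
        chat_template=template,
        tokenize=False,
        add_generation_prompt=True,
    )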

Checklist

- Add an adapted tool call template for deepseekv3
- Remove prompt engineering for deepseekv3 tool call in adapter.py
- Update `docs/reference/deepseek.md`
@CatherineSue (Collaborator, Author) commented Apr 30, 2025

Also fixes #5814.

Test with the prompt in the issue:

{"id":"e5f041968f104ba3b964fe862c7096b7","object":"chat.completion","created":1745994924,"model":"sgl-model","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":null,"tool_calls":[{"id":"2","index":null,"type":"function","function":{"name":"get_current_weather","arguments":"{\"city\": \"Tokyo\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":254,"total_tokens":275,"completion_tokens":21,"prompt_tokens_details":null}}#        

Test with multiple system messages
request:

curl "http://127.0.0.1:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0,
    "max_tokens": 100,
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "query_weather",
          "description": "Get weather of a city, the user should supply a city first",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {
                "type": "string",
                "description": "The city, e.g. Beijing"
              }
            },
            "required": ["city"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with access to weather tools."
      },
      {
        "role": "user",
        "content": "Hows the weather like in Qingdao today"
      },
      {
        "role": "system",
        "content": "Please always answer in concise sentences."
      }
    ]
  }'

response:

{"id":"4582a7a1775a49578bffcb3c672b444c","object":"chat.completion","created":1745995026,"model":"deepseek-ai/DeepSeek-V3-0324","choices":[{"index":0,"message":{"role":"assistant","content":null,"reasoning_content":null,"tool_calls":[{"id":"0","index":null,"type":"function","function":{"name":"query_weather","arguments":"{\"city\": \"Qingdao\"}"}}]},"logprobs":null,"finish_reason":"tool_calls","matched_stop":null}],"usage":{"prompt_tokens":133,"total_tokens":155,"completion_tokens":22,"prompt_tokens_details":null}}

@CatherineSue (Collaborator, Author):

Debug info for the chat message after applying the chat template:

[Screenshot: rendered prompt after applying the chat template, 2025-04-29 11:38 PM]
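
The rendered prompt above can be reproduced locally with something like the following minimal sketch. Assumptions: the adapted template is saved as tool_chat_template_deepseekv3.jinja (a hypothetical local path), and the messages/tools mirror the curl request above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3-0324")
with open("tool_chat_template_deepseekv3.jinja") as f:
    template = f.read()

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to weather tools."},
    {"role": "user", "content": "Hows the weather like in Qingdao today"},
    {"role": "system", "content": "Please always answer in concise sentences."},
]
tools = [{
    "type": "function",
    "function": {
        "name": "query_weather",
        "description": "Get weather of a city, the user should supply a city first",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "The city, e.g. Beijing"}},
            "required": ["city"],
        },
    },
}]

# tokenize=False returns the raw prompt string so it can be inspected directly.
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, chat_template=template,
    tokenize=False, add_generation_prompt=True,
)
print(prompt)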

@Frank-Jie (Contributor):

Would you mind running a test without a system prompt in the messages?
Like this:

curl "http://127.0.0.1:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0,
    "max_tokens": 500,
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "query_weather",
          "description": "Get weather of a city, the user should supply a city first",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {
                "type": "string",
                "description": "The city, e.g. Beijing"
              }
            },
            "required": ["city"]
          }
        }
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Hows the weather like in Qingdao today"
      }
    ]
  }'

Sometimes when I run the test, it outputs a lot of unknown characters, like: "ꯃꯂꯂꯂ\nꯂ\nꯂ\nꯂ\nꯂ"

{% set ns.system_prompt = ns.system_prompt + '\n\n' + tool_ns.text %}
{% endif %}

{{ bos_token }}
A Contributor left a review suggestion on this snippet:

Suggested change: remove the {{ bos_token }} line.

@CatherineSue (Collaborator, Author) replied on Apr 30, 2025:
I think we need to keep bos_token here, as the original chat template does.

@CatherineSue (Collaborator, Author) added:

If I remove bos_token, I sometimes get a response with random characters:
'<|tool▁calls▁begin|>晾<|tool▁calls▁begin|>ราบ房租 العصيان:\n\n```json\n{"city":"Nanjing"}\n```<|tool▁call▁end|><|tool▁calls▁end|>'
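
One quick way to sanity-check the BOS handling (a minimal sketch, assuming a Hugging Face tokenizer and a prompt string rendered from the template; a missing or duplicated BOS is a plausible cause of this kind of garbled output):

# Sketch: confirm the rendered prompt starts with bos_token and that
# tokenizing it (without adding special tokens again) yields exactly one BOS.
def check_bos(tokenizer, prompt: str) -> None:
    assert prompt.startswith(tokenizer.bos_token), "template dropped bos_token"
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    n_bos = ids.count(tokenizer.bos_token_id)
    assert n_bos == 1, f"expected exactly one BOS, found {n_bos}"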

@CatherineSue (Collaborator, Author) commented Apr 30, 2025

Would you mind running a test without a system prompt in the messages? Like this:

Sometimes when I run the test, it outputs a lot of unknown characters, like: "ꯃꯂꯂꯂ\nꯂ\nꯂ\nꯂ\nꯂ"

I ran the following script and it failed 0 times out of 100 (same as the one from #5224 (comment)).

import requests, json, random, re

test_func = {
    "type": "function",
    "function": {
        "name": "query_weather",
        "description": "Get weather of an location, the user shoud supply a location first",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city, e.g. Beijing"
                }
            },
            "required": [
                "city"
            ]
        }
    }
}


req_base_url = "http://127.0.0.1:8080"
cities = [
    "Beijing", 
    "Chongqing",
    "Chengdu",
    "Dalian",
    "Guangzhou",
    "Hangzhou",
    "Harbin",
    "Hefei",
    "Kunming",
    "Lanzhou",
    "Nanjing",
    "Qingdao",
    "Shanghai",
    "Shenzhen",
    "Suzhou",
    "Tianjin",
    "Wuhan",
    "Xi'an",
    "Xiamen",
    "Zhengzhou"
]
user_content_tp = "Hows the weather like in {} today"
req_body = {
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ],
    "temperature": 0,
    "max_tokens": 100,
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "tools": [
        test_func
    ]
}

total_failed = 0
for i in range(100):
    req_body["messages"][0]["content"] = user_content_tp.format(random.choice(cities))
    res = requests.post(req_base_url + "/v1/chat/completions", json=req_body)
    res = res.json()
    if len(res["choices"]) > 0:
        if len(res["choices"][0]["message"]["tool_calls"]) > 0:
            print("function call successfull: " + json.dumps(res["choices"][0]["message"]["tool_calls"][0]))
        else:
            total_failed += 1
            print("function call failed: " + json.dumps(res))
            
    # flush cache
    requests.get(req_base_url + "/flush_cache")
print(f"total_failed: {total_failed}")

@zhaochenyang20 (Collaborator):

@CatherineSue is this ready to merge? Let me rerun the CI.

@CatherineSue (Collaborator, Author):

@zhaochenyang20 let me do some more sanity checks with multi-turn tool calls.

For single-turn, I ran the above script 1000 times and it failed 0 times.

@CatherineSue (Collaborator, Author):

The single-turn function call is ready, but multi-turn might need some changes.

@zhaochenyang20 Let's wait for #5907 to merge first, since the DeepSeek-V3 official chat template relies on tool_calls in the assistant role message, which could be addressed in that PR. I want to do a more thorough test after that PR lands, unless it is urgent to merge this one.
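
For reference, this is the message shape in question: the OpenAI-style assistant turn that the template has to render when the model called a tool. The values below are illustrative, matching the tests further down.

assistant_turn = {
    "role": "assistant",
    "content": None,  # no text content when the turn is purely a tool call
    "tool_calls": [
        {
            "id": "0",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": '{"city": "Tokyo", "state": "TY", "unit": "fahrenheit"}',
            },
        }
    ],
}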

- Copy logic of tool_calls in assistant role message from original chat template (otherwise assistant can't process tool call outputs correctly)
- Improve system prompt to instruct the model to enforce tool call format
@CatherineSue (Collaborator, Author):

I tried patching #5907 onto this PR and ran some tests on multi-turn.

Test 1: Normal multi-turn
Purpose: the model should be able to use the final tool message to generate an answer.
Script:

adapted from [function_calling.ipynb](https://github.com/sgl-project/sglang/blob/main/docs/backend/function_calling.ipynb) 

import openai
import json
import random
import time
import requests

client = openai.Client(base_url="http://localhost:8080/v1", api_key="xxxxxx")
model_name = client.models.list().data[0].id

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "state": {"type": "string", "description": "State abbreviation"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city", "state", "unit"]
            },
        },
    }
]

cities = [
    {"city": "Birmingham", "state": "AL"},
    {"city": "Anchorage", "state": "AK"},
    {"city": "Phoenix", "state": "AZ"},
    {"city": "Little Rock", "state": "AR"},
    {"city": "Los Angeles", "state": "CA"},
    {"city": "Denver", "state": "CO"},
    {"city": "Bridgeport", "state": "CT"},
    {"city": "Wilmington", "state": "DE"},
    {"city": "Miami", "state": "FL"},
    {"city": "Atlanta", "state": "GA"},
    {"city": "Honolulu", "state": "HI"},
    {"city": "Boise", "state": "ID"},
    {"city": "Chicago", "state": "IL"},
    {"city": "Indianapolis", "state": "IN"},
    {"city": "Des Moines", "state": "IA"},
    {"city": "Wichita", "state": "KS"},
    {"city": "Louisville", "state": "KY"},
    {"city": "New Orleans", "state": "LA"},
    {"city": "Portland", "state": "ME"},
    {"city": "Baltimore", "state": "MD"},
    {"city": "Boston", "state": "MA"},
    {"city": "Detroit", "state": "MI"},
    {"city": "Minneapolis", "state": "MN"},
    {"city": "Jackson", "state": "MS"},
    {"city": "Kansas City", "state": "MO"},
    {"city": "Billings", "state": "MT"},
    {"city": "Omaha", "state": "NE"},
    {"city": "Las Vegas", "state": "NV"},
    {"city": "Manchester", "state": "NH"},
    {"city": "Newark", "state": "NJ"},
    {"city": "Albuquerque", "state": "NM"},
    {"city": "New York", "state": "NY"},
    {"city": "Charlotte", "state": "NC"},
    {"city": "Fargo", "state": "ND"},
    {"city": "Columbus", "state": "OH"},
    {"city": "Oklahoma City", "state": "OK"},
    {"city": "Portland", "state": "OR"},
    {"city": "Philadelphia", "state": "PA"},
    {"city": "Providence", "state": "RI"},
    {"city": "Charleston", "state": "SC"},
    {"city": "Sioux Falls", "state": "SD"},
    {"city": "Nashville", "state": "TN"},
    {"city": "Houston", "state": "TX"},
    {"city": "Salt Lake City", "state": "UT"},
    {"city": "Burlington", "state": "VT"},
    {"city": "Virginia Beach", "state": "VA"},
    {"city": "Seattle", "state": "WA"},
    {"city": "Charleston", "state": "WV"},
    {"city": "Milwaukee", "state": "WI"},
    {"city": "Cheyenne", "state": "WY"},
    {"city": "Washington", "state": "DC"}
]

units = ["celsius", "fahrenheit"]

def get_current_weather(city, state, unit):
    if unit == "celsius":
        temp = random.randint(10, 40)
    else:
        temp = random.randint(50, 105)
    return f"The weather in {city}, {state} is {temp} degrees {unit}. It is partly cloudy.", temp

success = 0
fail = 0
num_runs = 100

for i in range(num_runs):
    try:
        loc = random.choice(cities)
        unit = random.choice(units)

        messages = [
            {"role": "system", "content": "You are a travel assistant."},
            {"role": "user", "content": f"I'm planning a trip to {loc['city']} in {loc['state']} state. What's the weather like in {unit}?"},
        ]

        response = client.chat.completions.create(
            model=model_name,
            messages=messages,
            tools=tools,
            stream=False,
            temperature=0.7,
            top_p=0.9,
            max_tokens=1000,
        )

        tool_call = response.choices[0].message.tool_calls[0]
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)
        assert name == "get_current_weather"

        tool_call_output, temp = get_current_weather(**args)
        
        tool_call_msg = {
            "id": "0",
            "type": "function",
            "function": {
                "name": name,
                "arguments": str(args)
            }
        }

        messages.append({"role": "assistant", "content": None, "tool_calls": [tool_call_msg]})
        messages.append({"role": "tool", "name": name, "tool_call_id": tool_call.id, "content": tool_call_output})

        final = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0,
            top_p=0.8,
            stream=False,
            tools=tools, # Results 2 has this line commented out
            # max_tokens=300,
        )

        content = final.choices[0].message.content
        if content and str(args["city"]).lower() in content.lower() and str(temp).lower() in content.lower():
            success += 1
        else:
            fail += 1

        print(f"[{i+1}] Success={success} / Fail={fail}\n Function call result: {tool_call_output} \n Answer: {content}\n")

        try:
            requests.post("http://localhost:8080/flush_cache")
        except Exception as e:
            print(f"Warning: failed to flush cache — {e}")

        time.sleep(1)

    except (requests.exceptions.ConnectionError, ConnectionError):
        print("Connection error — stopping early.")
        break
    except Exception as e:
        print(f"Unexpected error: {e}")
        fail += 1
        continue

print(f"Final Result: {success} success / {fail} failed out of {num_runs} attempts.")

Results 1:

[97] Success=97 / Fail=0
 Function call result: The weather in Wilmington, DE is 25 degrees celsius. It is partly cloudy. 
 Answer: The current weather in Wilmington, DE is 25°C and partly cloudy. Enjoy your trip!

[98] Success=98 / Fail=0
 Function call result: The weather in Honolulu, HI is 19 degrees celsius. It is partly cloudy. 
 Answer: The current weather in Honolulu, HI is 19°C and partly cloudy. Enjoy your trip!

[99] Success=99 / Fail=0
 Function call result: The weather in Newark, NJ is 56 degrees fahrenheit. It is partly cloudy. 
 Answer: The current weather in Newark, NJ is 56°F and partly cloudy. Enjoy your trip!

[100] Success=100 / Fail=0
 Function call result: The weather in Charleston, WV is 92 degrees fahrenheit. It is partly cloudy. 
 Answer: The current weather in Charleston, WV is 92°F and partly cloudy. Enjoy your trip!

Final Result: 100 success / 0 failed out of 100 attempts.

Results 2 (last turn doesn't pass tools):

[96] Success=96 / Fail=0
 Function call result: The weather in Charleston, SC is 21 degrees celsius. It is partly cloudy. 
 Answer: Currently, the weather in Charleston, SC is **21°C** and partly cloudy. It's a pleasant temperature for exploring the city! If you're planning outdoor activities, you might want to bring a light jacket for the evening, as temperatures can drop slightly. 

Would you like any recommendations for things to do based on this weather?

[97] Success=97 / Fail=0
 Function call result: The weather in Las Vegas, NV is 96 degrees fahrenheit. It is partly cloudy. 
 Answer: The current weather in Las Vegas, NV is **96°F** and partly cloudy. It's going to be a warm day, so be sure to stay hydrated and wear sunscreen if you're planning to be outdoors! Let me know if you'd like any other details for your trip.

[98] Success=98 / Fail=0
 Function call result: The weather in Nashville, TN is 29 degrees celsius. It is partly cloudy. 
 Answer: The current weather in Nashville, Tennessee, is **29°C (84°F)** with partly cloudy skies. It's a warm day, so light clothing and sunscreen would be a good idea if you're planning outdoor activities. Let me know if you'd like a forecast for your travel dates!

[99] Success=99 / Fail=0
 Function call result: The weather in Bridgeport, CT is 50 degrees fahrenheit. It is partly cloudy. 
 Answer: The current weather in Bridgeport, CT is **50°F** and partly cloudy. Enjoy your trip! If you'd like more details or a forecast for specific dates, let me know.

[100] Success=100 / Fail=0
 Function call result: The weather in Las Vegas, NV is 30 degrees celsius. It is partly cloudy. 
 Answer: The current weather in Las Vegas, Nevada is **30°C** and partly cloudy. Enjoy your trip! If you'd like more details or a forecast for specific dates, let me know.

Final Result: 100 success / 0 failed out of 100 attempts.

Test 2: Different scenarios of multi-turn conversation
Purpose: the model should correctly decide whether to use tool outputs or respond naturally.

Script:

import openai
import json

client = openai.Client(base_url="http://localhost:8080/v1", api_key="xxxxxx")
model_name = client.models.list().data[0].id

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "state": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "state", "unit"],
        },
    }
}]

def get_current_weather(city, state, unit):
    if not state:
        return "Error: State must be specified.", None
    temp = 25 if unit == "celsius" else 77
    return f"The weather in {city}, {state} is {temp} degrees {unit}. It is partly cloudy.", temp

test_cases = [
    {
        "name": "✅ Valid tool output, no re-call expected",
        "user": "I'm planning a trip to Paris. What's the weather like in celsius?",
        "tool_args": {"city": "Paris", "state": "FR", "unit": "celsius"},
        "expect_city": "Paris",
        "expect_temp": 25
    },
    {
        "name": "❌ Invalid tool output, should re-call",
        "user": "What's the weather in New York in state NY?",
        "tool_args": {"city": "New York", "state": "", "unit": "celsius"},
        "follow_up": "The state is NY",
        "expect_retry": True
    },
    {
        "name": "🔁 Multiple tool calls expected",
        "user": "What's the weather in Tokyo and London in fahrenheit?",
        "multi_tool_args": [
            {"city": "Tokyo", "state": "TY", "unit": "fahrenheit"},
            {"city": "London", "state": "UK", "unit": "celsius"},
        ],
        "expect_cities": ["Tokyo", "London"]
    },
    {
        "name": "➡️ Follow-up after tool output",
        "user": "What's the weather in Rome?",
        "tool_args": {"city": "Rome", "state": "RM", "unit": "celsius"},
        "follow_up": "What about weather in London?",
        "expect_follow": "London"
    }
]

for case in test_cases:
    print("=" * 80)
    print(f"Test: {case['name']}")
    
    messages = [
        {"role": "system", "content": "You are a travel assistant."},
        {"role": "user", "content": case["user"]}
    ]

    if case["name"].startswith("🔁 Multiple tool calls expected"):
        # First turn: ask the model
        messages = [
            {"role": "system", "content": "You are a travel assistant."},
            {"role": "user", "content": case["user"]}
        ]

        initial = client.chat.completions.create(
            model=model_name,
            messages=messages,
            tools=tools,
            stream=False,
            temperature=0,
            top_p=1,
        )

        tool_calls = initial.choices[0].message.tool_calls
        if not tool_calls or len(tool_calls) < 2:
            print("❌ Model did not emit multiple tool calls.")
            continue
        else:
            print(f"✅ Model emitted {len(tool_calls)} tool calls.")

        tool_outputs = []
        assistant_call = {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": tc.type,
                    "function": {
                        "name": tc.function.name,
                        "arguments": tc.function.arguments
                    }
                }
                for tc in tool_calls
            ]
        }

        messages.append(assistant_call)

        for tc in tool_calls:
            args = json.loads(tc.function.arguments)
            tool_result, temp = get_current_weather(**args)
            tool_outputs.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "name": tc.function.name,
                "content": tool_result
            })
            messages.append(tool_outputs[-1])

        # Ask final generation
        final = client.chat.completions.create(
            model=model_name,
            messages=messages,
            # tools=tools,
            stream=False,
            temperature=0.5,
            top_p=1,
            max_tokens=500,
        )

        output = final.choices[0].message.content
        print(f"Model output:\n{output}\n")

        if all(city.lower() in output.lower() for city in case["expect_cities"]):
            print("✅ All cities handled in the response.")
        else:
            print("❌ Missing one or more cities in the final answer.")

    else:
        tool_output, temp = get_current_weather(**case["tool_args"])
        tool_call_msg = {
            "id": "0",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": json.dumps(case["tool_args"])
            }
        }
        messages.append({"role": "assistant", "content": None, "tool_calls": [tool_call_msg]})
        messages.append({
            "role": "tool",
            "name": "get_current_weather",
            "tool_call_id": "0",
            "content": tool_output
        })

        if "follow_up" in case:
            messages.append({"role": "user", "content": case["follow_up"]})

        final = client.chat.completions.create(
            model=model_name,
            messages=messages,
            tools=tools,
            stream=False,
            temperature=0,
            top_p=1,
            max_tokens=500,
        )

        output = final.choices[0].message.content
        tool_calls = final.choices[0].message.tool_calls
        print(f"Model output:\n{output}\n")
        print(f"Model toolcalls:\n{tool_calls}\n")

    # Eval
    if "expect_retry" in case:
        if (output is not None and "state must be specified" in messages[-1]["content"].lower() and "new york" not in output.lower()) or tool_calls is not None:
            print("✅ Correctly did not fabricate answer.")
        else:
            print("❌ Failed to handle tool error properly.")

    elif "expect_cities" in case:
        if all(city.lower() in output.lower() for city in case["expect_cities"]):
            print("✅ All cities handled.")
        else:
            print("❌ Missing one or more city responses.")

    elif "expect_follow" in case:
        if (output is not None and case["expect_follow"].lower() in output.lower()) or tool_calls is not None:
            print("✅ Follow-up was understood.")
        else:
            print("❌ Missed follow-up context.")

    else:
        if case["expect_city"].lower() in output.lower() and str(case["expect_temp"]) in output:
            print("✅ Weather response looks good.")
        else:
            print("❌ Model ignored tool output or responded incorrectly.")

Result:

➜  sglang git:(chang/deepseek-v3-tool-call) ✗ python3 examples/function_tool_call_test2.py
================================================================================
Test: ✅ Valid tool output, no re-call expected
Model output:
The current weather in Paris is 25°C and partly cloudy. Enjoy your trip!

Model toolcalls:
None

✅ Weather response looks good.
================================================================================
Test: ❌ Invalid tool output, should re-call
Model output:
None

Model toolcalls:
[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"city": "New York", "state": "NY", "unit": "celsius"}', name='get_current_weather'), type='function', index=None)]

✅ Correctly did not fabricate answer.
================================================================================
Test: 🔁 Multiple tool calls expected
✅ Model emitted 2 tool calls.
Model output:
The current weather in **Tokyo, Japan** is **77°F** and partly cloudy.  

In **London, UK**, it's also **77°F** with partly cloudy conditions.  

Both cities have similar weather right now!

✅ All cities handled in the response.
✅ All cities handled.
================================================================================
Test: ➡️ Follow-up after tool output
Model output:
None

Model toolcalls:
[ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"city": "London", "state": "England", "unit": "celsius"}', name='get_current_weather'), type='function', index=None)]

✅ Follow-up was understood.

@zhaochenyang20 (Collaborator):

> I tried patching #5907 onto this PR and ran some tests on multi-turn. […]

Is it all set?

@CatherineSue (Collaborator, Author):

@zhaochenyang20 yes, we can merge this.

@zhaochenyang20 zhaochenyang20 merged commit 170d1f2 into main May 2, 2025
38 of 42 checks passed
@zhaochenyang20 zhaochenyang20 deleted the chang/deepseek-v3-tool-call branch May 2, 2025 04:28
tarinkk pushed a commit to tarinkk/sglang that referenced this pull request May 9, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request May 23, 2025
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)

* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)

* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)

* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)

* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)

* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)

* Fuse MLA set kv cache kernel (sgl-project#5748)

* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)

* [feature] support for roberta embedding models (sgl-project#5730)

* [fix] fix bench_one_batch_server (sgl-project#5607)

* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)

* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)

* Add Llama 4 to FA3 test (sgl-project#5509)

* [misc] more decode step log for batch_one_batch (sgl-project#5565)

* Handle JSONDecodeError while processing request data (sgl-project#5599)

* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)

* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)

* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)

* Add memory_saver check (sgl-project#4986)

Signed-off-by: Kebe <[email protected]>

* add switch to disable open api doc (sgl-project#3744)

Signed-off-by: congcongke <[email protected]>

* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)

* Fix eagle test case (sgl-project#5776)

* Split local attention test from fa3 test (sgl-project#5774)

* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)

* Simplify FA3 tests (sgl-project#5779)

* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)

* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)

* [CI] Tune threshold (sgl-project#5787)

* [CI] fix port conflicts (sgl-project#5789)

* [CI] Fix ci tests (sgl-project#5769)

* [PD]Reduce kv transfer threads (sgl-project#5791)

* [CI] Fix test case (sgl-project#5790)

* Add 8-GPU Test for Deepseek-V3  (sgl-project#5691)

Co-authored-by: Lianmin Zheng <[email protected]>

* Release v0.4.6 (sgl-project#5795)

* Update nightly-test.yml (sgl-project#5797)

* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)

* [Docs] update grafana setup guide in production metrics (sgl-project#5643)

Co-authored-by: NoahM <[email protected]>

* [Misc] add structure logging, write to file and log tracing for SGL Router

* Improve overlap scheduling (sgl-project#5788)

* Add Cutlass MLA attention backend (sgl-project#5390)

* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)

* Dockerfile.dev pip scikit_build_core (sgl-project#5807)

* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)

* Turn on overlap scheduler for multimodal models (sgl-project#5771)

* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)

* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)

* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)

* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)

Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)

* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)

* fused moe triton tuning script support qwen3 (sgl-project#5842)

* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)

* [PD] support pd fake transfer for warmup (sgl-project#5726)

* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)

* [Doc] Recover history of server_arguments.md (sgl-project#5851)

* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)

* [CI] test chunked prefill more (sgl-project#5798)

* ROCm: update AITER (sgl-project#5816)

* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)

Co-authored-by: sighingnow <[email protected]>

* [Fix] Missing bootstrap_port field (sgl-project#5823)

* feat: update is_fa3_default_architecture (sgl-project#5854)

* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)

* chore: bump v0.4.6.post1 (sgl-project#5845)

* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)

* simplify fused_moe config logging (sgl-project#5801)

* [CI] tune the test order to warmup the server (sgl-project#5860)

* Cutlass MLA decode - fix dtype error (sgl-project#5868)

* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)

* [Feature] support auto chat template (sgl-project#4949)

* Feat: support cuda graph for LoRA (sgl-project#4115)

Co-authored-by: Beichen Ma <[email protected]>

* Add qwen3 30b fused moe config (sgl-project#5859)

* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875)

Co-authored-by: pengcuo <[email protected]>

* Add A800 fused moe config for qwen3 30b (sgl-project#5880)

* [Misc] add service discovery for sgl router

* [fix]: PyO3 macOS linking and consolidate on tracing for logging

* chore: update Dockerfile (sgl-project#5894)

* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)

* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)

* chore: update CODEOWNERS (sgl-project#5895)

* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)

* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)

* Auto set draft model path for MTP (sgl-project#5793)

* [fix] relax mem_fraction_static for h200 (sgl-project#5893)

Co-authored-by: alcanerian <[email protected]>

* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)

* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)

* Add AMD MI300x Nightly Testing. (sgl-project#5861)

* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)

* Fix check_env script (sgl-project#5901)

* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)

* Bump Flashinfer to 0.2.5 (sgl-project#5870)

Co-authored-by: Yuhao Chen <[email protected]>

* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)

* Add A800 fused moe config for qwen3 235b (sgl-project#5900)

* Add sm_120 for blackwell (sgl-project#5903)

* [Feature] add support kimi vl model (sgl-project#5383)

Co-authored-by: wenju.li <[email protected]>

* support vlm benchmark profile (sgl-project#5905)

* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)

* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)

* [qwen3] support qwen3 ep moe (sgl-project#5917)

Co-authored-by: sleepcoo <[email protected]>

* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)

* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912)

Co-authored-by: zhyncs <[email protected]>

* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)

* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)

* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)

* [PP] Add pipeline parallelism (sgl-project#5724)

* Fix lora batch processing when input lora_path contains None (sgl-project#5930)

* add Thor & Spark (sgl-project#5915)

* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)

* fix: update model runner (sgl-project#5934)

* chore: bump v0.4.6.post2 (sgl-project#5939)

* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)

* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834)

Co-authored-by: luoyuan.luo <[email protected]>

* Remove extra contiguous (sgl-project#5953)

* Update ci test and doc for MTP api change (sgl-project#5952)

* docs: Fix Qwen model typo (sgl-project#5944)

Signed-off-by: JiangJiaWei1103 <[email protected]>

* Optimize a pad operation to accelerate 25us (sgl-project#5945)

* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)

* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)

* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)

* feat: Refactor DeepSeekV3 function call (sgl-project#5908)

* Remove token in token out in Native API (sgl-project#5967)

* Support InternVL3 (sgl-project#5350)

Co-authored-by: Mick <[email protected]>
Co-authored-by: Chayenne <[email protected]>

* Support MMMU benchmark for  InternVL (sgl-project#5968)

* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)

* Fix set kv cache multi-stream (sgl-project#5975)

* Overlap qk norm with two streams (sgl-project#5977)

* fix: only upgrade nccl for cu128 (sgl-project#5986)

* Fix Phi3 serving which was broke by earlier change (sgl-project#5991)

Co-authored-by: Lifu Huang <[email protected]>

* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)

* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)

* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012)

Signed-off-by: Lifu Huang <[email protected]>

* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)

* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)

* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)

* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)

* Update dev container config to support live code sync and improve docker setup guide   (sgl-project#6018)

Signed-off-by: Lifu Huang <[email protected]>

* [PD] Optimize disaggregation ib device help info (sgl-project#5781)

* [Test] Add flashmla attention backend test (sgl-project#5587)

* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)

* feat: Add a unified merge_state API (sgl-project#5428)

* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)

* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)

* Fix prefill OOM error in the case of large page size (sgl-project#5081)

* Fix problem of large page size with chunked prefill (sgl-project#6046)

* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)

* docs: add new blog (sgl-project#6048)

* Fix not "import os" (sgl-project#6057)

* Better PD initialization (sgl-project#5751)

* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)

* [Fix] Fix and rename flashmla CI test (sgl-project#6045)

* chore: upgrade cutlass 3.9.2 (sgl-project#6004)

Co-authored-by: yizhang2077 <[email protected]>

* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)

* Add DeepEP to CI PR Test (sgl-project#5655)

Co-authored-by: Jinyan Chen <[email protected]>

* fix custom_allreduce namespace (sgl-project#6039)

* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010)

Co-authored-by: Qiaolin-Yu <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>

* [Feature] Support for Ascend NPU backend (sgl-project#3853)

Signed-off-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>

* Fix the timeout for 8 gpu tests (sgl-project#6084)

* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)

* Super tiny fix doc (sgl-project#5233)

* [Doc]Fix description for dp_size argument (sgl-project#6063)

* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)

* [refactor] slightly tidy fp8 module (sgl-project#5993)

* Clean up fa3 test from 8 gpus (sgl-project#6105)

* Deferring 8 GPU test (sgl-project#6102)

* Update doc for MLA attention backends (sgl-project#6034)

* Clean logs for DeepSeek-V3 launching (sgl-project#6079)

* [CI]Add performance CI for VLM (sgl-project#6038)

Signed-off-by: Xinyuan Tong <[email protected]>

* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)

* optimize pad operations in fa3 to accelarate 100+us (sgl-project#6077)

* Overlap shared expert and routed expert computations (sgl-project#5121)

* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)

* Tiny refactor weight loading logic (sgl-project#5232)

* [PD] Add control to slow down a server (sgl-project#5572)

* Change AMD test threshold (sgl-project#6091)

* DeepEP normal support deepgemm-contiguous (sgl-project#5626)

Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>

* [fix] fix pyproject.toml dependencies (sgl-project#6119)

* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764)

Co-authored-by: othame <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>

* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)

* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)

* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123)

Co-authored-by: zhyncs <[email protected]>

* upgrade xgrammar to 0.1.19 (sgl-project#6129)

* Remove unnecessary is_fa3_supported check (sgl-project#6112)

* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)

* docs: update README (sgl-project#6132)

* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)

* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)

* opt flashinfer mla cat (sgl-project#5822)

Co-authored-by: xuyongfei.xyf <[email protected]>

* Update amd nightly concurrency. (sgl-project#6141)

* feat: add thinking_budget (sgl-project#6089)

* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)

* fix bug where gpu0 occupies more memory when hicache is turned on (sgl-project#5778)

Co-authored-by: Zhiqiang Xie <[email protected]>

* chore: bump v0.4.6.post3 (sgl-project#6165)

* KV-Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016)

Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>

* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)

* Fix and Clean up chat-template requirement for VLM (sgl-project#6114)

Signed-off-by: Xinyuan Tong <[email protected]>

* [Docs] Delete duplicate content (sgl-project#6146)

Co-authored-by: ximing.wxm <[email protected]>

* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)

* Added async_encode method to Engine (sgl-project#4701)

* Fix data parallel perf regression (sgl-project#6183)

* Fix request abortion (sgl-project#6184)

* Add typo checker in pre-commit (sgl-project#6179)

Co-authored-by: Brayden Zhong <[email protected]>

* Remove duplicate IO Struct test (sgl-project#6180)

Signed-off-by: Emmanuel Ferdman <[email protected]>

* [PD] Add simple unit test for disaggregation feature (sgl-project#5654)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)

* feat: support loogle eval (sgl-project#6190)

* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)

* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)

* chore: upgrade deepgemm (sgl-project#6073)

* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)

* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196)

Co-authored-by: alcanderian <[email protected]>

* Handle empty input string for embedding models (sgl-project#5621)

Co-authored-by: Ravi Theja Desetty <[email protected]>
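
A minimal guard-clause sketch of the empty-input handling (names here are hypothetical stand-ins, not the actual SGLang code path):

```python
def model_encode(text: str) -> list[float]:
    # Stand-in for a real embedding forward pass.
    return [float(len(text))]

def embed(texts: list[str]) -> list[list[float]]:
    # Reject empty strings up front instead of letting an empty token
    # sequence reach the model and fail deep in the stack.
    if any(not t for t in texts):
        raise ValueError("embedding input must be a non-empty string")
    return [model_encode(t) for t in texts]

print(embed(["hello", "world"]))
```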

* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)

* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)

* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)

* [CI] Reorganize the 8 gpu tests (sgl-project#6192)

* Add dev-deepep docker image (sgl-project#6198)

* Replace time.time() with time.perf_counter() for benchmarking. (sgl-project#6178)

Signed-off-by: Lifu Huang <[email protected]>
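
The rationale in miniature: `time.time()` follows the wall clock, which NTP can step mid-measurement, while `time.perf_counter()` is monotonic and higher resolution, so it is the right timer for intervals:

```python
import time

start = time.perf_counter()  # monotonic, highest-resolution interval timer
time.sleep(0.1)
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.6f} s")
# Intervals taken with time.time() can be skewed (or even negative) if the
# wall clock is adjusted between the two reads.
```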

* Update README.md (sgl-project#6202)

* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)

* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)

* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)

* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558)

Co-authored-by: liusy58 <[email protected]>

* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)

* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201)

Co-authored-by: SangBin Cho <[email protected]>
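
For background, the standard PyTorch capture-and-replay pattern that a cuda-graph runner builds on (a generic sketch, not SGLang's actual runner):

```python
import torch

x = torch.zeros(8, 1024, device="cuda")
layer = torch.nn.Linear(1024, 1024, device="cuda")

# Warm up on a side stream first, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = layer(x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = layer(x)  # kernels recorded into g, using static buffers x and y

x.copy_(torch.randn_like(x))  # refresh the static input in place
g.replay()                    # re-execute the captured kernels cheaply
```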

* [CI] Fix PD mooncake dependency error (sgl-project#6212)

Signed-off-by: Shangming Cai <[email protected]>

* [CI] Re-enable pd disaggregation test (sgl-project#6231)

Signed-off-by: Shangming Cai <[email protected]>

* fix some typos (sgl-project#6209)

Co-authored-by: Brayden Zhong <[email protected]>

* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)

* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)

* Revert "fix some typos" (sgl-project#6244)

* chore: add hf_xet dep (sgl-project#6243)

* Update AMD nightly deps. (sgl-project#6241)

* [PD] Add support for different TP sizes per DP rank (sgl-project#5922)

Signed-off-by: Shangming Cai <[email protected]>

* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225)

Co-authored-by: SangBin Cho <[email protected]>

* fix typo (sgl-project#6248)

* Support tuning moe for llama 4 model (sgl-project#6042)

* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)

* [Llama4] Add docs note about enable multimodal (sgl-project#6235)

* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)

* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657)

Co-authored-by: liusy58 <[email protected]>
Co-authored-by: 颉沆 <[email protected]>

* model(vlm): pixtral (sgl-project#5084)

* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)

* Enable MI325X AMD CI. (sgl-project#6259)

* chore: bump v0.4.6.post4 (sgl-project#6245)

* formatting fix for the rebased commit for v0.4.6.post4

Signed-off-by: Mohit Sinha <[email protected]>

* fix issues in model runner and python packages

fix for the following issues:
> vLLM dependency for xgrammar==0.1.17
> 'Scheduler' object has no attribute 'device'
> 'pp_proxy_tensors' unexpected arg in HPUGraphRunner
> TODO: Add pipeline parallelism support in HPUGraphRunner

Signed-off-by: Mohit Sinha <[email protected]>

* fix formatting in model runner

Signed-off-by: Mohit Sinha <[email protected]>

* base grammar fix for the is_terminated case

>  'OutlinesGrammar' object has no attribute 'is_terminated'

Signed-off-by: Mohit Sinha <[email protected]>

---------

Signed-off-by: Kebe <[email protected]>
Signed-off-by: congcongke <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: Lifu Huang <[email protected]>
Signed-off-by: Song Zhang <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Emmanuel Ferdman <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Mohit Sinha <[email protected]>
Co-authored-by: Wenxuan Tan <[email protected]>
Co-authored-by: JieXin Liang <[email protected]>
Co-authored-by: Yuhong Guo <[email protected]>
Co-authored-by: saltyfish66 <[email protected]>
Co-authored-by: vzed <[email protected]>
Co-authored-by: Mick <[email protected]>
Co-authored-by: Ke Bao <[email protected]>
Co-authored-by: saienduri <[email protected]>
Co-authored-by: DavidBao <[email protected]>
Co-authored-by: Frankey_8080 <[email protected]>
Co-authored-by: Stefan He <[email protected]>
Co-authored-by: yan97ao <[email protected]>
Co-authored-by: aoshen524 <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: lambert0312 <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: zhanweidu <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Baizhou Zhang <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: Huapeng Zhou <[email protected]>
Co-authored-by: NoahM <[email protected]>
Co-authored-by: Simo Lin <[email protected]>
Co-authored-by: Trevor Morris <[email protected]>
Co-authored-by: Yineng Zhang <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: mlmz <[email protected]>
Co-authored-by: shuaills <[email protected]>
Co-authored-by: Chayenne <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Co-authored-by: ybyang <[email protected]>
Co-authored-by: JiLi <[email protected]>
Co-authored-by: HAI <[email protected]>
Co-authored-by: PGFLMG <[email protected]>
Co-authored-by: sighingnow <[email protected]>
Co-authored-by: XTY <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Chang Su <[email protected]>
Co-authored-by: woodx <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Beichen Ma <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: pengcuo <[email protected]>
Co-authored-by: Adarsh Shirawalmath <[email protected]>
Co-authored-by: simveit <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: alcanerian <[email protected]>
Co-authored-by: Yuhao Chen <[email protected]>
Co-authored-by: zhjunqin <[email protected]>
Co-authored-by: liwenju0 <[email protected]>
Co-authored-by: wenju.li <[email protected]>
Co-authored-by: laixin <[email protected]>
Co-authored-by: sleepcoo <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: ryang <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: luoyuan.luo <[email protected]>
Co-authored-by: 江家瑋 <[email protected]>
Co-authored-by: KCFindstr <[email protected]>
Co-authored-by: xm:D <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Yongtong Wu <[email protected]>
Co-authored-by: Junrong Lin <[email protected]>
Co-authored-by: shangmingc <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: Zhiqiang Xie <[email protected]>
Co-authored-by: Hank Han <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Jinyan Chen <[email protected]>
Co-authored-by: Johnny <[email protected]>
Co-authored-by: Song Zhang <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: ishandhanani <[email protected]>
Co-authored-by: Cheng Wan <[email protected]>
Co-authored-by: Minglei Zhu <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: Yingyi Huang <[email protected]>
Co-authored-by: Xuting Zhou <[email protected]>
Co-authored-by: ZhengHSI <[email protected]>
Co-authored-by: Zhu Chen <[email protected]>
Co-authored-by: othame <[email protected]>
Co-authored-by: Hubert Lu <[email protected]>
Co-authored-by: Yixin Dong <[email protected]>
Co-authored-by: xu-yfei <[email protected]>
Co-authored-by: xuyongfei.xyf <[email protected]>
Co-authored-by: thyecust <[email protected]>
Co-authored-by: huangtingwei <[email protected]>
Co-authored-by: Simon (Jiyou) Li <[email protected]>
Co-authored-by: 继优 <[email protected]>
Co-authored-by: chus-chus <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: Steven Shimizu <[email protected]>
Co-authored-by: applesaucethebun <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: Emmanuel Ferdman <[email protected]>
Co-authored-by: Yusong Gao <[email protected]>
Co-authored-by: alcanderian <[email protected]>
Co-authored-by: Ravi Theja <[email protected]>
Co-authored-by: Ravi Theja Desetty <[email protected]>
Co-authored-by: liusy58 <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: 颉沆 <[email protected]>
Co-authored-by: Kiv Chen <[email protected]>