Korean README: see README_ko.md
A Flask-based proxy server that enables seamless integration of AI models with different tool call formats by automatically converting them to OpenAI's standard format. Perfect for using models like GLM with OpenAI-compatible clients.
I don't generally do vibe coding, but this seemed like a decent place to get my feet wet. I've added Qwen3-Coder and Qwen3 parsing through codex and GPT-OSS-20B. It'll also pick up the .env file, which is nice.
Features:

- Automatic Tool Call Conversion: Converts model-specific tool call formats (like GLM's `<tool_call>` syntax) to OpenAI's standard format
- Streaming Support: Full support for both streaming and non-streaming responses
- Model-Specific Handling: Modular converter system that automatically detects and handles different model formats
- Full OpenAI API Compatibility: Supports all major endpoints (`/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`)
- Configurable: Environment-based configuration for easy deployment
- Extensible: Easy to add support for new model formats
Requirements:

- Python 3.7+
- A running AI model server (LM Studio, Ollama, etc.)
Quick start:

- Clone and set up:

```bash
git clone <repository-url>
cd proxy
pip install -r requirements.txt
```
- Configure your backend (optional):

```bash
# Set environment variables or modify config.py
export BACKEND_HOST=localhost
export BACKEND_PORT=8888
```
- Start the proxy:

```bash
python app.py
```
- Use with any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000", api_key="your-key")
response = client.chat.completions.create(
    model="glm-4.5-air-hi-mlx",
    messages=[{"role": "user", "content": "Search for information about Python"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "search_wikipedia",
            "description": "Search Wikipedia",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"]
            }
        }
    }]
)
```
Supported models:

| Model Type | Tool Call Format | Status |
|---|---|---|
| GLM Models | `<tool_call>` syntax | ✅ Full Support |
| OpenAI Models | Standard format | ✅ Pass-through |
| Claude Models | `<invoke>` syntax | ✅ Example Implementation |
Input (GLM format):
```
I'll search for that information.
<tool_call>fetch_wikipedia_content
<arg_key>search_query</arg_key>
<arg_value>Python programming</arg_value>
</tool_call>
```
Output (OpenAI format):
```json
{
  "tool_calls": [{
    "id": "123456789",
    "type": "function",
    "function": {
      "name": "fetch_wikipedia_content",
      "arguments": "{\"search_query\": \"Python programming\"}"
    }
  }],
  "finish_reason": "tool_calls"
}
```

Configuration environment variables:

| Variable | Default | Description |
|---|---|---|
| `BACKEND_HOST` | `localhost` | Backend server hostname |
| `BACKEND_PORT` | `8888` | Backend server port |
| `BACKEND_PROTOCOL` | `http` | Backend protocol |
| `PROXY_HOST` | `0.0.0.0` | Proxy server bind address |
| `PROXY_PORT` | `5000` | Proxy server port |
| `REQUEST_TIMEOUT` | `3600` | Regular request timeout (seconds) |
| `STREAMING_TIMEOUT` | `3600` | Streaming request timeout (seconds; use `none` to disable) |
| `ENABLE_TOOL_CALL_CONVERSION` | `true` | Enable/disable tool call conversion |
| `REMOVE_THINK_TAGS` | `true` | Remove complete `<think>...</think>` blocks from responses |
| `LOG_LEVEL` | `INFO` | Logging level |
| `FLASK_ENV` | `development` | Environment (development/production/testing) |
Create a .env file or modify config.py directly:
```python
# config.py
BACKEND_HOST = 'localhost'
BACKEND_PORT = 8888
PROXY_PORT = 5000
ENABLE_TOOL_CALL_CONVERSION = True
REMOVE_THINK_TAGS = True  # Set to False to preserve <think> content
```
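Since the proxy also picks up a `.env` file, an equivalent `.env` using the defaults from the table above could look like this (a sketch; trim it to just the values you want to override):

```
# .env -- example values matching the defaults in the table above
BACKEND_HOST=localhost
BACKEND_PORT=8888
BACKEND_PROTOCOL=http
PROXY_HOST=0.0.0.0
PROXY_PORT=5000
REQUEST_TIMEOUT=3600
STREAMING_TIMEOUT=3600
ENABLE_TOOL_CALL_CONVERSION=true
REMOVE_THINK_TAGS=true
LOG_LEVEL=INFO
FLASK_ENV=development
```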
Predefined backend presets are available via `get_backend_config`:

```python
from config import get_backend_config

# Use with LM Studio
lmstudio_config = get_backend_config('lmstudio')   # localhost:8888

# Use with Ollama
ollama_config = get_backend_config('ollama')       # localhost:11434

# Use with OpenAI API
openai_config = get_backend_config('openai')       # api.openai.com:443
```
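The layout of these presets isn't shown in this README; purely as an illustration, and assuming a simple dict of named backends (`BACKEND_PRESETS` and the function body here are hypothetical, not the actual `config.py`), they could look like:

```python
# Hypothetical sketch only -- the real config.py may organize this differently.
BACKEND_PRESETS = {
    'lmstudio': {'host': 'localhost',      'port': 8888,  'protocol': 'http'},
    'ollama':   {'host': 'localhost',      'port': 11434, 'protocol': 'http'},
    'openai':   {'host': 'api.openai.com', 'port': 443,   'protocol': 'https'},
}

def get_backend_config(name: str) -> dict:
    """Return the host/port/protocol settings for a named backend preset."""
    return BACKEND_PRESETS[name]
```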
Architecture overview:

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     Client      │────▶│   Proxy Server   │────▶│  Backend Model  │
│  (OpenAI API)   │     │                  │     │   (GLM/etc.)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                  │
                                  ▼
                        ┌────────────────────┐
                        │ Converter Factory  │
                        │ - GLM Converter    │
                        │ - OpenAI Converter │
                        │ - Claude Converter │
                        │ - Custom Converter │
                        └────────────────────┘
```
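The Converter Factory in the diagram picks a converter based on the requested model name. A minimal sketch of that dispatch, assuming the `register_converter`/`can_handle_model` interface shown in the next section (`get_converter` and the internals here are assumptions, not the actual `factory.py`):

```python
# Hypothetical sketch of the factory's dispatch -- the real factory.py may differ.
class ConverterFactory:
    def __init__(self):
        self._converters = []

    def register_converter(self, converter) -> None:
        # Converters are tried in registration order.
        self._converters.append(converter)

    def get_converter(self, model_name: str):
        # Return the first registered converter that claims the model;
        # None means no conversion (pass-through).
        for converter in self._converters:
            if converter.can_handle_model(model_name):
                return converter
        return None


converter_factory = ConverterFactory()
```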
To add support for a new model format:

- Create a converter:

```python
# converters/mymodel.py
from typing import Dict, List

from .base import ToolCallConverter


class MyModelConverter(ToolCallConverter):
    def can_handle_model(self, model_name: str) -> bool:
        return 'mymodel' in model_name.lower()

    def parse_tool_calls(self, content: str) -> List[Dict]:
        # Your parsing logic here
        return []
```
- Register the converter:

```python
# In factory.py or at runtime
from converters.factory import converter_factory

converter_factory.register_converter(MyModelConverter())
```
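As a concrete illustration of what a parser can return, here is a hedged sketch of `parse_tool_calls` for the GLM-style `<tool_call>` blocks shown earlier. The output shape matches the OpenAI-format example above; the regexes and id scheme are assumptions, not the shipped GLM converter:

```python
# Hypothetical sketch -- the shipped GLM converter may differ. Inside a converter
# class this would be a method taking (self, content).
import json
import re
import uuid
from typing import Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)


def parse_tool_calls(content: str) -> List[Dict]:
    tool_calls = []
    for block in TOOL_CALL_RE.findall(content):
        # The first line of the block is the function name.
        name = block.strip().splitlines()[0].strip()
        arguments = {k.strip(): v.strip() for k, v in ARG_RE.findall(block)}
        tool_calls.append({
            "id": uuid.uuid4().hex[:12],             # any unique id is fine
            "type": "function",
            "function": {
                "name": name,
                "arguments": json.dumps(arguments),  # OpenAI expects a JSON string
            },
        })
    return tool_calls
```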
Available endpoints:

- `POST /v1/chat/completions` - Chat completions with tool call conversion
- `POST /chat/completions` - Alternative endpoint
- `GET /v1/models` - List available models
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /health` - Health check
Example request:

```bash
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.5-air-hi-mlx",
"messages": [
{"role": "user", "content": "Search for Python tutorials"}
],
"tools": [
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
}
}
}
],
"stream": false
}'
```

Run the test suite:

```bash
# Test modular converters
python test_modular_converters.py
# Test tool call conversion
python test_tool_call_real.py
# Test streaming functionality
python test_streaming_tools.py
# Test full API compatibility
python test_full_api.py
```

Manual end-to-end test:

```bash
# Start your backend model server (e.g., LM Studio on port 8888)
# Start the proxy server
python app.py
# Test with the example client
python lmstudio-tooluse-test.py
```

Troubleshooting:

- Connection Refused:

```bash
# Check if backend server is running
curl http://localhost:8888/v1/models

# Check proxy server
curl http://localhost:5000/health
```
- Tool Calls Not Converting:

```bash
# Check if conversion is enabled
export ENABLE_TOOL_CALL_CONVERSION=true

# Check model detection
# Make sure your model name matches the converter patterns
```
- Import Errors:

```bash
# Make sure you're in the correct directory
cd /path/to/proxy
python app.py
```
Run with debug logging:

```bash
export FLASK_ENV=development
export LOG_LEVEL=DEBUG
python app.py
```

Contributing:

- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Add tests for your changes
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
We welcome contributions for new model formats! Please:
- Create a converter in `converters/`
- Add comprehensive tests
- Update documentation
- Submit a PR with examples
This project is licensed under the MIT License - see the LICENSE file for details.
The proxy server can handle GLM model `<think>` tags for better response formatting. Set the `REMOVE_THINK_TAGS` environment variable to control the behavior:

- `REMOVE_THINK_TAGS=true` (default): Remove complete `<think>...</think>` blocks
- `REMOVE_THINK_TAGS=false`: Preserve `<think>` content for debugging/transparency
With REMOVE_THINK_TAGS=true (default):
Input: "I need to analyze this. <think>Let me think step by step...</think> Here's my answer."
Output: "I need to analyze this. Here's my answer."
With REMOVE_THINK_TAGS=false:
Input: "I need to analyze this. <think>Let me think step by step...</think> Here's my answer."
Output: "I need to analyze this. Let me think step by step... Here's my answer."
Note: Malformed/orphaned think tags are always cleaned up regardless of the setting:
- `</think>` without an opening tag → removed
- `<think>` without a closing tag → removed
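For illustration, the removal behavior described above can be expressed with a simple regex. This is a sketch of the documented behavior, not necessarily the proxy's actual implementation:

```python
# Sketch of the behavior described above (illustrative; the proxy's actual
# implementation may differ).
import re


def handle_think_tags(content: str, remove_think_tags: bool = True) -> str:
    if remove_think_tags:
        # REMOVE_THINK_TAGS=true: drop complete <think>...</think> blocks entirely.
        content = re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)
    # REMOVE_THINK_TAGS=false keeps the content, and malformed/orphaned tags are
    # always cleaned up, so any remaining bare tags are stripped either way.
    return content.replace("<think>", "").replace("</think>", "")
```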
The proxy server supports different timeout settings for regular and streaming requests:
- `REQUEST_TIMEOUT=3600`: Regular request timeout (60 minutes)
- `STREAMING_TIMEOUT=3600`: Streaming request timeout (60 minutes)
For long-running streaming requests, you can disable the timeout:
```bash
export STREAMING_TIMEOUT=none
# or
export STREAMING_TIMEOUT=0
# or
export STREAMING_TIMEOUT=false
```

Short timeout for quick responses:

```bash
export REQUEST_TIMEOUT=30
export STREAMING_TIMEOUT=120
```

No timeout for long streaming sessions:

```bash
export REQUEST_TIMEOUT=3600
export STREAMING_TIMEOUT=none
```

Note: Disabling the streaming timeout is useful for:
- Long document generation
- Complex reasoning tasks
- Large dataset processing
- Extended conversations
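As an illustration of how the disable values shown above (`none`, `0`, `false`) might be interpreted, here is a hedged sketch; the proxy's actual parsing code may differ, and `resolve_streaming_timeout` is an assumed helper name:

```python
# Sketch only -- not the proxy's actual implementation.
import os
from typing import Optional


def resolve_streaming_timeout(default: float = 3600.0) -> Optional[float]:
    """Map STREAMING_TIMEOUT to a timeout in seconds, or None for 'no timeout'."""
    raw = os.getenv("STREAMING_TIMEOUT", str(default)).strip().lower()
    if raw in ("none", "0", "false"):
        return None  # e.g. requests treats timeout=None as 'wait indefinitely'
    return float(raw)
```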
- Built for seamless integration with LM Studio
- Compatible with OpenAI Python SDK
- Inspired by the need for universal AI model compatibility
- Bug Reports: Open an issue
- Feature Requests: Start a discussion
- Documentation: Wiki
- Korean Documentation: README_ko.md
Made with ❤️ for the AI community