Can Small Models Collaboratively Approach Large Model Performance?
A hierarchical tool stacking framework that explores whether multiple small language models can collaborate to achieve performance comparable to large frontier models like GPT-4o.
This project implements a Hierarchical Tool Stacking approach to enhance small language model performance on complex reasoning tasks. Instead of relying on a single large model, we:
- Warm-up Phase: Evaluate multiple small models independently on training samples
- Stacking Phase: Progressively combine top-performing models in hierarchical layers
- Collaborative Reasoning: Allow models to call each other as tools through a ReAct agent framework
Key Insight: Small models can complement each other's strengths when properly orchestrated, potentially approaching the capabilities of much larger models at reduced cost.
MATH-500:
- Domain: Mathematical reasoning
- Dataset: 500 challenging math problems requiring multi-step reasoning
- Baseline: GPT-4o
- Metrics: Exact match accuracy

GPQA:
- Domain: Graduate-level science questions (Physics, Chemistry, Biology)
- Dataset: Expert-written questions requiring deep domain knowledge
- Baseline: GPT-4o
- Metrics: Multiple-choice accuracy
Layer 0: Base Models
├── Qwen2.5-72B-Instruct
├── Llama-3.3-70B-Instruct
├── DeepSeek-R1-Distill-Qwen-32B
├── DeepSeek-R1-Distill-Llama-70B
└── Qwen2.5-32B-Instruct

Layer 1: Single-Model Agents
├── Agent(Qwen2.5-72B_0) → Qwen2.5-72B_1
├── Agent(Llama-3.3-70B_0) → Llama-3.3-70B_1
└── ...

Layer 2+: Collaborative Agents
├── Agent([Best_Model_2, Second_Best_1])
├── Agent([Best_Model_2, Third_Best_2])
└── ... (iteratively stacked until performance plateaus)
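The layer-by-layer search can be pictured with the short sketch below. This is illustrative only: `evaluate` is a hypothetical scorer that runs a combined agent on the warm-up samples, and the actual procedure is implemented in `Stacking_agent/Stacking.py` and may differ.

```python
# Illustrative sketch of the layer-by-layer stacking search (hypothetical `evaluate`
# scorer; the real procedure lives in Stacking_agent/Stacking.py).
def stack(scores, evaluate, max_layers=5):
    """scores: {tool_name: warm-up accuracy}. Greedily pair the current best agent
    with the other top tools, one layer at a time, until no pairing improves it."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, best_score = ranked[0], scores[ranked[0]]
    for _ in range(max_layers):
        # Score every pairing of the current best with another top candidate.
        layer = [(other, evaluate([best, other])) for other in ranked if other != best]
        partner, score = max(layer, key=lambda pair: pair[1])
        if score <= best_score:
            break  # performance plateaued: stop stacking deeper
        best, best_score = f"Agent([{best}, {partner}])", score
        ranked.append(best)
    return best, best_score
```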
Each stacked layer uses a ReAct (Reasoning + Acting) loop:
Thought: [Analyze the problem]
Action: [Call a tool/model]
Action Input: {"query": "..."}
Observation: [Tool response]
... (iterate up to 15 times)
Final Answer: [Integrated result]
Critical Design Choice: The prompt enforces that agents must call all available tools at least once, preventing reliance on a single model and encouraging collaborative reasoning.
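For illustration, a minimal sketch of a ReAct-style control loop is shown below. The names `call_llm` and `tools` are hypothetical stand-ins for the project's model wrapper and registered tool objects; the real agent lives in `Stacking_agent/agent.py`.

```python
# Minimal ReAct-style loop (illustrative sketch; see Stacking_agent/agent.py for the real agent).
import json

def react_loop(question, tools, call_llm, max_steps=15):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)  # model emits Thought / Action / Action Input or Final Answer
        transcript += step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Parse the chosen tool and its JSON input, then execute it and feed back the observation.
        tool_name = step.split("Action:")[-1].split("\n")[0].strip()
        tool_input = json.loads(step.split("Action Input:")[-1].split("\n")[0])
        observation, _tokens = tools[tool_name]._run(tool_input["query"])
        transcript += f"\nObservation: {observation}\n"
    return transcript  # step budget exhausted without a final answer
```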
# Create conda environment
conda create -n llm-stacking python=3.10
conda activate llm-stacking
# Install dependencies
pip install -r requirements.txt

- Copy template.env to .env
- Add your OpenRouter API key:

OPENROUTER_API_KEY="your_api_key_here"

Note: This project uses OpenRouter to access multiple LLMs through a unified API. You can modify Stacking_agent/Basemodel.py to use other providers.
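For reference, a minimal sketch of an OpenRouter call, assuming the OpenAI-compatible endpoint and the official `openai` Python client; the project's actual wrapper is `Stacking_agent/Basemodel.py` and may be structured differently.

```python
# Illustrative OpenRouter call via the OpenAI-compatible endpoint
# (the project's real wrapper is Stacking_agent/Basemodel.py).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="qwen/qwen-2.5-72b-instruct",  # any OpenRouter model ID
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```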
Full training + testing (automatic stacking):
python main.py \
--Task "MATH-500" \
--tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
--topN 5 \
--tool_number 2 \
  --train_data_number 20

Test specific stacking structure (skip training):
python main.py \
--Task "MATH-500" \
--no_train \
--Stacking "['Qwen25_72B_2','Llama33_70B_1']" \
--topN 5 \
  --tool_number 2

Run the full pipeline on GPQA:

python main.py \
--Task "GPQA" \
--tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
--topN 5 \
--tool_number 2 \
  --train_data_number 20

Run the GPT-4o baselines:

# MATH-500 baseline
python baseline/MATH-500/gpt4o.py
# GPQA baseline
python baseline/GPQA/gpt4o.py

Project structure:
.
├── main.py                          # Main experiment runner
├── Stacking_agent/
│   ├── Stacking.py                  # Hierarchical stacking algorithm
│   ├── warmup.py                    # Warm-up phase: evaluate individual models
│   ├── agent.py                     # ReAct agent implementation
│   ├── generator.py                 # Dynamic agent code generation
│   ├── Basemodel.py                 # LLM API wrapper
│   ├── utils.py                     # Metrics & task configurations
│   ├── prompt/
│   │   └── ReAct_prompt.py          # ReAct prompt templates
│   └── tools/
│       ├── Llama.py                 # Llama-3.3-70B tool
│       ├── Qwen25_72B.py            # Qwen2.5-72B tool
│       ├── DeepseekR1DistillQwen32B.py
│       ├── DeepseekR1DistillLlama70B.py
│       └── Qwen3_32B.py
├── Dataset/
│   ├── MATH-500/
│   │   ├── train.json               # Training samples (20)
│   │   └── test.json                # Test samples (500)
│   └── GPQA/
│       ├── train.json
│       ├── test.json
│       └── all.json
├── baseline/
│   ├── MATH-500/
│   │   └── gpt4o.py                 # GPT-4o baseline
│   └── GPQA/
│       └── gpt4o.py
└── Result/
    └── Stacking/
        ├── MATH-500/                # Experimental results
        └── GPQA/
| Parameter | Description | Typical Value |
|---|---|---|
| `--Task` | Task name (MATH-500, GPQA) | Required |
| `--tools` | List of tool names to evaluate | `"['Llama33_70B','Qwen25_72B',...]"` |
| `--topN` | Number of top tools to stack | 5 |
| `--tool_number` | Tools per agent in stacking | 2 |
| `--train_data_number` | Training samples for warm-up | 20 |
| `--no_train` | Skip training, use provided stacking | Flag |
| `--Stacking` | Predefined stacking structure | `"['Model_2','Model_1']"` |
- Create a tool file in Stacking_agent/tools/:
# Stacking_agent/tools/YourModel.py
from Stacking_agent.Basemodel import Basemodel

class YourModel(Basemodel):
    def __init__(self):
        super().__init__(
            model_name="provider/model-name",  # OpenRouter model ID
            temperature=0.7
        )
        self.name = "YourModel"
        self.description = "Description of what this model does well"

    def _run(self, query, **kwargs):
        messages = [{"role": "user", "content": query}]
        response = self.chat(messages)
        return response, self.total_tokens

- Register the model in the generator:
# Stacking_agent/generator.py
self.tool_mapping = {
    ...
    'YourModel': 'YourModel()',
    ...
}

- Import it in tools/__init__.py:
from .YourModel import YourModel

After running experiments, results are saved to Result/Stacking/{Task}/:
Warm-up Phase:
- warmup_{ModelName}_0.json: Base model performance (no agent)
- warmup_{ModelName}_1.json: Single-model agent performance
- warmup_{ModelName}_2.json: Recursive stacking performance
Stacking Phase:
- Stacking_['Model_X', 'Model_Y'].json: Combined agent performance
- ['Model_X', 'Model_Y']_5_2_20.json: Final test results
Log Files:
- log/{Task}_{timestamp}.txt: Complete stacking results and scores
MATH-500:
- Exact Match: Normalized exact answer matching
- Accuracy: Percentage of correctly solved problems
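As an illustration, a minimal normalized exact-match check is sketched below; the project's actual metric lives in `Stacking_agent/utils.py` and may normalize more cases.

```python
# Illustrative normalized exact-match check (the real metric is in Stacking_agent/utils.py).
def normalize(answer: str) -> str:
    """Strip whitespace, a trailing period, and surrounding $...$; drop spaces and lowercase."""
    answer = answer.strip().rstrip(".").strip("$")
    return answer.replace(" ", "").lower()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match(" $42$. ", "42"))  # True
```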
GPQA:
- Accuracy: Multiple-choice answer accuracy
- Per-domain breakdown: Physics, Chemistry, Biology
Naming convention:
- ModelName_0: Base model (no stacking)
- ModelName_1: Agent([ModelName_0, ModelName_0])
- ModelName_2: Agent([ModelName_1, ModelName_0])
- ['Model_2', 'Another_1']: Collaborative agent with two different models

For example, Qwen25_72B_2 expands to Agent([Qwen25_72B_1, Qwen25_72B_0]), where Qwen25_72B_1 is itself Agent([Qwen25_72B_0, Qwen25_72B_0]).
This framework was originally developed as ChemAmp for chemistry tasks, then adapted for general-domain reasoning to explore:
- Cost-Performance Tradeoffs: Can coordinated small models match large models at lower cost?
- Complementary Strengths: How do different model architectures collaborate?
- Emergent Capabilities: Does hierarchical stacking unlock new reasoning patterns?
Key Finding: Stacking typically improves performance by 1-3 layers before plateauing, suggesting diminishing returns in deeper hierarchies.
- Multi-agent experiment code adapted from GPTSwarm and AgentPrune
- Original chemistry framework: ChemAmp (Chemical Amplified Chemistry Tools)
Contributions welcome! Please:
- Test new models on both MATH-500 and GPQA
- Document performance changes in PRs
- Follow the existing code structure in Stacking_agent/tools/