
Agentic LLM Stacking

Can Small Models Collaboratively Approach Large Model Performance?

A hierarchical tool stacking framework that explores whether multiple small language models can collaborate to achieve performance comparable to large frontier models like GPT-4o.


🎯 Overview

This project implements a Hierarchical Tool Stacking approach to enhance small language model performance on complex reasoning tasks. Instead of relying on a single large model, we:

  1. Warm-up Phase: Evaluate multiple small models independently on training samples
  2. Stacking Phase: Progressively combine top-performing models in hierarchical layers
  3. Collaborative Reasoning: Allow models to call each other as tools through a ReAct agent framework

Key Insight: Small models can complement each other's strengths when properly orchestrated, potentially approaching the capabilities of much larger models at reduced cost.
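In pseudocode, the intended flow is roughly the following; evaluate and build_agent are placeholder names used for illustration, not the project's actual functions:

# Conceptual sketch of the warm-up + stacking loop.
# evaluate() and build_agent() are placeholders, not the real API in Stacking_agent/.
from typing import Callable, List

def stack(models: List[object],
          evaluate: Callable[[object], float],             # warm-up score on training samples
          build_agent: Callable[[List[object]], object]):  # wraps tools in a ReAct agent
    # Warm-up phase: rank candidates by their independent scores
    ranked = sorted(models, key=evaluate, reverse=True)
    best, best_score = ranked[0], evaluate(ranked[0])

    # Stacking phase: pair the current best with the next candidates,
    # keeping a new layer only while the warm-up score keeps improving
    for partner in ranked[1:]:
        candidate = build_agent([best, partner])           # tool_number = 2 tools per agent
        score = evaluate(candidate)
        if score <= best_score:                            # performance plateaued: stop stacking
            break
        best, best_score = candidate, score
    return best, best_score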


📊 Experimental Tasks

MATH-500

  • Domain: Mathematical reasoning
  • Dataset: 500 challenging math problems requiring multi-step reasoning
  • Baseline: GPT-4o
  • Metrics: Exact match accuracy

GPQA (Graduate-Level Google-Proof Q&A)

  • Domain: Graduate-level science questions (Physics, Chemistry, Biology)
  • Dataset: Expert-written questions requiring deep domain knowledge
  • Baseline: GPT-4o
  • Metrics: Multiple-choice accuracy

πŸ—οΈ Architecture

Hierarchical Stacking Mechanism

Layer 0: Base Models
├── Qwen2.5-72B-Instruct
├── Llama-3.3-70B-Instruct
├── DeepSeek-R1-Distill-Qwen-32B
├── DeepSeek-R1-Distill-Llama-70B
└── Qwen3-32B

Layer 1: Single-Model Agents
├── Agent(Qwen2.5-72B_0) → Qwen2.5-72B_1
├── Agent(Llama-3.3-70B_0) → Llama-3.3-70B_1
└── ...

Layer 2+: Collaborative Agents
├── Agent([Best_Model_2, Second_Best_1])
├── Agent([Best_Model_2, Third_Best_2])
└── ... (iteratively stacked until performance plateaus)

ReAct Agent Framework

Each stacked layer uses a ReAct (Reasoning + Acting) loop:

Thought: [Analyze the problem]
Action: [Call a tool/model]
Action Input: {"query": "..."}
Observation: [Tool response]
... (iterate up to 15 times)
Final Answer: [Integrated result]

Critical Design Choice: The prompt requires each agent to call every available tool at least once, which prevents over-reliance on a single model and encourages collaborative reasoning.
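A minimal version of such a loop might look like the sketch below. In the project the call-every-tool constraint is enforced through the prompt; here it is shown as an explicit check purely for illustration, and the parsing is simplified compared to Stacking_agent/agent.py.

# Minimal ReAct-style control loop (illustrative sketch, not the actual agent.py).
import re
from typing import Callable, Dict

def react_loop(llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               question: str, max_steps: int = 15) -> str:
    transcript, called = question, set()
    for _ in range(max_steps):
        step = llm(transcript)                      # model emits Thought / Action / Final Answer
        transcript += "\n" + step
        if "Final Answer:" in step:
            if called == set(tools):                # every tool consulted at least once
                return step.split("Final Answer:")[-1].strip()
            transcript += "\nCall the remaining tools before giving a final answer."
            continue
        match = re.search(r"Action:\s*(\w+).*?Action Input:\s*(.+)", step, re.S)
        if match:
            name, query = match.group(1), match.group(2).strip()
            observation = tools[name](query) if name in tools else f"Unknown tool: {name}"
            called.add(name)
            transcript += f"\nObservation: {observation}"
    return "No final answer within the step budget"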


🚀 Quick Start

Installation

# Create conda environment
conda create -n llm-stacking python=3.10
conda activate llm-stacking

# Install dependencies
pip install -r requirements.txt

API Configuration

  1. Copy template.env to .env
  2. Add your OpenRouter API key:
OPENROUTER_API_KEY="your_api_key_here"

Note: This project uses OpenRouter to access multiple LLMs through a unified API. You can modify Stacking_agent/Basemodel.py to use other providers.
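For orientation, a chat call through OpenRouter's OpenAI-compatible endpoint looks roughly like the sketch below (illustration only; the real wrapper is Stacking_agent/Basemodel.py and its interface may differ):

# Sketch of a chat call against OpenRouter's OpenAI-compatible API
# (illustration only; not the actual Basemodel.py implementation).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def chat(model: str, prompt: str, temperature: float = 0.7) -> str:
    response = client.chat.completions.create(
        model=model,  # an OpenRouter model ID, e.g. "qwen/qwen-2.5-72b-instruct"
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content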


🧪 Running Experiments

MATH-500 Task

Full training + testing (automatic stacking):

python main.py \
  --Task "MATH-500" \
  --tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
  --topN 5 \
  --tool_number 2 \
  --train_data_number 20

Test specific stacking structure (skip training):

python main.py \
  --Task "MATH-500" \
  --no_train \
  --Stacking "['Qwen25_72B_2','Llama33_70B_1']" \
  --topN 5 \
  --tool_number 2

GPQA Task

python main.py \
  --Task "GPQA" \
  --tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
  --topN 5 \
  --tool_number 2 \
  --train_data_number 20

Baseline Comparison (GPT-4o)

# MATH-500 baseline
python baseline/MATH-500/gpt4o.py

# GPQA baseline
python baseline/GPQA/gpt4o.py

πŸ“ Project Structure

.
├── main.py                          # Main experiment runner
├── Stacking_agent/
│   ├── Stacking.py                  # Hierarchical stacking algorithm
│   ├── warmup.py                    # Warm-up phase: evaluate individual models
│   ├── agent.py                     # ReAct agent implementation
│   ├── generator.py                 # Dynamic agent code generation
│   ├── Basemodel.py                 # LLM API wrapper
│   ├── utils.py                     # Metrics & task configurations
│   ├── prompt/
│   │   └── ReAct_prompt.py          # ReAct prompt templates
│   └── tools/
│       ├── Llama.py                 # Llama-3.3-70B tool
│       ├── Qwen25_72B.py            # Qwen2.5-72B tool
│       ├── DeepseekR1DistillQwen32B.py
│       ├── DeepseekR1DistillLlama70B.py
│       └── Qwen3_32B.py
├── Dataset/
│   ├── MATH-500/
│   │   ├── train.json               # Training samples (20)
│   │   └── test.json                # Test samples (500)
│   └── GPQA/
│       ├── train.json
│       ├── test.json
│       └── all.json
├── baseline/
│   ├── MATH-500/
│   │   └── gpt4o.py                 # GPT-4o baseline
│   └── GPQA/
│       └── gpt4o.py
└── Result/
    └── Stacking/
        ├── MATH-500/                # Experimental results
        └── GPQA/

📈 Key Parameters

| Parameter | Description | Typical Value |
|---|---|---|
| --Task | Task name (MATH-500 or GPQA) | Required |
| --tools | List of tool names to evaluate | "['Llama33_70B','Qwen25_72B',...]" |
| --topN | Number of top tools to stack | 5 |
| --tool_number | Tools per agent in stacking | 2 |
| --train_data_number | Training samples for warm-up | 20 |
| --no_train | Skip training, use provided stacking | Flag |
| --Stacking | Predefined stacking structure | "['Model_2','Model_1']" |
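Note that --tools and --Stacking are passed as quoted Python-list strings. A minimal parser along these lines could handle the flags above (a sketch; main.py's actual argument definitions may differ):

# Sketch of the CLI flags listed above; not necessarily main.py's exact definitions.
import argparse
import ast

parser = argparse.ArgumentParser()
parser.add_argument("--Task", required=True)                            # "MATH-500" or "GPQA"
parser.add_argument("--tools", type=ast.literal_eval, default=[])       # "['Llama33_70B','Qwen25_72B']"
parser.add_argument("--Stacking", type=ast.literal_eval, default=None)  # "['Model_2','Model_1']"
parser.add_argument("--topN", type=int, default=5)
parser.add_argument("--tool_number", type=int, default=2)
parser.add_argument("--train_data_number", type=int, default=20)
parser.add_argument("--no_train", action="store_true")                  # skip the warm-up/stacking search
args = parser.parse_args()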

🔧 Adding New Models

  1. Create a tool file in Stacking_agent/tools/:
# Stacking_agent/tools/YourModel.py
from Stacking_agent.Basemodel import Basemodel

class YourModel(Basemodel):
    def __init__(self):
        super().__init__(
            model_name="provider/model-name",  # OpenRouter model ID
            temperature=0.7
        )
        self.name = "YourModel"
        self.description = "Description of what this model does well"

    def _run(self, query, **kwargs):
        messages = [{"role": "user", "content": query}]
        response = self.chat(messages)
        return response, self.total_tokens
  2. Register the model in the generator:
# Stacking_agent/generator.py
self.tool_mapping = {
    ...
    'YourModel': 'YourModel()',
    ...
}
  3. Import it in tools/__init__.py:
from .YourModel import YourModel
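Once registered, the new tool can be passed to main.py by name, for example (YourModel is the hypothetical name from the steps above):

# Run MATH-500 with the newly added (hypothetical) tool included
python main.py \
  --Task "MATH-500" \
  --tools "['YourModel','Qwen25_72B','Llama33_70B']" \
  --topN 3 \
  --tool_number 2 \
  --train_data_number 20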

📊 Understanding Results

Output Files

After running experiments, results are saved to Result/Stacking/{Task}/:

Warm-up Phase:

  • warmup_{ModelName}_0.json: Base model performance (no agent)
  • warmup_{ModelName}_1.json: Single-model agent performance
  • warmup_{ModelName}_2.json: Recursive stacking performance

Stacking Phase:

  • Stacking_['Model_X', 'Model_Y'].json: Combined agent performance
  • ['Model_X', 'Model_Y']_5_2_20.json: Final test results

Log Files:

  • log/{Task}_{timestamp}.txt: Complete stacking results and scores

Evaluation Metrics

MATH-500:

  • Exact Match: Normalized exact answer matching
  • Accuracy: Percentage of correctly solved problems

GPQA:

  • Accuracy: Multiple-choice answer accuracy
  • Per-domain breakdown: Physics, Chemistry, Biology
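As a rough illustration, the two metrics above can be computed along these lines (the project's own scoring lives in Stacking_agent/utils.py and its normalization rules may differ):

# Rough sketch of the evaluation metrics; the actual logic is in Stacking_agent/utils.py.
import re
from typing import List

def normalize(answer: str) -> str:
    """Strip \\boxed{...}, whitespace and case so superficially different answers compare equal."""
    answer = answer.lower().strip()
    answer = re.sub(r"\\boxed{(.*)}", r"\1", answer)
    return re.sub(r"\s+", "", answer)

def exact_match_accuracy(preds: List[str], golds: List[str]) -> float:
    # MATH-500: normalized exact answer matching
    return sum(normalize(p) == normalize(g) for p, g in zip(preds, golds)) / len(golds)

def choice_accuracy(preds: List[str], golds: List[str]) -> float:
    # GPQA: compare the predicted option letter (A/B/C/D)
    return sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds)) / len(golds)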

🧬 Model Notation

  • ModelName_0: Base model (no stacking)
  • ModelName_1: Agent([ModelName_0, ModelName_0])
  • ModelName_2: Agent([ModelName_1, ModelName_0])
  • ['Model_2', 'Another_1']: Collaborative agent with two different models
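Written out, the recursion behind this notation expands as follows (an illustrative helper, not project code):

# Illustrative expansion of the ModelName_k notation (not project code).
def expand(name: str, depth: int) -> str:
    if depth == 0:
        return name                                        # ModelName_0: the bare model
    return f"Agent([{expand(name, depth - 1)}, {name}])"   # ModelName_k wraps ModelName_{k-1} and ModelName_0

print(expand("Qwen25_72B", 2))
# Agent([Agent([Qwen25_72B, Qwen25_72B]), Qwen25_72B])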

🎓 Research Context

This framework was originally developed as ChemAmp for chemistry tasks, then adapted for general-domain reasoning to explore:

  • Cost-Performance Tradeoffs: Can coordinated small models match large models at lower cost?
  • Complementary Strengths: How do different model architectures collaborate?
  • Emergent Capabilities: Does hierarchical stacking unlock new reasoning patterns?

Key Finding: Stacking typically improves performance for the first 1-3 layers before plateauing, suggesting diminishing returns from deeper hierarchies.


πŸ™ Acknowledgements

  • Multi-agent experiment code adapted from GPTSwarm and AgentPrune
  • Original chemistry framework: ChemAmp (Chemical Amplified Chemistry Tools)

🤝 Contributing

Contributions welcome! Please:

  1. Test new models on both MATH-500 and GPQA
  2. Document performance changes in PRs
  3. Follow existing code structure in Stacking_agent/tools/
