Can Small Models Collaboratively Approach Large Model Performance?
A hierarchical tool stacking framework that explores whether multiple small language models can collaborate to achieve performance comparable to large frontier models like GPT-4o.
This project implements a Hierarchical Tool Stacking approach to enhance small language model performance on complex reasoning tasks. Instead of relying on a single large model, we:
- Warm-up Phase: Evaluate multiple small models independently on training samples
- Stacking Phase: Progressively combine top-performing models in hierarchical layers
- Collaborative Reasoning: Allow models to call each other as tools through a ReAct agent framework
Key Insight: Small models can complement each other's strengths when properly orchestrated, potentially approaching the capabilities of much larger models at reduced cost.
MATH-500:
- Domain: Mathematical reasoning
- Dataset: 500 challenging math problems requiring multi-step reasoning
- Baseline: GPT-4o
- Metrics: Exact match accuracy

GPQA:
- Domain: Graduate-level science questions (Physics, Chemistry, Biology)
- Dataset: Expert-written questions requiring deep domain knowledge
- Baseline: GPT-4o
- Metrics: Multiple-choice accuracy
Layer 0: Base Models
├── Qwen2.5-72B-Instruct
├── Llama-3.3-70B-Instruct
├── DeepSeek-R1-Distill-Qwen-32B
├── DeepSeek-R1-Distill-Llama-70B
└── Qwen2.5-32B-Instruct

Layer 1: Single-Model Agents
├── Agent(Qwen2.5-72B_0) → Qwen2.5-72B_1
├── Agent(Llama-3.3-70B_0) → Llama-3.3-70B_1
└── ...

Layer 2+: Collaborative Agents
├── Agent([Best_Model_2, Second_Best_1])
├── Agent([Best_Model_2, Third_Best_2])
└── ... (iteratively stacked until performance plateaus)
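The layer-by-layer search can be pictured with the short sketch below. This is illustrative only: `evaluate` is a hypothetical scorer that runs a combined agent on the warm-up samples, and the actual procedure is implemented in `Stacking_agent/Stacking.py` and may differ.

```python
# Illustrative sketch of the layer-by-layer stacking search (hypothetical `evaluate`
# scorer; the real procedure lives in Stacking_agent/Stacking.py).
def stack(scores, evaluate, max_layers=5):
    """scores: {tool_name: warm-up accuracy}. Greedily pair the current best agent
    with the other top tools, one layer at a time, until no pairing improves it."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, best_score = ranked[0], scores[ranked[0]]
    for _ in range(max_layers):
        # Score every pairing of the current best with another top candidate.
        layer = [(other, evaluate([best, other])) for other in ranked if other != best]
        partner, score = max(layer, key=lambda pair: pair[1])
        if score <= best_score:
            break  # performance plateaued: stop stacking deeper
        best, best_score = f"Agent([{best}, {partner}])", score
        ranked.append(best)
    return best, best_score
```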
Each stacked layer uses a ReAct (Reasoning + Acting) loop:
Thought: [Analyze the problem]
Action: [Call a tool/model]
Action Input: {"query": "..."}
Observation: [Tool response]
... (iterate up to 15 times)
Final Answer: [Integrated result]
Critical Design Choice: The prompt enforces that agents must call all available tools at least once, preventing reliance on a single model and encouraging collaborative reasoning.
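For illustration, a minimal sketch of a ReAct-style control loop is shown below. The names `call_llm` and `tools` are hypothetical stand-ins for the project's model wrapper and registered tool objects; the real agent lives in `Stacking_agent/agent.py`.

```python
# Minimal ReAct-style loop (illustrative sketch; see Stacking_agent/agent.py for the real agent).
import json

def react_loop(question, tools, call_llm, max_steps=15):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)  # model emits Thought / Action / Action Input or Final Answer
        transcript += step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Parse the chosen tool and its JSON input, then execute it and feed back the observation.
        tool_name = step.split("Action:")[-1].split("\n")[0].strip()
        tool_input = json.loads(step.split("Action Input:")[-1].split("\n")[0])
        observation, _tokens = tools[tool_name]._run(tool_input["query"])
        transcript += f"\nObservation: {observation}\n"
    return transcript  # step budget exhausted without a final answer
```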
# Create conda environment
conda create -n llm-stacking python=3.10
conda activate llm-stacking
# Install dependencies
pip install -r requirements.txt

- Copy template.env to .env
- Add your OpenRouter API key:

OPENROUTER_API_KEY="your_api_key_here"

Note: This project uses OpenRouter to access multiple LLMs through a unified API. You can modify Stacking_agent/Basemodel.py to use other providers.
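For reference, a minimal sketch of an OpenRouter call, assuming the OpenAI-compatible endpoint and the official `openai` Python client; the project's actual wrapper is `Stacking_agent/Basemodel.py` and may be structured differently.

```python
# Illustrative OpenRouter call via the OpenAI-compatible endpoint
# (the project's real wrapper is Stacking_agent/Basemodel.py).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="qwen/qwen-2.5-72b-instruct",  # any OpenRouter model ID
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```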
Full training + testing (automatic stacking):
python main.py \
--Task "MATH-500" \
--tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
--topN 5 \
--tool_number 2 \
  --train_data_number 20

Test specific stacking structure (skip training):
python main.py \
--Task "MATH-500" \
--no_train \
--Stacking "['Qwen25_72B_2','Llama33_70B_1']" \
--topN 5 \
  --tool_number 2

Run the full pipeline on GPQA:

python main.py \
--Task "GPQA" \
--tools "['Llama33_70B','DeepseekR1DistillQwen32B','Qwen25_72B','DeepseekR1DistillLlama70B','Qwen3_32B']" \
--topN 5 \
--tool_number 2 \
  --train_data_number 20

Run the GPT-4o baselines:

# MATH-500 baseline
python baseline/MATH-500/gpt4o.py
# GPQA baseline
python baseline/GPQA/gpt4o.py

Project structure:
.
├── main.py                          # Main experiment runner
├── Stacking_agent/
│   ├── Stacking.py                  # Hierarchical stacking algorithm
│   ├── warmup.py                    # Warm-up phase: evaluate individual models
│   ├── agent.py                     # ReAct agent implementation
│   ├── generator.py                 # Dynamic agent code generation
│   ├── Basemodel.py                 # LLM API wrapper
│   ├── utils.py                     # Metrics & task configurations
│   ├── prompt/
│   │   └── ReAct_prompt.py          # ReAct prompt templates
│   └── tools/
│       ├── Llama.py                 # Llama-3.3-70B tool
│       ├── Qwen25_72B.py            # Qwen2.5-72B tool
│       ├── DeepseekR1DistillQwen32B.py
│       ├── DeepseekR1DistillLlama70B.py
│       └── Qwen3_32B.py
├── Dataset/
│   ├── MATH-500/
│   │   ├── train.json               # Training samples (20)
│   │   └── test.json                # Test samples (500)
│   └── GPQA/
│       ├── train.json
│       ├── test.json
│       └── all.json
├── baseline/
│   ├── MATH-500/
│   │   └── gpt4o.py                 # GPT-4o baseline
│   └── GPQA/
│       └── gpt4o.py
└── Result/
    └── Stacking/
        ├── MATH-500/                # Experimental results
        └── GPQA/
| Parameter | Description | Typical Value |
|---|---|---|
| `--Task` | Task name (MATH-500, GPQA) | Required |
| `--tools` | List of tool names to evaluate | `"['Llama33_70B','Qwen25_72B',...]"` |
| `--topN` | Number of top tools to stack | 5 |
| `--tool_number` | Tools per agent in stacking | 2 |
| `--train_data_number` | Training samples for warm-up | 20 |
| `--no_train` | Skip training, use provided stacking | Flag |
| `--Stacking` | Predefined stacking structure | `"['Model_2','Model_1']"` |
- Create a tool file in Stacking_agent/tools/:
# Stacking_agent/tools/YourModel.py
from Stacking_agent.Basemodel import Basemodel

class YourModel(Basemodel):
    def __init__(self):
        super().__init__(
            model_name="provider/model-name",  # OpenRouter model ID
            temperature=0.7
        )
        self.name = "YourModel"
        self.description = "Description of what this model does well"

    def _run(self, query, **kwargs):
        messages = [{"role": "user", "content": query}]
        response = self.chat(messages)
        return response, self.total_tokens

- Register the model in the generator:
# Stacking_agent/generator.py
self.tool_mapping = {
    ...
    'YourModel': 'YourModel()',
    ...
}

- Import it in tools/__init__.py:
from .YourModel import YourModel

After running experiments, results are saved to Result/Stacking/{Task}/:
Warm-up Phase:
- warmup_{ModelName}_0.json: Base model performance (no agent)
- warmup_{ModelName}_1.json: Single-model agent performance
- warmup_{ModelName}_2.json: Recursive stacking performance
Stacking Phase:
- Stacking_['Model_X', 'Model_Y'].json: Combined agent performance
- ['Model_X', 'Model_Y']_5_2_20.json: Final test results
Log Files:
- log/{Task}_{timestamp}.txt: Complete stacking results and scores
MATH-500:
- Exact Match: Normalized exact answer matching
- Accuracy: Percentage of correctly solved problems
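As an illustration, a minimal normalized exact-match check is sketched below; the project's actual metric lives in `Stacking_agent/utils.py` and may normalize more cases.

```python
# Illustrative normalized exact-match check (the real metric is in Stacking_agent/utils.py).
def normalize(answer: str) -> str:
    """Strip whitespace, a trailing period, and surrounding $...$; drop spaces and lowercase."""
    answer = answer.strip().rstrip(".").strip("$")
    return answer.replace(" ", "").lower()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match(" $42$. ", "42"))  # True
```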
GPQA:
- Accuracy: Multiple-choice answer accuracy
- Per-domain breakdown: Physics, Chemistry, Biology
Naming convention:
- ModelName_0: Base model (no stacking)
- ModelName_1: Agent([ModelName_0, ModelName_0])
- ModelName_2: Agent([ModelName_1, ModelName_0])
- ['Model_2', 'Another_1']: Collaborative agent with two different models

For example, Qwen25_72B_2 expands to Agent([Qwen25_72B_1, Qwen25_72B_0]), where Qwen25_72B_1 is itself Agent([Qwen25_72B_0, Qwen25_72B_0]).
This framework was originally developed as ChemAmp for chemistry tasks, then adapted for general-domain reasoning to explore:
- Cost-Performance Tradeoffs: Can coordinated small models match large models at lower cost?
- Complementary Strengths: How do different model architectures collaborate?
- Emergent Capabilities: Does hierarchical stacking unlock new reasoning patterns?
Key Finding: Stacking typically improves performance by 1-3 layers before plateauing, suggesting diminishing returns in deeper hierarchies.
- Multi-agent experiment code adapted from GPTSwarm and AgentPrune
- Original chemistry framework: ChemAmp (Chemical Amplified Chemistry Tools)
Contributions welcome! Please:
- Test new models on both MATH-500 and GPQA
- Document performance changes in PRs
- Follow the existing code structure in Stacking_agent/tools/