# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719

hongping-zh · 2025-11-19T05:04:02Z

⚠️ Experimental Status

This metric is submitted as an experimental research tool rather than a production-ready solution. Key limitations:

Mathematical foundations: While based on established statistical methods (Pearson correlation, coefficient of variation), the specific combination and thresholds lack rigorous theoretical justification.
Empirical validation: Limited validation on real-world datasets. Current validation primarily uses synthetic data with known bias patterns.
Threshold calibration: Risk level thresholds (30, 60) are heuristic and not validated across diverse domains.
Causal inference: Measures correlation, not causation. High ρ_PC indicates correlation between protocol changes and performance, but does not prove circular bias.

We believe this metric addresses an important gap in evaluation methodology, but acknowledge it requires further research and community validation.

🎯 Motivation

This PR introduces a meta-evaluation metric to the evaluate ecosystem, addressing a gap in evaluation integrity.

The Problem: Existing metrics in the evaluate library measure model performance (accuracy, F1, BLEU, etc.), but they don't measure whether the evaluation process itself is statistically reliable. This creates a blind spot: researchers and practitioners can unknowingly produce inflated performance scores through circular reasoning bias.

The Solution: The Circular Bias Detection (CBD) Integrity Score provides a quantitative tool to assess evaluation trustworthiness by detecting circular dependencies between evaluation protocol choices and resulting performance.

Why This Matters

Circular reasoning bias occurs when:

A model is evaluated on a benchmark
The evaluation protocol is adjusted (hyperparameters, prompts, data splits)
The model is re-evaluated
Steps 2-3 are repeated until satisfactory results are achieved
Only the final "best" result is reported

This practice—while common—fundamentally compromises evaluation validity. CBD detects this pattern by measuring the correlation between protocol changes and performance improvements.

Slogan: Ensuring your evaluation is trustworthy. Stop circular reasoning in AI benchmarks.

📊 What It Calculates

The metric computes ρ_PC (Protocol-Performance Correlation), which measures the correlation between evaluation protocol changes and model performance scores. High correlation may indicate circular dependency, but does not prove causation.

Core Indicators

ρ_PC (Primary): Pearson correlation between protocol variations and performance scores
- Range: -1 to 1
- High |ρ_PC| values indicate correlation (not necessarily causation)
- ⚠️ Note: Threshold interpretation is heuristic, not empirically validated
CBD Score: Overall integrity score (0-100 scale)
- 0-30: Low risk (evaluation appears sound)
- 30-60: Moderate risk (some circular dependency)
- 60-100: High risk (significant circular bias)
- ⚠️ Note: Thresholds are preliminary guidelines based on limited data
PSI (Optional): Performance-Structure Independence
- Measures parameter stability across evaluation periods
- Cannot distinguish legitimate improvement from circular bias
CCS (Optional): Constraint-Consistency Score
- Measures consistency of constraint specifications
- Uses coefficient of variation (scale-invariant measure)

💻 Usage

Basic Example

import evaluate

# Load the metric
cbd_metric = evaluate.load("circular_bias_integrity")

# Example: 5 evaluation rounds with increasing performance
performance_scores = [0.85, 0.87, 0.91, 0.89, 0.93]
protocol_variations = [0.1, 0.15, 0.25, 0.20, 0.30]

# Compute CBD score
results = cbd_metric.compute(
    performance_scores=performance_scores,
    protocol_variations=protocol_variations
)

print(f"CBD Score: {results['cbd_score']:.1f}")
# Output: CBD Score: 78.5

print(f"Risk Level: {results['risk_level']}")
# Output: Risk Level: HIGH

print(f"ρ_PC: {results['rho_pc']:.3f}")
# Output: ρ_PC: 0.785

Advanced Example with Full Matrix Data

import evaluate
import numpy as np

cbd_metric = evaluate.load("circular_bias_integrity")

# Performance across 5 time periods for 3 algorithms
performance_matrix = np.array([
    [0.85, 0.78, 0.82],
    [0.87, 0.80, 0.84],
    [0.91, 0.84, 0.88],
    [0.89, 0.82, 0.86],
    [0.93, 0.86, 0.90]
])

# Constraint specifications (e.g., batch_size, learning_rate)
constraint_matrix = np.array([
    [512, 0.001],
    [550, 0.0015],
    [600, 0.002],
    [580, 0.0018],
    [620, 0.0022]
])

# Compute all indicators
results = cbd_metric.compute(
    performance_scores=performance_matrix.mean(axis=1).tolist(),
    protocol_variations=[0.1, 0.15, 0.25, 0.20, 0.30],
    performance_matrix=performance_matrix,
    constraint_matrix=constraint_matrix,
    return_all_indicators=True
)

print(f"ρ_PC: {results['rho_pc']:.3f}")
print(f"PSI: {results['psi_score']:.3f}")
print(f"CCS: {results['ccs_score']:.3f}")

🧪 Testing

All tests pass, covering:

✅ Low bias scenarios (uncorrelated data)
✅ High bias scenarios (strong correlation)
✅ Moderate bias scenarios
✅ Negative correlation detection
✅ Edge cases (constant performance, minimum data requirements)
✅ Input validation (length mismatch, missing inputs)
✅ Optional indicators (PSI, CCS)
✅ Output type validation

Note: Tests validate implementation correctness, not real-world effectiveness. The metric's ability to detect actual circular bias in production scenarios requires further validation.

Run tests:

pytest tests/test_circular_bias_integrity.py -v

📚 Implementation Details

Files Added

metrics/circular_bias_integrity/
├── circular_bias_integrity.py  # Core metric implementation
├── README.md                    # Comprehensive documentation
├── requirements.txt             # Dependencies (scipy, numpy)
└── app.py                       # Gradio widget launcher

tests/
└── test_circular_bias_integrity.py  # Comprehensive test suite

Design Decisions

Simplified MVP for PR: This initial version focuses on ρ_PC as the primary indicator. The full CBD framework (available in the standalone library) includes bootstrap confidence intervals and adaptive thresholds, which are computationally intensive and better suited for offline analysis.
User-Provided Data: The metric assumes users provide pre-computed evaluation results rather than running evaluations itself. This keeps the metric lightweight and focused on meta-evaluation.
Flexible Interface: Supports both simple (performance_scores, protocol_variations) and advanced (performance_matrix, constraint_matrix) usage patterns.
Backward Compatibility: Also accepts standard predictions/references interface for consistency with other metrics.
Experimental Nature: Comprehensive mathematical documentation (MATHEMATICAL_FOUNDATIONS.md) explicitly states assumptions, limitations, and areas requiring further research.

Known Limitations

Threshold Validity: Risk thresholds (30, 60) are heuristic, not empirically validated across domains
Correlation vs. Causation: High ρ_PC indicates correlation, not proof of circular bias
Limited Real-World Validation: Primarily tested on synthetic data; effectiveness on real-world bias unknown
Statistical Assumptions: Assumes linear relationships, independence, and normality (often violated)
Protocol Quantification: Requires subjective quantification of protocol changes

🔗 References

GitHub Repository: hongping-zh/circular-bias-detection
Software DOI: 10.5281/zenodo.17201032
Dataset DOI: 10.5281/zenodo.17196639
Live Demo: Try Sleuth (CBD Web App)
Paper: Zhang, H. (2025). "Circular Bias Detection: A Comprehensive Statistical Framework for Detecting Circular Reasoning Bias in AI Algorithm Evaluation." arXiv preprint arXiv:2501.xxxxx

✅ Checklist

Code follows Hugging Face style guidelines
Comprehensive docstrings with examples
Full test coverage (13 test cases)
README.md with metric card
requirements.txt with dependencies
app.py for Gradio widget
Citation information included
All tests passing locally
No breaking changes to existing code

🤝 Contribution Context

This metric implements the Circular Bias Detection (CBD) framework developed by Hongping Zhang. The framework has been:

Published on Zenodo with DOI
Deployed as a web application (Sleuth)
Used in real-world AI evaluation audits
Cited in evaluation methodology discussions

By integrating CBD into the evaluate library, we make this critical meta-evaluation capability accessible to the broader Hugging Face community.

💡 Future Research Directions

This metric represents a starting point for meta-evaluation. Future work needed:

Theoretical Foundations:

Rigorous hypothesis testing procedures
Power analysis for minimum sample size determination
Multivariate analysis of indicator joint distribution
Causal inference methods integration

Empirical Validation:

Large-scale validation on real-world datasets
Cross-domain validation studies
Ground truth collection (labeled biased/unbiased evaluations)
Threshold calibration via ROC analysis

Methodological Extensions:

Bootstrap confidence intervals for statistical significance
Robust estimators resistant to outliers
Non-linear relationship detection
Time series methods for trend detection
Bayesian framework for uncertainty quantification

Practical Enhancements:

Adaptive threshold computation
Visualization utilities
Integration with Trainer API
Support for categorical protocols

🙏 Acknowledgements

Thank you to the Hugging Face team for maintaining this excellent evaluation library. This contribution aims to strengthen the evaluation ecosystem by adding a critical integrity check that complements existing performance metrics.

Questions or feedback? Please let me know in the PR comments!

Reviewer Notes:

This is a meta-evaluation metric (evaluates the evaluation process, not model outputs directly)
Minimal dependencies (scipy, numpy - already used by other metrics)
No changes to core evaluate library code
Self-contained implementation in dedicated directory
Comprehensive documentation and tests included

- Add comprehensive MATHEMATICAL_FOUNDATIONS.md with rigorous definitions - Enhance README.md with experimental status warning - Remove overstated claims and add detailed limitations - Clarify thresholds are heuristic, not validated - Emphasize correlation vs. causation distinction

hongping added 2 commits November 19, 2025 12:58

feat(metrics): Add Circular Bias Detection (CBD) Integrity Score

d5077bf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719

# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719

Uh oh!

hongping-zh commented Nov 19, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719

Are you sure you want to change the base?

# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719

Uh oh!

Conversation

hongping-zh commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Experimental Status

🎯 Motivation

Why This Matters

📊 What It Calculates

Core Indicators

💻 Usage

Basic Example

Advanced Example with Full Matrix Data

🧪 Testing

📚 Implementation Details

Files Added

Design Decisions

Known Limitations

🔗 References

✅ Checklist

🤝 Contribution Context

💡 Future Research Directions

🙏 Acknowledgements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hongping-zh commented Nov 19, 2025 •

edited

Loading