# [New Metric] Add Circular Bias Detection (CBD) Integrity Score #719
This metric is submitted as an experimental research tool rather than a production-ready solution. Key limitations:
- **Mathematical foundations:** While based on established statistical methods (Pearson correlation, coefficient of variation), the specific combination and thresholds lack rigorous theoretical justification.
- **Empirical validation:** Validation on real-world datasets is limited; current validation primarily uses synthetic data with known bias patterns.
- **Threshold calibration:** The risk-level thresholds (30, 60) are heuristic and have not been validated across diverse domains.
- **Causal inference:** The metric measures correlation, not causation. A high ρ_PC indicates correlation between protocol changes and performance, but does not prove circular bias.
We believe this metric addresses an important gap in evaluation methodology, but acknowledge it requires further research and community validation.
🎯 Motivation
This PR introduces a meta-evaluation metric to the `evaluate` ecosystem, addressing a gap in evaluation integrity.

**The Problem:** Existing metrics in the `evaluate` library measure model performance (accuracy, F1, BLEU, etc.), but they don't measure whether the evaluation process itself is statistically reliable. This creates a blind spot: researchers and practitioners can unknowingly produce inflated performance scores through circular reasoning bias.

**The Solution:** The Circular Bias Detection (CBD) Integrity Score provides a quantitative tool to assess evaluation trustworthiness by detecting circular dependencies between evaluation protocol choices and the resulting performance.
Why This Matters
Circular reasoning bias occurs when evaluation protocols are tuned in response to observed model performance, so that the resulting scores partly reflect the tuning rather than genuine model quality.

This practice, while common, fundamentally compromises evaluation validity. CBD detects this pattern by measuring the correlation between protocol changes and performance improvements.
Slogan: Ensuring your evaluation is trustworthy. Stop circular reasoning in AI benchmarks.
📊 What It Calculates
The metric computes ρ_PC (Protocol-Performance Correlation), which measures the correlation between evaluation protocol changes and model performance scores. High correlation may indicate circular dependency, but does not prove causation.
Core Indicators
- **ρ_PC (primary):** Pearson correlation between protocol variations and performance scores
- **CBD Score:** overall integrity score (0-100 scale)
- **PSI (optional):** Performance-Structure Independence
- **CCS (optional):** Constraint-Consistency Score
💻 Usage
Basic Example
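The example code did not survive extraction, so here is a minimal self-contained sketch of the computation described above: ρ_PC as the Pearson correlation between protocol variations and performance scores, mapped onto a 0-100 score with the heuristic (30, 60) thresholds. The function name, the scaling formula `(1 - |ρ_PC|) × 100`, and the direction of the risk mapping are illustrative assumptions, not the metric's actual API.

```python
import numpy as np

def cbd_integrity(protocol_variations, performance_scores):
    """Illustrative sketch of the simple interface (name and formula assumed)."""
    # rho_PC: Pearson correlation between protocol changes and performance
    rho_pc = float(np.corrcoef(protocol_variations, performance_scores)[0, 1])
    # Assumed 0-100 scaling: strong |correlation| -> low integrity score
    score = (1.0 - abs(rho_pc)) * 100.0
    # Heuristic thresholds (30, 60) from the limitations section; the
    # direction of the mapping is an assumption
    if score >= 60:
        risk = "low"
    elif score >= 30:
        risk = "medium"
    else:
        risk = "high"
    return {"rho_pc": rho_pc, "cbd_score": score, "risk_level": risk}

# Scores that climb in lockstep with successive protocol revisions -> suspicious
result = cbd_integrity([1, 2, 3, 4, 5], [0.61, 0.64, 0.70, 0.74, 0.79])
```

With these inputs the correlation is near 1, so the sketch reports a low integrity score and a high risk level; uncorrelated inputs would score near 100.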
Advanced Example with Full Matrix Data
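The matrix example is also missing from the extracted text. The sketch below shows one plausible reading of the advanced interface, assuming `performance_matrix` has one row per model and one column per successive protocol variant: correlate each model's scores with the variant index and aggregate the absolute correlations. The aggregation scheme is my assumption, and PSI/CCS are omitted because their formulas are not given in this description.

```python
import numpy as np

def cbd_from_matrix(performance_matrix):
    """Illustrative matrix-mode sketch (shape convention and aggregation assumed)."""
    M = np.asarray(performance_matrix, dtype=float)
    variant_index = np.arange(M.shape[1])  # 0, 1, 2, ... protocol revisions
    # Per-model Pearson correlation between scores and revision order
    per_model_rho = np.array(
        [np.corrcoef(variant_index, row)[0, 1] for row in M]
    )
    # Aggregate indicator: mean absolute correlation across models
    rho_pc = float(np.mean(np.abs(per_model_rho)))
    return {
        "per_model_rho": per_model_rho,
        "rho_pc": rho_pc,
        "cbd_score": (1.0 - rho_pc) * 100.0,  # same assumed scaling as above
    }

matrix = [
    [0.60, 0.66, 0.71, 0.78],  # model A: improves with every protocol revision
    [0.54, 0.56, 0.56, 0.54],  # model B: fluctuates independently of revisions
]
out = cbd_from_matrix(matrix)
```

Here model A's scores track the protocol revisions almost perfectly while model B's do not, so the per-model correlations separate the two cases cleanly.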
🧪 Testing
All tests pass, covering:
Note: Tests validate implementation correctness, not real-world effectiveness. The metric's ability to detect actual circular bias in production scenarios requires further validation.
Run tests:
📚 Implementation Details
Files Added
Design Decisions
- **Simplified MVP for this PR:** This initial version focuses on ρ_PC as the primary indicator. The full CBD framework (available in the standalone library) includes bootstrap confidence intervals and adaptive thresholds, which are computationally intensive and better suited to offline analysis.
- **User-provided data:** The metric assumes users provide pre-computed evaluation results rather than running evaluations itself. This keeps the metric lightweight and focused on meta-evaluation.
- **Flexible interface:** Supports both the simple (`performance_scores`, `protocol_variations`) and advanced (`performance_matrix`, `constraint_matrix`) usage patterns.
- **Backward compatibility:** Also accepts the standard `predictions`/`references` interface for consistency with other metrics.
- **Experimental nature:** Comprehensive mathematical documentation (MATHEMATICAL_FOUNDATIONS.md) explicitly states assumptions, limitations, and areas requiring further research.
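For reviewers curious about the bootstrap confidence intervals deferred to the standalone library, a standard percentile-bootstrap recipe for ρ_PC looks like the following. This is a generic sketch of the technique, not the standalone library's implementation, and the function name is hypothetical.

```python
import numpy as np

def bootstrap_rho_ci(protocol, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for rho_PC (a standard recipe, assumed here)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(protocol, dtype=float)
    y = np.asarray(scores, dtype=float)
    n = len(x)
    rhos = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (protocol, score) pairs
        xb, yb = x[idx], y[idx]
        if np.std(xb) == 0 or np.std(yb) == 0:  # degenerate resample: skip
            continue
        rhos.append(np.corrcoef(xb, yb)[0, 1])
    lo, hi = np.quantile(rhos, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_rho_ci([1, 2, 3, 4, 5, 6], [0.60, 0.63, 0.69, 0.71, 0.76, 0.80])
```

Resampling pairs with replacement preserves the (protocol, score) coupling, which is what makes the interval meaningful for a correlation; the n_boot=2000 default is why this is better suited to offline analysis than to a per-evaluation metric call.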
Known Limitations
🔗 References
✅ Checklist
🤝 Contribution Context
This metric implements the Circular Bias Detection (CBD) framework developed by Hongping Zhang. The framework has been:
By integrating CBD into the `evaluate` library, we make this critical meta-evaluation capability accessible to the broader Hugging Face community.

💡 Future Research Directions
This metric represents a starting point for meta-evaluation. Future work needed:
Theoretical Foundations:
Empirical Validation:
Methodological Extensions:
Practical Enhancements:
`Trainer` API integration

🙏 Acknowledgements
Thank you to the Hugging Face team for maintaining this excellent evaluation library. This contribution aims to strengthen the evaluation ecosystem by adding a critical integrity check that complements existing performance metrics.
Questions or feedback? Please let me know in the PR comments!
Reviewer Notes:
`evaluate` library code