[Celestica] Icecube: Config: Implement Software Overtemp Protection (OTP) for TH6 ASIC #1208
Open
zhongedward wants to merge 1 commit into
@mikechoifb has imported this pull request. If you are a Meta employee, you can view this in D106028063.
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit install
pre-commit run
Summary
This PR introduces a software-based Overtemp Protection (OTP) mechanism for the Icecube platform. By integrating sensor monitoring with the fan-service shutdown logic, we ensure the hardware is protected during thermal anomalies before it reaches catastrophic physical limits.
Currently, the Icecube platform lacks a software-driven emergency power-off sequence for the TH6 ASIC. Relying solely on hardware-level protection can be risky if the thermal ramp is too steep. This change establishes a "Soft OTP" layer that triggers an orderly shutdown when the TH6 temperature reaches the critical threshold.
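For reviewers, the Soft OTP decision described above can be sketched roughly as follows. This is an illustration only, not the fan_service implementation: the function names (first_overtemp, run_soft_otp) are hypothetical, and the "greater than or equal" comparison is an assumption about how the threshold is evaluated. The only value taken from this PR is the 101.0°C critical threshold.

```python
from typing import Callable, Iterable, Optional

# Critical threshold from this PR (upperCriticalVal for TH6_TEMP).
CRITICAL_C = 101.0


def first_overtemp(
    readings: Iterable[float], threshold_c: float = CRITICAL_C
) -> Optional[float]:
    """Return the first reading at or above the threshold, else None.

    Assumption: the trip condition is `>=`; the real service may differ.
    """
    for temp_c in readings:
        if temp_c >= threshold_c:
            return temp_c
    return None


def run_soft_otp(
    readings: Iterable[float],
    shutdown: Callable[[], None],
    threshold_c: float = CRITICAL_C,
) -> bool:
    """Invoke `shutdown` once if any reading trips the threshold.

    `shutdown` is a hypothetical stand-in for whatever executes shutdownCmd.
    Returns True if the shutdown action was invoked.
    """
    if first_overtemp(readings, threshold_c) is None:
        return False
    shutdown()
    return True
```

Passing the shutdown action in as a callable keeps the sketch testable without touching hardware: a test can supply a stub, while a real caller would supply the power-off action.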
Key Changes
- platform_manager.json: Exported the SMB_CPLD sysfs path to ensure sensor_service has consistent access to temperature registers.
- sensor_service.json: Defined the TH6_TEMP sensor (mapped to SMB_CPLD) with a critical threshold (upperCriticalVal) of 101.0°C.
- fan_service.json:
  - shutdownCondition triggered by TH6_TEMP.
  - shutdownCmd to explicitly disable TH6 power via SMB_CPLD (echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en).
Test Plan
Syntax Validation: Validated JSON syntax.
Formatting: Pretty-printed configurations using the jq command for readability.
Build & Config Tests: Compilation and configuration validation tests passed successfully.
Service Verification: Confirmed that the following services start and run without errors:
End-to-End Thermal Protection Verification (Soft OTP)
To verify the effectiveness of the software shutdown logic, we performed a controlled thermal stress test:
Methodology:
- platform_manager. This ensures the hardware-level protection is bypassed during the test window, allowing the software logic to be the primary defender.
- fan_service polling cycle and system logs to capture the exact trigger point.
Test Result:
- When TH6_TEMP hit the software-defined threshold of 101°C, the fan_service successfully identified the shutdownCondition.
- shutdownCmd was immediately triggered, executing: echo 0 > /run/devmap/cplds/SMB_CPLD/th6_pwr_en
- The attached .zip contains specific evidence:
  - temp_shutdown_monitor: Captures the thermal ramp and the fan_service trigger event at 101.604°C.
  - pcie_shutdown_monitor: Verifies the physical removal of TH6 from the PCIe bus post-shutdown, confirming successful power-off.
Attachment:
icecube_sw_OTP_test_2026_04_24_log.zip
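The shutdownCmd exercised in this test is a single sysfs write. For dry runs away from hardware, an equivalent could be sketched in Python as below. The helper name disable_th6_power is hypothetical, and reading the attribute back after the write is an assumption that may not hold for the real CPLD attribute. Only the path and the value 0 come from this PR.

```python
from pathlib import Path

# Path used by shutdownCmd in this PR.
TH6_PWR_EN = Path("/run/devmap/cplds/SMB_CPLD/th6_pwr_en")


def disable_th6_power(pwr_en: Path = TH6_PWR_EN) -> str:
    """Write 0 to the power-enable attribute and return the value read back.

    Equivalent in intent to: echo 0 > <pwr_en>
    Assumption: the attribute is readable after the write.
    """
    pwr_en.write_text("0\n")
    return pwr_en.read_text().strip()
```

Pointing pwr_en at a scratch file lets the write be exercised on a development machine without touching the CPLD.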