Skip to content

Conversation

aneeshkp
Copy link
Contributor

@aneeshkp aneeshkp commented Aug 21, 2025

This commit implements logic to correctly update the clock class
when a FAULTY/RECOVERY condition of the port is detected.
Summary :
Ensure PMC clock-class queries run at most once per configName concurrently.
Trigger PMC on port Faulty/Holdover and recovery transitions to refresh GM class.
Run PMC polling outside the log scanner so we still update when ptp4l is silent.

What’s changed
Per-config single-flight guard:
Added a shared sync.Map keyed by configName to serialize updateClockClass across processes referencing the same config.
updateClockClass now CASes on the shared guard and logs when a run is skipped.
Cleanup: guard entry is removed on cmdStop() of the corresponding process.

Faulty/Holdover and recovery handling:
When the clock enters HOLDOVER or LOCKED without a specific iface, we now trigger updateClockClass (after a short delay) so GM class is refreshed on port fault/recovery events, not just on normal offset updates.
PMC polling outside the scanner:
Introduced/relied on a background loop that consumes a pmcCheck flag and calls updateClockClass independent of log scanner activity.
A periodic ticker sets pmcCheck for ptp4l processes; the background loop handles the PMC call even if ptp4l is not emitting logs.

Release note
Improve clock class update robustness: deduplicate PMC queries per config, update on Faulty/Holdover and recovery, and poll PMC independently of log output.

@openshift-ci openshift-ci bot requested review from vitus133 and zshi-redhat August 21, 2025 18:40
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2025
@aneeshkp aneeshkp force-pushed the fix-clockclass-4.18E branch 3 times, most recently from 116ef40 to c5380f8 Compare August 25, 2025 14:34
@aneeshkp aneeshkp changed the title fix(clockclass): handle port change clock class update OCPBUGS-59883: When Locking PTP Source to One NIC With Dual NIC PTP Synchronization Configured, Incorrect Clock Class Reported via REST API Aug 25, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 25, 2025
@openshift-ci-robot
Copy link
Contributor

@aneeshkp: This pull request references Jira Issue OCPBUGS-59883, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-59883 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This commit implements logic to correctly update the clock class
when a FAULTY/RECOVERY condition of the port is detected.
Summary :
Ensure PMC clock-class queries run at most once per configName concurrently.
Trigger PMC on port Faulty/Holdover and recovery transitions to refresh GM class.
Run PMC polling outside the log scanner so we still update when ptp4l is silent.

What’s changed
Per-config single-flight guard:
Added a shared sync.Map keyed by configName to serialize updateClockClass across processes referencing the same config.
updateClockClass now CASes on the shared guard and logs when a run is skipped.
Cleanup: guard entry is removed on cmdStop() of the corresponding process.

Faulty/Holdover and recovery handling:
When the clock enters HOLDOVER or LOCKED without a specific iface, we now trigger updateClockClass (after a short delay) so GM class is refreshed on port fault/recovery events, not just on normal offset updates.
PMC polling outside the scanner:
Introduced/relied on a background loop that consumes a pmcCheck flag and calls updateClockClass independent of log scanner activity.
A periodic ticker sets pmcCheck for ptp4l processes; the background loop handles the PMC call even if ptp4l is not emitting logs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vitus133
Copy link
Contributor

Looks good for 4.18.
For 4.20 we need to make sure this code is not called in the T-BC setting (the upstream already has this check). Why do we start the fixing from 4.18 and not from the upstream?

} else if clockState == HOLDOVER || clockState == LOCKED {
// in case of holdover without iface, still need to update clock class for T_G
if p.name != ts2phcProcessName && p.name != syncEProcessName { // TGM announce clock class via events
p.SetPmcCheck(false) // reset pmc check since we are updating clock class here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be best to use consume for all times of setting it to false to make it easier to search of when it happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made changes as suggested

@@ -961,6 +1011,9 @@ func (p *ptpProcess) cmdStop() {
return
}
p.setStopped(true)
// reset runtime flags
p.SetPmcCheck(false)
atomic.StoreInt32(&p.clockClassRunning, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use atomic.Bool to properly signal what this used for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok , let me try to replace with bool for better readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@aneeshkp aneeshkp force-pushed the fix-clockclass-4.18E branch from c5380f8 to 610328c Compare August 26, 2025 15:54
@aneeshkp aneeshkp force-pushed the fix-clockclass-4.18E branch from 610328c to efe6e28 Compare August 28, 2025 16:53
Copy link
Contributor

openshift-ci bot commented Aug 28, 2025

@aneeshkp: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aneeshkp aneeshkp changed the title OCPBUGS-59883: When Locking PTP Source to One NIC With Dual NIC PTP Synchronization Configured, Incorrect Clock Class Reported via REST API [release-4-18] OCPBUGS-59883: When Locking PTP Source to One NIC With Dual NIC PTP Synchronization Configured, Incorrect Clock Class Reported via REST API Aug 29, 2025
@nocturnalastro
Copy link
Contributor

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Sep 5, 2025
@nocturnalastro
Copy link
Contributor

/ltgm

@nocturnalastro
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 5, 2025
Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aneeshkp, nocturnalastro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aneeshkp
Copy link
Contributor Author

aneeshkp commented Sep 5, 2025

/verified later @Bonnie-Block

@openshift-ci-robot openshift-ci-robot added verified-later verified Signifies that the PR passed pre-merge verification criteria labels Sep 5, 2025
@openshift-ci-robot
Copy link
Contributor

@aneeshkp: This PR has been marked to be verified later by @Bonnie-Block.

In response to this:

/verified later @Bonnie-Block

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria verified-later
Projects
None yet
Development

Successfully merging this pull request may close these issues.