Suggestion: pager/alerts/auto create issues for infra-related issues #2359

Closed
@mmarchini

Sometimes jobs fail for days on easily fixable problems, like a read-only fs, until someone notices. Ideally, when a job fails for an infra-related reason, collaborators ping the WG, but sometimes that doesn't happen, and other times the WG is overloaded with pings for many reasons (not only infra-related ones). On top of that, GitHub's notifications interface doesn't provide an easy way to see all pings to a specific team, and since most of us are in multiple teams, we get a lot of mixed "Team Mention" notifications.

What if we could identify (most) infra issues on Jenkins and let the WG know in a timely fashion? This would help us act on issues sooner when we are available. This wouldn't imply an SLA for the WG: we're all still volunteers, and if no one is available, the issue will remain unfixed until someone is, which is fine.

I'm not sure whether the best approach is to create an issue here, to send a notification on IRC or by email, or to use something like PagerDuty. Regardless of how we get notified, I believe we can accomplish this with a small effort and without introducing too much maintenance burden. Jenkins is already hooked up to github-bot, and ncu-ci has some heuristics to identify infra and Jenkins issues (although today it lumps infra issues together with build issues, which is easily fixable). We could use similar heuristics on github-bot to identify potential infra issues and send alerts when they happen. Or github-bot could forward all failures to a separate service that does that (if we want to decouple but don't want to add a new hook to Jenkins). We could even add thresholds for certain errors (for example, "read-only fs" could trigger on the first occurrence, but flakier issues like a corrupted git directory could require X out of Y failures to trigger).
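
To make the threshold idea concrete, here is a rough sketch of what per-error rules could look like. It's in Python just for illustration (github-bot itself is not written in Python), and everything in it is an assumption: the `Rule` and `AlertTracker` names, the regex patterns, and the thresholds are made up for this example and are not taken from ncu-ci or github-bot.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    needed: int  # X: matching failures required before alerting
    window: int  # Y: number of most recent failures considered


# Hypothetical rules: patterns and thresholds are illustrative only.
RULES = [
    Rule("read-only-fs", re.compile(r"Read-only file system", re.I), needed=1, window=1),
    Rule("corrupt-git", re.compile(r"fatal: .*(corrupt|bad object)", re.I), needed=3, window=5),
]


class AlertTracker:
    """Tracks, per machine and rule, whether recent failures matched the rule."""

    def __init__(self, rules):
        self.rules = rules
        self.history = {}  # (machine, rule name) -> list of bools, newest last

    def record_failure(self, machine, log):
        """Record one failed job's log; return the names of rules that should alert."""
        alerts = []
        for rule in self.rules:
            matched = bool(rule.pattern.search(log))
            key = (machine, rule.name)
            # Keep only the last `window` failures for this machine and rule.
            recent = (self.history.get(key, []) + [matched])[-rule.window:]
            self.history[key] = recent
            # Alert only when the current failure matches and X of the last Y did.
            if matched and sum(recent) >= rule.needed:
                alerts.append(rule.name)
        return alerts
```

With these made-up rules, a single "Read-only file system" failure on a machine would alert immediately, while a corrupted git directory would only alert on the third matching failure among the last five on the same machine. The sketch doesn't deduplicate alerts, so a real version would probably want to avoid re-alerting on every subsequent failure.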

What do y'all think? If folks are on board with this, I can implement a proof of concept.
