Skip to content

Conversation

sgoggins
Copy link
Member

@sgoggins sgoggins commented Jul 2, 2025

Description

  • We think the release branch would get updated before doing the release? Or, is mergify already activated?

MoralCode and others added 25 commits June 25, 2025 16:17
Signed-off-by: Adrian Edwards <[email protected]>
Signed-off-by: Adrian Edwards <[email protected]>
- Replaces setup.py with pyproject.toml
- Creates .python-version file that will control the python version that
  uv uses in the virtualenv

Signed-off-by: John Strunk <[email protected]>
Signed-off-by: John Strunk <[email protected]>
Update GitHub workflow to include a new job that checks the uv lockfile,
ensuring it is up to date by installing uv and running the lock check
command.

Signed-off-by: John Strunk <[email protected]>
Added post_create_environment job to generate requirements.txt using uv,
and updated python install method to use the generated requirements file.

Signed-off-by: John Strunk <[email protected]>
Add keyman image and remove conditional login to GitHub Container Registry

Signed-off-by: John Strunk <[email protected]>
This adds a new test job to run on macOS. It installs the package
dependencies and runs the worker setup script.
Due to the interactive nature of `make install`, we can't use that
directly.

Signed-off-by: John Strunk <[email protected]>
This updates ugboost to fix the build on macos

Signed-off-by: John Strunk <[email protected]>
graphql-server-core 1.1.1 is incompatible w/ Python 3.10 due to changes in
the standard library. This ensures that a newer version gets installed.

Signed-off-by: John Strunk <[email protected]>
We fixed some documentation around how `uv run` gets executed in sequence with `nohup`

Signed-off-by: Sean P. Goggins <[email protected]>
Changed `git checkout dev` to `git checkout main` in docs file. 

Signed-off-by: Sean P. Goggins <[email protected]>
Fixing `nohup` and `uv` order in docs. 

Signed-off-by: Sean P. Goggins <[email protected]>
Fixing `nohup` sequencing using `uv run` as tested. 

Signed-off-by: Sean P. Goggins <[email protected]>
Fixed references to the `housekeeper`

Signed-off-by: Sean P. Goggins <[email protected]>
`nohup` ordering with `uv run`

Signed-off-by: Sean P. Goggins <[email protected]>
fixing `uv run` and `nohup` order in docs. 

Signed-off-by: Sean P. Goggins <[email protected]>
Try 2 - Switch from venv/pip to uv
Push the keyman container image
Comment on lines +34 to +42
name: runner / uv-lock
runs-on: ubuntu-latest
steps:
- name: Check out repository
uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Ensure uv lockfile is up to date
run: uv lock --check

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI 2 months ago

The fix involves adding a permissions block to the uv-lock job and assigning the least privileges required for its execution. Since the job steps seem to involve checking out code and running commands locally without modifying repository content, the contents: read permission is sufficient. This ensures the GITHUB_TOKEN has minimal access while keeping the workflow functional.


Suggested changeset 1
.github/workflows/checks.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
--- a/.github/workflows/checks.yml
+++ b/.github/workflows/checks.yml
@@ -44,6 +44,8 @@
   uv-lock:
     name: runner / uv-lock
     runs-on: ubuntu-latest
+    permissions:
+      contents: read
     steps:
       - name: Check out repository
         uses: actions/checkout@v4
EOF
@@ -44,6 +44,8 @@
uv-lock:
name: runner / uv-lock
runs-on: ubuntu-latest
permissions:
contents: read
steps:
- name: Check out repository
uses: actions/checkout@v4
Copilot is powered by AI and may make mistakes. Always verify output.
@sgoggins sgoggins requested a review from JohnStrunk July 2, 2025 00:21
MoralCode added 3 commits July 2, 2025 09:06
Signed-off-by: Adrian Edwards <[email protected]>
Signed-off-by: Adrian Edwards <[email protected]>
Signed-off-by: Adrian Edwards <[email protected]>
MoralCode and others added 3 commits July 10, 2025 20:28
Signed-off-by: Adrian Edwards <[email protected]>
determine whether to reset logs via new  AUGUR_RESET_LOGS environment variable

There are some linting issues revealed by this PR, but they are not a problem for this PR. They should be dealt with separately.
@@ -32,7 +32,9 @@

from keyman.KeyClient import KeyClient, KeyPublisher

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
W0611: Unused KeyClient imported from keyman.KeyClient (unused-import)

Copy link
Contributor

@MoralCode MoralCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I'm convinced it's just github weirdness.

We should merge a PR to bump the version number and edit the changelogs and other metadata fixes at the time we want to cut this release, but I'll stop being the squeaky wheel

MoralCode and others added 8 commits July 10, 2025 21:12
Signed-off-by: Adrian Edwards <[email protected]>
This changes the test for the DB connection components to not exit with
failure if AUGUR_DB can't be constructed.

Signed-off-by: John Strunk <[email protected]>
Address warnings from linter
Update entrypoint.sh construction of AUGUR_DB from env vars
fix missing docker -> podman in podman CI job
Handle 410 gone when querying github
@@ -124,6 +135,13 @@ def make_request(self, url, method="GET", timeout=100):
if response.status_code == 401:
raise NotAuthorizedException(f"Could not authorize with the github api")

if response.status_code == 410:
response_msg = response.json().get("message")
if response_msg is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
R1720: Unnecessary "else" after "raise", remove the "else" and de-indent the code inside it (no-else-raise)

@@ -145,13 +163,21 @@
return self.__make_request_with_retries(url, method, timeout)
except RetryError as e:
raise e.last_attempt.exception()

def _decide_retry_policy(exception: Exception) -> bool:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
E0213: Method '_decide_retry_policy' should have "self" as first argument (no-self-argument)

Returns:
bool: Boolean describing whether or not the request should be retried
"""
return not isinstance(num, (UrlNotFoundException, ResourceGoneException))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
E0602: Undefined variable 'num' (undefined-variable)

MoralCode and others added 5 commits July 17, 2025 18:09
Signed-off-by: Adrian Edwards <[email protected]>
Signed-off-by: Adrian Edwards <[email protected]>
use check_output for git log fetching so that it can fail more easily
correct docs links in error messages relating to keys
Copy link
Member Author

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MoralCode / @JohnStrunk / @Ulincsys : The main branch is leaking database connections. I have been through this PR to attempt to identify what change could have this side effect. Nothing is jumping out at me .. I made some comments, some of which are surely pointing a flashlight in the wrong corner ... If you have a chance to look this over and identify other possible places where this leak is introduced, THAT WOULD BE AWESOME!

NOTE: I have verified the database connection leak issue does NOT exist on the release branch.


@retry(stop=stop_after_attempt(10), wait=wait_fixed(5), retry=retry_if_exception(lambda exc: not isinstance(exc, UrlNotFoundException)))
@retry(stop=stop_after_attempt(10), wait=wait_fixed(5), retry=retry_if_exception(_decide_retry_policy))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be causing an issue where not found URL's are causing breaking errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in #3235

@@ -19,7 +20,22 @@
workers = multiprocessing.cpu_count() * 2 + 1
umask = 0o007
reload = True
reload_extra_files = glob(str(Path.cwd() / '**/*.j2'), recursive=True)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some issues here ... related to the disposal of database connections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on the issues you are seeing? This is a gunicorn config, so it shouldn't be affecting DB connections

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MoralCode : In short, I have my POstgresql instance set to 3,000 connections .. and within one hour of starting Augur, they are all consumed. Which is bad in myriad ways, right? For comparison, the version of Augur in release uses max 200 connections over weeks.

@@ -22,6 +22,17 @@ class UrlNotFoundException(Exception):
class NotAuthorizedException(Exception):
pass

class ResourceGoneException(Exception):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generating errors in testing. @MoralCode .. basically we are getting errors when these resource exceptions are hit and the job fails. Perhaps that is intended. Just checking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is fixed by #3235 (TL;DR my fault).

If a resource returns 410 Gone from github it is handled similarly to a 404 (i.e. bail out of the retry loop and keep going with collection rather than propagating the exception up to the worker and stopping everything)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right.

def __init__(self) -> None:
# Load channel names and IDs from the spec
for channel in spec["channels"]:
# IE: self.ANNOUNCE = "augur-oauth-announce"
setattr(self, channel["name"], channel["id"])
self.conn = get_redis_connection()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems possibly wrong. @MoralCode / @Ulincsys ... we have changed the import from as conn to simply be the library. I don't see where conn exists as an import any longer. If this fails, it seems it could be the source of the database connection leaks on main right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds plausible to me. I remember seeing that this keyclient file got refactored, but maybe it needs a deeper look

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on in my review it seemed like the object was getting used and reassigned in the new class made specifically for that ... I am suspicious of Redis in general, perhaps being used in a way that creates a database connection but doesn't let go of it ... or the process loses it ... Like I said, there is no change that appears to directly alter the way we are connecting to the database ... which is everyone's favorite type of bug, right?

@@ -172,7 +175,7 @@ def publish(self, key: str, platform: str):
"key_platform": platform
}

conn.publish(self.ANNOUNCE, json.dumps(message))
self.conn.publish(self.ANNOUNCE, json.dumps(message))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MoralCode / @Ulincsys : Here as well, conn is referenced but I don't think it exists any longer.

@@ -1,4 +1,4 @@
from augur.tasks.init.redis_connection import redis_connection as conn
from augur.tasks.init.redis_connection import get_redis_connection
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is where the as conn is removed, but is still in use through the rest of the file.

@@ -215,7 +218,7 @@ def wait(self, timeout_seconds = 30, republish = False):
}

listen_delta = 0.1
conn.publish(self.ANNOUNCE, json.dumps(message))
self.conn.publish(self.ANNOUNCE, json.dumps(message))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no conn

from augur.application.logs import AugurLogger

logger = AugurLogger("KeyOrchestrator").get_logger()
conn = get_redis_connection()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok .. here conn exists as a call to a method.

@@ -29,8 +29,10 @@ def __init__(self, platform: str, logger: Logger):
if not platform:
raise ValueError("Platform must not be empty")

self.stdout = conn
self.stdin: PubSub = conn.pubsub(ignore_subscribe_messages = True)
self.conn = get_redis_connection()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here is moving the refernce from the library to this call to the get_redis_connection() method.

Fix variable name in _decide_retry_policy
@sgoggins sgoggins requested a review from MoralCode July 23, 2025 20:26
Ulincsys and others added 2 commits July 24, 2025 18:24
cache redis connection on first retrieval; this fixes our open database connection infinite growth issue! I will talk with @ulincys before putting this to release to make sure I understand *why* this work. I have a conceptual grasp of how Redis is the issue, but less of a grasp of why this particular fix works. Mostly, before we release I want to understand if there are any possible side effects of this patch we need to look for.
@sgoggins sgoggins merged commit 5e49264 into release Jul 26, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants