Fix race condition in GitRepository.pull_code() with file-based locking #21388

Open
devin-ai-integration[bot] wants to merge 8 commits into main from devin/1775076072-fix-pull-code-race-condition

Contributor

@devin-ai-integration devin-ai-integration bot commented Apr 1, 2026

Fixes a race condition in GitRepository.pull_code() where multiple concurrent flow runs sharing the same clone destination directory can race on the git_dir.exists() check-then-act logic, causing FileNotFoundError or corrupt clones. The shutil.rmtree() calls on failure paths can also delete a directory out from under a concurrent run.

Closes #11187

Changes

Adds dual-layer locking around the entire pull_code() method:

  • asyncio.Lock (per destination path) — coordinates concurrent async tasks within the same process without blocking the event loop
  • Internal FileLock (prefect/locking/_filelock.py) — coordinates across separate processes via a .lock file adjacent to the destination directory (e.g., repo.lock)

The internal FileLock uses lock file existence to indicate the lock is held (os.open(O_CREAT | O_EXCL) to acquire, Path.unlink() to release), following the same pattern as FileSystemLockManager. No OS-specific locking primitives are required — it works on any platform with basic filesystem operations.
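To make the existence-as-lock idea concrete, here is a rough sketch of a single non-blocking attempt and the matching release. It is an illustration only: `try_acquire` and `release` are hypothetical helper names, not the methods of the internal FileLock.

```python
import os
from pathlib import Path


def try_acquire(lock_path: Path) -> bool:
    """Try once to create the lock file; True means this caller holds the lock."""
    try:
        # O_CREAT | O_EXCL raises FileExistsError if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release(lock_path: Path) -> None:
    """Release the lock by deleting the lock file."""
    lock_path.unlink(missing_ok=True)
```

A blocking acquire would call `try_acquire` in a loop with a sleep and a deadline. The lock file outlives a crashed holder, which is why some form of stale-lock handling is needed on top of this.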

Stale lock recovery: The owning process's PID is written to the lock file on acquisition. When contention is detected (FileExistsError), the lock file's PID is read and checked via os.kill(pid, 0). If the owning process is dead, the stale lock is removed and acquisition retries immediately — no need to wait for the 300s timeout.
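The stale-lock check described here might look roughly like the sketch below. It assumes POSIX semantics for os.kill(pid, 0), and the helper names `pid_is_alive` and `remove_if_stale` are hypothetical rather than taken from the PR.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(lock_path: Path) -> bool:
    """Delete the lock file if the PID recorded in it is dead; True if deleted."""
    try:
        owner = int(lock_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no readable owner: leave recovery to the timeout
    if pid_is_alive(owner):
        return False
    lock_path.unlink(missing_ok=True)
    return True
```

A read-then-unlink sequence like this is itself racy when two waiters detect the same stale lock at once; the sketch does not attempt to handle that case.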

The existing pull_code body is extracted into _pull_code_locked() with no logic changes.

Key review items

  • PID reuse edge case: If the OS reuses a crashed process's PID before another process checks staleness, the stale lock won't be cleaned up via PID check. The 300s timeout remains as a fallback for this narrow window.
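  • os.kill(pid, 0) cross-platform behavior: Used for PID liveness checks. This needs verification on Windows: per the Python documentation, os.kill() there terminates the target via TerminateProcess for any signal other than CTRL_C_EVENT or CTRL_BREAK_EVENT, so os.kill(pid, 0) may kill the lock owner rather than probe it.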
  • Atomicity of os.open(O_CREAT | O_EXCL): Atomic on POSIX local filesystems. Confirm this holds on Windows and on the network filesystems relevant to users (older NFS setups are a known weak spot for O_EXCL).
  • _pull_code_locks module-level dict (storage.py): Maps destination paths to asyncio.Lock instances. Entries are never evicted — this is fine for typical usage (small number of distinct repos), but worth confirming that assumption.
  • aacquire() uses polling (asyncio.sleep(0.1) between lock attempts). This avoids blocking the event loop during cross-process contention, but a waiter can notice a released lock up to 0.1s late and keeps waking up while it waits.
  • Code duplication between acquire() and aacquire() — they differ only in time.sleep vs asyncio.sleep. Kept separate for clarity rather than introducing a callback abstraction.
  • Cross-process locking is only tested via mocks in CI. Actual file-existence locking behavior is not exercised by the test suite.
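For reviewers weighing the polling trade-off, the async acquisition loop can be sketched as follows. This is an approximation built from the description above, not the PR's implementation: the standalone `aacquire` function and its `timeout` and `poll_interval` parameters are assumptions, with defaults taken from the 300s and 0.1s figures mentioned in this description.

```python
import asyncio
import os
import time
from pathlib import Path


async def aacquire(
    lock_path: Path, timeout: float = 300.0, poll_interval: float = 0.1
) -> None:
    """Poll for the lock file without blocking the event loop."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Timed out waiting for {lock_path}")
            # Yield to other tasks instead of sleeping the whole thread.
            await asyncio.sleep(poll_interval)
        else:
            os.close(fd)
            return
```

A synchronous acquire() would have the same shape with time.sleep in place of asyncio.sleep, which is the duplication noted in the list above.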

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/1e3a05b708534a449af4acc4a2d76cc1
Requested by: @desertaxle

Add file-based locking around GitRepository.pull_code() to prevent race
conditions when multiple concurrent flow runs use the same git repository.

Uses asyncio.Lock for in-process async coordination between concurrent
tasks and FileLock for cross-process coordination. The lock file is
created adjacent to the destination directory (e.g., dest.lock).

Closes #11187

Co-authored-by: alex.s <alex.s@prefect.io>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions github-actions bot added the arch:windows (Related to the Windows OS) and bug (Something isn't working) labels Apr 1, 2026

filelock is a transitive dependency not available in the prefect-client
package. Fall back to asyncio.Lock only when filelock is not installed.

Co-Authored-By: alex.s <ajstreed1@gmail.com>

codspeed-hq bot commented Apr 1, 2026

Merging this PR will not alter performance

✅ 2 untouched benchmarks


Comparing devin/1775076072-fix-pull-code-race-condition (2603ced) with main (9e5a66d)

devin-ai-integration[bot]

This comment was marked as resolved.

- Create prefect/locking/_filelock.py with cross-platform file lock using
  OS-level locking (fcntl.flock on Unix, msvcrt.locking on Windows)
- Use async-aware aacquire() method that polls with asyncio.sleep() to
  avoid blocking the event loop during cross-process lock contention
- Fix lock path derivation: use parent/(name + '.lock') instead of
  with_suffix('.lock') which incorrectly replaces existing suffixes
- Remove filelock transitive dependency usage entirely
- Update tests to work with new internal FileLock
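The lock path fix in this commit is easy to see in a small example. The destination path below is made up for illustration, and PurePosixPath is used only so the result is the same on every platform.

```python
from pathlib import PurePosixPath

# Hypothetical clone destination whose directory name contains a dot.
dest = PurePosixPath("/clones/my.repo")

# with_suffix() replaces the last suffix, so ".repo" is lost.
replaced = dest.with_suffix(".lock")

# Appending to the full name keeps the directory name intact.
appended = dest.parent / (dest.name + ".lock")
```

Here `replaced` is `/clones/my.lock` and `appended` is `/clones/my.repo.lock`. With the first form, two destinations such as `my.repo` and `my.other` would end up sharing one lock file.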

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 2 new potential issues.

View 7 additional findings in Devin Review.

devin-ai-integration bot and others added 2 commits April 2, 2026 15:16
- Wrap acquire/aacquire polling loops in try/except BaseException to
  close the fd on CancelledError or any other unexpected exception
- Wrap _unlock_fd in release() with try/finally to ensure os.close()
  runs even if unlock raises
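The cleanup pattern in this commit can be sketched as below. It is an illustration only: `open_locked` and `release` are hypothetical names, and the `try_lock` and `unlock` callables stand in for whatever OS-level locking call the module used at this point in the branch.

```python
import os
from typing import Callable


def open_locked(path: str, try_lock: Callable[[int], None]) -> int:
    """Open the lock file and lock it, closing the fd if locking fails."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        try_lock(fd)
    except BaseException:
        # BaseException also covers CancelledError and KeyboardInterrupt.
        os.close(fd)
        raise
    return fd


def release(fd: int, unlock: Callable[[int], None]) -> None:
    """Unlock, and close the fd even if unlocking raises."""
    try:
        unlock(fd)
    finally:
        os.close(fd)
```

Catching BaseException rather than Exception matters here because asyncio.CancelledError has derived from BaseException since Python 3.8, so a narrower except clause would leak the descriptor when a waiting task is cancelled.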

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@desertaxle desertaxle marked this pull request as ready for review April 2, 2026 19:00

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13e3555258

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

devin-ai-integration bot and others added 2 commits April 2, 2026 19:40
- Handle ImportError for fcntl/msvcrt so _filelock.py loads on any OS
- FileLock.acquire/aacquire silently no-op when locking is unavailable
- pull_code() catches lock acquisition failures at runtime and falls
  back to asyncio.Lock only, logging a debug message

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Rewrites _filelock.py to use Path.touch(exist_ok=False) / Path.unlink()
instead of fcntl/msvcrt, following the same pattern as
FileSystemLockManager. No OS-specific imports needed — works on any
platform that supports basic filesystem operations.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@desertaxle
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a86567757

Write the owning process's PID to the lock file on acquisition. When a
lock file already exists, read the PID and check if the process is still
alive via os.kill(pid, 0). If the process is dead, remove the stale lock
and retry immediately — no need to wait for timeout.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Member

Can you add tests for this module?

Labels

arch:windows (Related to the Windows OS), bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

can’t run multiple flows stored remotely concurrently on windows

1 participant