Skip to content

Conversation

@tvogel
Copy link

@tvogel tvogel commented Sep 25, 2025

(inode *Inode) sealDir() may try to lock the parent via

  • (parent *Inode) removeExpired(from string)
  • (parent *Inode) removeChildUnlocked(inode *Inode) at
	for _, dh := range parent.dir.handles {
		dh.mu.Lock()
		dh.lastInternalOffset = -1
		dh.mu.Unlock()
	}

So, unlock the parent up front in order to avoid a dead-lock.

Caveat: I am not deep enough into the code to say this won't create other problems.

`(inode *Inode) sealDir()` may try to lock the parent via
- `(parent *Inode) removeExpired(from string)`
- `(parent *Inode) removeChildUnlocked(inode *Inode)`
at
```
	for _, dh := range parent.dir.handles {
		dh.mu.Lock()
		dh.lastInternalOffset = -1
		dh.mu.Unlock()
	}
```

So, unlock the parent up front in order to avoid a dead-lock.

Caveat: I am not deep enough into the code to say this won't create other
problems.
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR addresses a critical deadlock issue in the directory listing functionality where the mount could lock up when listing buckets. The deadlock occurred because listObjectsFlat() held a DirHandle mutex while calling sealDir(), which eventually tried to acquire the same mutex through removeChildUnlocked().

Key changes:

  • Unlocks dh.mu before calling dh.inode.sealDir()
  • Re-locks dh.mu immediately after sealDir() completes
  • Prevents the circular locking dependency that caused mount lockups

The fix is minimal, targeted, and directly addresses the root cause identified in issue #37. While the author notes uncertainty about potential side effects, the change follows good locking practices by avoiding holding locks across function calls that may need the same locks.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it fixes a real deadlock issue
  • Score reflects a solid fix for a critical deadlock issue with minimal code changes. The solution is targeted and follows good concurrency practices, though the brief unlock/relock window could theoretically introduce race conditions that warrant careful testing.
  • No files require special attention - the change in core/dir.go is well-targeted and minimal

Important Files Changed

File Analysis

Filename        Score        Overview
core/dir.go 4/5 Fixes a potential deadlock in directory listing by unlocking DirHandle mutex before calling sealDir()

Sequence Diagram

sequenceDiagram
    participant ListThread as listObjectsFlat()
    participant DH as DirHandle
    participant Inode as Inode
    participant Parent as Parent Inode
    participant ChildDH as Child DirHandle

    Note over ListThread: BEFORE FIX - Deadlock scenario
    ListThread->>DH: dh.mu.Lock() (held)
    ListThread->>Inode: dh.inode.sealDir()
    Inode->>Inode: removeExpired("")
    Inode->>Parent: removeChildUnlocked(child)
    Parent->>ChildDH: dh.mu.Lock() (DEADLOCK!)
    Note over ChildDH: Cannot acquire - already held by ListThread

    Note over ListThread: AFTER FIX - Deadlock prevented  
    ListThread->>DH: dh.mu.Lock() (held)
    ListThread->>DH: dh.mu.Unlock() (unlock before sealDir)
    ListThread->>Inode: dh.inode.sealDir()
    Inode->>Inode: removeExpired("")
    Inode->>Parent: removeChildUnlocked(child) 
    Parent->>ChildDH: dh.mu.Lock() (SUCCESS - no conflict)
    Parent->>ChildDH: dh.lastInternalOffset = -1
    Parent->>ChildDH: dh.mu.Unlock()
    ListThread->>DH: dh.mu.Lock() (re-acquire after sealDir)
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@tvogel tvogel changed the title Potential fix for #37 Mount locks up on listing bucket fix: Potential fix for #37 Mount locks up on listing bucket Sep 25, 2025
dh.inode.dir.listMarker = lastName
}
} else {
dh.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your diagnosis is correct that it is a classic case of lock order violation. But there is a race window here in your fix. Between unlocking and relocking dh.mu, the directory handle state could potentially be modified by another thread. It would be a good idea to check where the dh state can change while unlocked.

@ovaistariq
Copy link
Contributor

Can you try the fix in #39 ?

@tvogel
Copy link
Author

tvogel commented Sep 27, 2025

Can you try the fix in #39 ?

I rebased your fix onto release (primarily because my reproducer still uses --no-tigris-prefetch) and can confirm, it fixes #37! Great!

As a side note, I tried that problematic unlock/relock workaround after having seen a similar pattern in the existing code which you might want to review:

parent.mu.Unlock()

parent.mu.Unlock()

@ovaistariq
Copy link
Contributor

Can you try the fix in #39 ?

I rebased your fix onto release (primarily because my reproducer still uses --no-tigris-prefetch) and can confirm, it fixes #37! Great!

As a side note, I tried that problematic unlock/relock workaround after having seen a similar pattern in the existing code which you might want to review:

parent.mu.Unlock()

parent.mu.Unlock()

Thanks. I will take a look at that too. Closing this PR.

@ovaistariq ovaistariq closed this Sep 27, 2025
@ovaistariq
Copy link
Contributor

Can you try the fix in #39 ?

I rebased your fix onto release (primarily because my reproducer still uses --no-tigris-prefetch) and can confirm, it fixes #37! Great!

As a side note, I tried that problematic unlock/relock workaround after having seen a similar pattern in the existing code which you might want to review:

parent.mu.Unlock()

parent.mu.Unlock()

Fixed it in #43

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants