Add max_upstream_conn parameter for each proxy_cache project #22348

Merged
stonezdj merged 1 commit into goharbor:main from stonezdj:25aug26_rate_limit_proxycache_upstream on Sep 30, 2025

Conversation

@stonezdj (Contributor) commented Sep 12, 2025

Limit the proxy connection to the upstream registry

fixes #22184

Thank you for contributing to Harbor!

Comprehensive Summary of your change

Issue being fixed

Fixes #(issue)

Please indicate you've done the following:

  • Well Written Title and Summary of the PR
  • Label the PR as needed. "release-note/ignore-for-release, release-note/new-feature, release-note/update, release-note/enhancement, release-note/community, release-note/breaking-change, release-note/docs, release-note/infra, release-note/deprecation"
  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Made sure tests are passing and test coverage is added if needed.
  • Considered the docs impact and opened a new docs issue or PR with docs changes if needed in website repository.
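
For reviewers skimming the change, here is a minimal sketch of the idea, assuming the max_upstream_conn value arrives as a project-metadata string and that a buffered channel serves as a counting semaphore. The names (NewLimiter, Acquire, ErrTooManyConnections) and the parsing are illustrative assumptions, not necessarily the code in src/pkg/proxy/connection/limit.go:

```go
package connection

import (
	"errors"
	"strconv"
)

// ErrTooManyConnections tells the caller the per-project cap on concurrent
// upstream connections is reached; the proxy middleware maps it to HTTP 429.
var ErrTooManyConnections = errors.New("too many connections to the upstream registry")

// limiter caps concurrent upstream connections with a buffered channel
// used as a counting semaphore.
type limiter struct {
	slots chan struct{}
}

// NewLimiter builds a limiter from the project's max_upstream_conn metadata
// value; a missing or non-positive value means "unlimited".
func NewLimiter(metaValue string) *limiter {
	n, err := strconv.Atoi(metaValue)
	if err != nil || n <= 0 {
		return &limiter{} // nil slots channel => no limit applied
	}
	return &limiter{slots: make(chan struct{}, n)}
}

// Acquire reserves a slot without blocking; it fails fast when the project
// is already at its cap instead of queuing the request.
func (l *limiter) Acquire() error {
	if l.slots == nil {
		return nil
	}
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrTooManyConnections
	}
}

// Release frees a previously acquired slot.
func (l *limiter) Release() {
	if l.slots != nil {
		<-l.slots
	}
}
```

The important design choice is the non-blocking select: a request that cannot get a slot is rejected immediately rather than parked on an open connection.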

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 17.64706% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.87%. Comparing base (c8c11b4) to head (e430ebc).
⚠️ Report is 573 commits behind head on main.

Files with missing lines | Patch % | Lines
src/server/middleware/repoproxy/proxy.go | 0.00% | 32 Missing ⚠️
src/pkg/project/models/project.go | 0.00% | 11 Missing ⚠️
src/pkg/proxy/connection/limit.go | 46.15% | 5 Missing and 2 partials ⚠️
src/server/v2.0/handler/project.go | 0.00% | 6 Missing ⚠️
Additional details and impacted files


@@             Coverage Diff             @@
##             main   #22348       +/-   ##
===========================================
+ Coverage   45.36%   65.87%   +20.50%     
===========================================
  Files         244     1073      +829     
  Lines       13333   116018   +102685     
  Branches     2719     2927      +208     
===========================================
+ Hits         6049    76427    +70378     
- Misses       6983    35354    +28371     
- Partials      301     4237     +3936     
Flag | Coverage Δ
unittests | 65.87% <17.64%> (+20.50%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
src/server/v2.0/handler/project_metadata.go | 14.28% <100.00%> (ø)
src/server/v2.0/handler/project.go | 4.87% <0.00%> (ø)
src/pkg/proxy/connection/limit.go | 46.15% <46.15%> (ø)
src/pkg/project/models/project.go | 37.25% <0.00%> (ø)
src/server/middleware/repoproxy/proxy.go | 3.71% <0.00%> (ø)

... and 982 files with indirect coverage changes


@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch 2 times, most recently from 59dd2bf to 38e3056 on September 12, 2025 08:54
@stonezdj added the release-note/enhancement label on Sep 12, 2025
Comment thread api/v2.0/swagger.yaml Outdated
@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch 2 times, most recently from 9077c61 to 8879689 on September 15, 2025 02:52
@Vad1mo (Member) commented Sep 17, 2025

@Strainy does this replace your PR #22185?

@Strainy commented Sep 17, 2025

> @Strainy does this replace your PR #22185?

I've had a quick scan through, and it seems like the client will experience 429s when the maximum connection limit is reached? This is different from my change (but perhaps complementary?): in my change, requests for the same manifests/blobs queue behind a single request and return the result from the cache when it completes (i.e., no request failures when there is contention for the same resources).

@stonezdj (Contributor, Author) commented Sep 18, 2025

> @Strainy does this replace your PR #22185?

> I've had a quick scan through, and it seems like the client will experience 429s when the maximum connection limit is reached? This is different from my change (but perhaps complementary?): in my change, requests for the same manifests/blobs queue behind a single request and return the result from the cache when it completes (i.e., no request failures when there is contention for the same resources).

Queuing all subsequent requests would consume all TCP connections on the Harbor server: if there are 500 requests, only 1 can proceed and the other 499 will wait; if the first takes 30 minutes, the others wait 30 minutes as well. Meanwhile, other connections might fail, because TCP connections are a limited resource. Returning 429 is better than blocking the request.

@Strainy commented Sep 18, 2025

> @Strainy does this replace your PR #22185?

> I've had a quick scan through, and it seems like the client will experience 429s when the maximum connection limit is reached? This is different from my change (but perhaps complementary?): in my change, requests for the same manifests/blobs queue behind a single request and return the result from the cache when it completes (i.e., no request failures when there is contention for the same resources).

> Queuing all subsequent requests would consume all TCP connections on the Harbor server: if there are 500 requests, only 1 can proceed and the other 499 will wait; if the first takes 30 minutes, the others wait 30 minutes as well. Meanwhile, other connections might fail, because TCP connections are a limited resource. Returning 429 is better than blocking the request.

I am in full agreement that returning a 429 is better than blocking the request, which is why I feel this change is complementary to mine. This change doesn't address the upstream request deduplication issue that I was originally trying to solve on my branch.

@Strainy commented Sep 19, 2025

To add a bit more context: I have a use case where we are very sensitive to boot times, particularly for new images that need to be pulled from upstream. We also have a very large number of pods often concurrently hitting the registry for the same artifact.

So I would just like to make sure we're efficient about retrieving resources from the upstream. That goes further than just rate limiting, imo: we should aim to de-duplicate concurrent requests where possible, which is the motivation for my change.

This approach has been working well for us. We're using our fork of Harbor as a DC-local cache for images in Google Artifact Registry. We routinely handle spikes of >1K concurrent pulls with this approach.

But I think if we were to just use this rate limiting approach, we'd have quite a lot of pods just spinning in CrashLoopBackOff... and that'd be a bad time for us.
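
For readers unfamiliar with the de-duplication Strainy describes, the sketch below shows the general coalescing pattern using golang.org/x/sync/singleflight: concurrent requests for the same digest share a single upstream fetch. This is only an illustration of the concept, with a made-up key and fetch function; it is not taken from PR #22185.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchFromUpstream stands in for one slow pull of a blob from the upstream
// registry; with coalescing it runs once per key, however many clients ask.
func fetchFromUpstream(digest string) (any, error) {
	time.Sleep(100 * time.Millisecond)
	return "blob bytes for " + digest, nil
}

func main() {
	var wg sync.WaitGroup
	var sharedCount atomic.Int64

	// 50 concurrent "clients" request the same blob; singleflight collapses
	// them into a single upstream fetch and hands the result to everyone.
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _, shared := group.Do("sha256:abc", func() (any, error) {
				return fetchFromUpstream("sha256:abc")
			})
			if shared {
				sharedCount.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d of 50 requests reused another request's upstream fetch\n", sharedCount.Load())
}
```

Coalescing and a connection cap can compose: the cap bounds how many distinct upstream fetches run at once, while coalescing keeps identical pulls from consuming the cap more than once, which is one way the two PRs could be complementary.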

Comment thread src/pkg/proxy/connection/limit.go Outdated
@stonezdj (Contributor, Author) commented:

> To add a bit more context: I have a use case where we are very sensitive to boot times, particularly for new images that need to be pulled from upstream. We also have a very large number of pods often concurrently hitting the registry for the same artifact.

> So I would just like to make sure we're efficient about retrieving resources from the upstream. That goes further than just rate limiting, imo: we should aim to de-duplicate concurrent requests where possible, which is the motivation for my change.

> This approach has been working well for us. We're using our fork of Harbor as a DC-local cache for images in Google Artifact Registry. We routinely handle spikes of >1K concurrent pulls with this approach.

> But I think if we were to just use this rate limiting approach, we'd have quite a lot of pods just spinning in CrashLoopBackOff... and that'd be a bad time for us.

The event is ImagePullBackOff when the image pull fails, and the kubelet will retry the pull with a backoff policy.
I prefer to return 429 immediately rather than hang the connection on the server side. If 500 requests come in to pull the same image at the same time, only 1 is served, but it may take longer than expected for the cache to become ready; during this interval the other 499 connections are left hanging. Because the maximum number of connections to a server is a limited resource, it is possible that other clients cannot connect to Harbor and get any response at all. That is not a good user experience.
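
To make the 429-vs-hang trade-off concrete, here is a rough sketch of how a repoproxy-style middleware could answer immediately when the limiter is exhausted, so clients back off and retry instead of holding TCP connections open for the whole pull. The Limiter interface and LimitUpstream wrapper are hypothetical names for illustration, not the PR's actual code:

```go
package repoproxy

import "net/http"

// Limiter is the minimal contract the middleware needs: a fail-fast
// reservation that never blocks the request goroutine.
type Limiter interface {
	Acquire() error
	Release()
}

// LimitUpstream wraps a proxy handler so that, once the project's
// max_upstream_conn cap is hit, additional pulls receive an immediate 429
// (which pullers retry with backoff, as noted above) instead of hanging on
// an open connection for the duration of the first pull.
func LimitUpstream(l Limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if err := l.Acquire(); err != nil {
			http.Error(w, "too many connections to the upstream registry, retry later",
				http.StatusTooManyRequests)
			return
		}
		defer l.Release()
		next.ServeHTTP(w, r)
	})
}
```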

Comment thread src/pkg/project/models/pro_meta.go
Comment thread src/server/middleware/repoproxy/proxy.go Outdated
Comment thread src/server/middleware/repoproxy/proxy.go
Comment thread src/server/middleware/repoproxy/proxy.go
Comment thread src/pkg/proxy/connection/limit.go
Comment thread src/pkg/proxy/connection/limit.go
Comment thread src/server/middleware/repoproxy/proxy.go Outdated
Comment thread src/server/middleware/repoproxy/proxy.go Outdated
@wy65701436 (Contributor) left a comment

lgtm

Comment thread api/v2.0/swagger.yaml
Comment thread src/server/middleware/repoproxy/proxy.go Outdated
@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch 5 times, most recently from 9fc124a to 1fb5176 on September 28, 2025 05:28
Comment thread src/pkg/project/models/project.go
Comment thread src/pkg/proxy/connection/limit.go Outdated
@reasonerjt (Contributor) commented:

Could you make Limiter an interface and add some test cases?
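
A sketch of what such a test might look like, assuming the fail-fast Acquire/Release semantics and the hypothetical NewLimiter constructor from the sketch earlier in this thread (not the actual test added in the PR):

```go
package connection

import "testing"

// TestLimiterAcquireRelease exercises the fail-fast semantics: with a cap
// of 2, the third Acquire must fail until a slot is released again.
func TestLimiterAcquireRelease(t *testing.T) {
	l := NewLimiter("2") // hypothetical constructor; cap of 2 concurrent connections

	if err := l.Acquire(); err != nil {
		t.Fatalf("first acquire should succeed, got %v", err)
	}
	if err := l.Acquire(); err != nil {
		t.Fatalf("second acquire should succeed, got %v", err)
	}
	if err := l.Acquire(); err == nil {
		t.Fatal("third acquire should fail once the cap of 2 is reached")
	}

	l.Release()
	if err := l.Acquire(); err != nil {
		t.Fatalf("acquire after release should succeed, got %v", err)
	}
}
```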

@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch from 0a096a6 to 9632140 on September 30, 2025 06:47
@stonezdj enabled auto-merge (squash) on September 30, 2025 06:48
@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch 3 times, most recently from c9b1ad7 to 7bfbf41 on September 30, 2025 07:01
  limit the proxy connection to upstream registry

Signed-off-by: stonezdj <stonezdj@gmail.com>
@stonezdj force-pushed the 25aug26_rate_limit_proxycache_upstream branch from 7bfbf41 to e430ebc on September 30, 2025 09:49
@stonezdj merged commit c004f2d into goharbor:main on Sep 30, 2025
12 checks passed
stonezdj added a commit to stonezdj/harbor that referenced this pull request Sep 30, 2025
…r#22348)

limit the proxy connection to upstream registry

Signed-off-by: stonezdj <stonezdj@gmail.com>
stonezdj added a commit that referenced this pull request Oct 9, 2025
…#22409)

limit the proxy connection to upstream registry

Signed-off-by: stonezdj <stonezdj@gmail.com>
OrlinVasilev pushed a commit to OrlinVasilev/harbor that referenced this pull request Oct 29, 2025
…r#22348)

limit the proxy connection to upstream registry

Signed-off-by: stonezdj <stonezdj@gmail.com>

Labels

release-note/enhancement (Label to mark PR to be added under release notes as enhancement)
target/2.14.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Coalesce upstream requests where possible

8 participants