
Feature/rate limit retry #54


Open · wants to merge 2 commits into main

Conversation

jiange91
Contributor

To work around API provider rate limits, we now queue all outgoing requests by model type and dispatch them with rate control and retry.

  1. Cognify will start a local server to buffer all litellm requests; the port can be set with -p or --rate_limit_port.
  2. The server maintains a per-model rate limit and completes requests accordingly.
  3. The rate limit is fetched automatically from the first response for each model.
  4. Added a simple heuristic to adjust the rate limit (see the sketch after this list):
    • On a rate limit error, halve the ticket generation speed.
    • Otherwise, for each successful response, increase the current rate limit by 1.
  5. Throttled requests are re-added to the end of the queue and retried.
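
A minimal sketch of the dispatcher and adjustment heuristic above (illustrative only: send is a hypothetical dispatch helper, and the 60 RPM default is just a placeholder until the provider's first response headers arrive):

import queue
import time

from litellm.exceptions import RateLimitError

rate_limit_pool = {}        # model name -> allowed requests per minute
task_queue = queue.Queue()  # (job_id, request) pairs awaiting dispatch

def dispatch_loop():
    # Pace ticket generation at the current per-model rate and adjust on feedback.
    while True:
        job_id, req = task_queue.get()
        rpm = rate_limit_pool.setdefault(req.model, 60)  # placeholder until headers arrive
        time.sleep(60.0 / max(rpm, 1))
        try:
            send(job_id, req)                # hypothetical dispatch helper
            rate_limit_pool[req.model] += 1  # additive increase on success
        except RateLimitError:
            rate_limit_pool[req.model] = max(rpm / 2, 1)  # halve on throttle
            task_queue.put((job_id, req))    # retry at the back of the queue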

    rate_limit_pool[req.model] += 1
except RateLimitError as e:
    # reduce rate limit by half and put to the back of the queue
    rate_limit_pool[req.model] /= 2


What about refactoring this into a backoff function so the user can control the strategy? "/2" may be too aggressive.
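
For example (a sketch; the hook and strategy names are hypothetical, not part of this PR):

rate_limit_pool = {}  # model name -> current requests/minute, as in the PR

def halve(limit: float) -> float:
    return max(limit / 2.0, 1.0)

def decay_10pct(limit: float) -> float:
    # A gentler alternative to halving.
    return max(limit * 0.9, 1.0)

def on_rate_limit(model: str, backoff=halve):
    # The strategy is a user-supplied callable instead of a hard-coded "/2".
    rate_limit_pool[model] = backoff(rate_limit_pool[model])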

    semaphore.release()

# --- Worker Thread Function ---
def worker(semaphore, task_queue):


worker is single-process, multi-threaded. I'm not sure if it will become a performance bottleneck. Can we write a short piece of code to test its performance limit?
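
Something like this micro-benchmark would give a rough upper bound (a sketch; it drains no-op tasks, so it measures queue and thread overhead only, not real network I/O, which is where the GIL matters least):

import queue
import threading
import time

def bench_workers(n_threads: int = 8, n_tasks: int = 100_000) -> float:
    # How many no-op tasks per second can a thread pool drain from one Queue?
    q = queue.Queue()
    for i in range(n_tasks):
        q.put(i)

    def worker():
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                return

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_tasks / (time.perf_counter() - start)

print(f"{bench_workers():,.0f} no-op tasks/sec")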

result = {"result": {**response.model_dump(), "_hidden_params": response._hidden_params, "_response_headers": response._response_headers}}
job_results[job_id] = result
# increase rate limit by 1
rate_limit_pool[req.model] += 1


This should also be refactored into a separate function that lets the user decide the strategy. In some cases we don't need to increase the rate limit.
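
A matching success hook could mirror the backoff suggestion above (hypothetical names; rate_limit_pool is the PR's existing global):

def additive_increase(limit: float) -> float:
    return limit + 1

def keep_fixed(limit: float) -> float:
    # Some deployments may prefer to stay at the header-reported limit.
    return limit

def on_success(model: str, grow=additive_increase):
    rate_limit_pool[model] = grow(rate_limit_pool[model])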

except RateLimitError as e:
    # reduce rate limit by half and put to the back of the queue
    rate_limit_pool[req.model] /= 2
    task_queue.put((job_id, req))


This assumes the requester won't time out. That is probably OK for Cognify, but not necessarily OK for a generic rate limiter. Can you add a comment?
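
One way to make the assumption explicit and bound retries (a sketch; the per-request deadline attribute is hypothetical):

import time

def handle_rate_limit(job_id, req, rate_limit_pool, task_queue, job_results):
    # NOTE: requeueing assumes the requester blocks indefinitely on its result;
    # there is no client-side timeout. Fine for Cognify's internal use, but a
    # generic limiter should bound retries.
    rate_limit_pool[req.model] = max(rate_limit_pool[req.model] / 2, 1)
    if time.monotonic() < getattr(req, "deadline", float("inf")):
        task_queue.put((job_id, req))  # retry at the back of the queue
    else:
        job_results[job_id] = {"error": "rate limit retries exhausted"}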

**model_kwargs
)

# response = completion(


If the client does not want the rate limiter (e.g., to ease debugging by avoiding another HTTP endpoint, or in replay mode), can they revert to the non-rate-limited functionality?
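
A sketch of one possible escape hatch (the env var and submit_to_rate_limit_server helper are illustrative, not part of this PR):

import os

from litellm import completion

USE_RATE_LIMITER = os.environ.get("COGNIFY_RATE_LIMIT", "1") == "1"

def complete(model, messages, **model_kwargs):
    if USE_RATE_LIMITER:
        # Route through the local buffering server (hypothetical helper).
        return submit_to_rate_limit_server(model, messages, **model_kwargs)
    # Direct litellm call: no extra HTTP endpoint, easier to debug or replay.
    return completion(model=model, messages=messages, **model_kwargs)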
