feat: simultaneous inference #6220
base: dev
Conversation
Caution
Changes requested ❌
Reviewed everything up to c10c6d3 in 2 minutes and 21 seconds.
- Reviewed 6892 lines of code in 22 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments; view those below
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. web-app/src/hooks/useAppState.ts:177
Draft comment: There is no implementation for a method called setThreadQueuedMessage, yet migration tests and some new APIs call it. Consider adding it or updating tests to use the existing addToThreadQueue method.
Reason this comment was not posted: comment was on unchanged code.
2. web-app/src/hooks/useInferenceScheduler.ts:43
Draft comment: Consider documenting the expectation for single-thread fallback mode and future extension for parallel scheduling. The logic in determining threadsToStart might need clarity for future developers.
Reason this comment was not posted: confidence changes required (50%) <= threshold (50%).
3. web-app/src/hooks/useChat.ts:245
Draft comment: After sending a message, the code calls clearThreadPrompt by retrieving it from the global store. Make sure that this approach is safe and does not interfere with other thread prompt states.
Reason this comment was not posted: confidence changes required (50%) <= threshold (50%).
4. web-app/src/test/simultaneous-inference-utils.ts:142
Draft comment: Utility functions and mocks look comprehensive; consider adding comments for edge-case behavior in createMockSendMessage (e.g., behavior on failureRate conditions) to aid future maintenance.
Reason this comment was not posted: confidence changes required (30%) <= threshold (50%).
5. web-app/src/routes/__root.tsx:49
Draft comment: The integration of useAutoScheduler in the Root layout is clear. Ensure that the auto-scheduler's side effects do not conflict with other router/panel state changes.
Reason this comment was not posted: confidence changes required (30%) <= threshold (50%).
6. web-app/src/containers/ChatInput.tsx:117
Draft comment: Typo detected: the function name 'handleSendMesage' appears to be misspelled. Consider renaming it to 'handleSendMessage' for clarity and consistency.
Reason this comment was not posted: comment was on unchanged code.
Workflow ID: wflow_fP7TWoM9zQ6k1K60
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
🔧 Concurrency Activation Implementation Guide

Current Status

The simultaneous inference MVP is implemented with infrastructure ready for parallel processing, but it currently operates in single-thread mode. To enable actual concurrent processing when llama.cpp supports it, follow these steps:

Step 1: Add Parallel Setting to llamacpp-extension

File:

```json
{
"key": "parallel",
"title": "Parallel Processing",
"description": "Number of parallel inference requests the model can handle simultaneously. Set to 1 for single-thread mode, higher values for concurrent processing.",
"controllerType": "input",
"controllerProps": {
"value": 1,
"placeholder": "1",
"type": "number",
"min": 1,
"max": 8,
"step": 1,
"textAlign": "right"
}
}
```

Step 2: Update
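As a rough illustration of how the Step 1 setting could reach llama.cpp at launch time: llama.cpp's server exposes a --parallel (-np) flag for the number of concurrent slots. This is a minimal sketch under those assumptions, not the extension's actual code; getSetting and spawnLlamaServer are hypothetical helpers standing in for the extension's real plumbing.

```typescript
// Hypothetical helpers — the real extension wiring will differ.
declare function getSetting(key: string): string | undefined
declare function spawnLlamaServer(args: string[]): Promise<void>

export async function startServer(modelPath: string): Promise<void> {
  // Read the "parallel" setting added in Step 1, defaulting to single-thread mode.
  const parallel = Math.max(1, Number(getSetting('parallel') ?? '1'))
  await spawnLlamaServer([
    '--model', modelPath,
    '--parallel', String(parallel), // > 1 enables concurrent inference slots
  ])
}
```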
Expected behaviors (MVP):
Screen.Recording.2025-08-18.at.11.57.07.PM.mov

When parallel support for llama.cpp lands, inference can be handled concurrently using the setup guide above. Our implementation will need to expand to handle more race conditions (see the Appendix), but this should be a good stopping point for a code review.

Appendix

❌ Race Conditions That WILL Break with Concurrency > 1:

1. Global State Conflicts

```
// When 3 threads process simultaneously:
Thread A: updateStreamingContent(contentA) // ← Overwrites global
Thread B: updateStreamingContent(contentB) // ← Overwrites Thread A
Thread C: updateStreamingContent(contentC) // ← Overwrites Thread B
// Result: UI only shows Thread C's content, A & B lost
```

2. Shared Resource Conflicts

```
// Multiple threads calling:
updateTokenSpeed(messageA) // ← Global token calculation
updateTokenSpeed(messageB) // ← Overwrites Thread A's speed
updateTokenSpeed(messageC) // ← Overwrites Thread B's speed
// Result: Token speed calculation is corrupted
```

3. React State Batching Issues

```
// Rapid concurrent state updates:
setThreadProcessing("A", false) // ← Batched
setThreadProcessing("B", false) // ← Batched
setThreadProcessing("C", false) // ← Batched
// React batches these updates → scheduler sees stale state
// Could trigger multiple schedule() calls simultaneously
```

4. AbortController Conflicts

```
// Global abort handling:
setAbortController(threadId, controller) // ← Per-thread (OK)
// But if the UI shows global streamingContent, abort might affect the wrong thread
```
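One way to picture the fix for conflicts 1 and 2 is to key the state by thread ID so each thread writes only its own slot. A minimal sketch assuming a Zustand store; the store shape and names are illustrative, not the PR's actual implementation:

```typescript
import { create } from 'zustand'

interface PerThreadStreamingState {
  streamingContentByThread: Record<string, string>
  updateStreamingContent: (threadId: string, content: string) => void
  clearStreamingContent: (threadId: string) => void
}

export const usePerThreadStreaming = create<PerThreadStreamingState>((set) => ({
  streamingContentByThread: {},
  updateStreamingContent: (threadId, content) =>
    set((state) => ({
      streamingContentByThread: {
        ...state.streamingContentByThread,
        [threadId]: content, // each thread writes only its own slot
      },
    })),
  clearStreamingContent: (threadId) =>
    set((state) => {
      const next = { ...state.streamingContentByThread }
      delete next[threadId] // drop a finished thread without touching others
      return { streamingContentByThread: next }
    }),
}))
```

With this shape, concurrent writes from threads A, B, and C land in separate keys, so the UI can render each thread's stream independently instead of showing only the last writer.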
🎯 HONEST ASSESSMENT: Current State

✅ What WORKS with Concurrency:

❌ What BREAKS with Concurrency:
What We Actually Built:
What Would Happen if We Set maxConcurrency = 3:

Thread A, B, C all start processing simultaneously
→ Global streamingContent gets overwritten by each thread
→ UI shows garbled mix of responses from different threads
→ Token speed calculations become meaningless
→ User sees a broken, confusing interface

🛠️ To Actually Support Concurrency, We Need:
📊 LOC Estimate for True Concurrency Support:
Total: ~120-190 additional lines

🎯 Recommendation:

Our current MVP is perfect as-is because:
When llama.cpp adds parallel support:
Hi @hermit46, can you help us rebase to resolve the conflict?
… thread change/unmount
- Create useStreamingContent, useThreadError, useTokenSpeed hooks
- Add useQueuedMessages and useThreadQueueLength for queue management
- Implement useIsThreadActive and useThreadState for comprehensive thread info
- Provide useActiveThreads to get all threads with active state
- Use shallow comparison for performance optimization in useThreadState

- Replace global setQueuedMessage with thread-aware removeFromThreadQueue
- Update message processing to handle per-thread message queues
- Maintain existing functionality while supporting multiple queued messages per thread
- Remove dependency on legacy global queue state

- Test multi-message queue per thread functionality
- Verify thread isolation and queue persistence across thread switches
- Test queue management operations (add, remove, clear)
- Validate FIFO processing order and edge case handling
- Test integration with convenience hooks for queue management
- Ensure performance with large queues and rapid operations

- Test useStreamingContent, useThreadError, useTokenSpeed hooks
- Validate useQueuedMessages and useThreadQueueLength functionality
- Test useIsThreadActive and useThreadState comprehensive state access
- Verify useActiveThreads returns correct active thread list
- Ensure hooks properly integrate with useAppState per-thread system

- Test all new per-thread state management methods
- Validate thread isolation and state separation
- Test streaming content, token speed, and error handling per thread
- Verify queue operations (add, remove, clear, length) work correctly
- Test thread cleanup and bulk operations
- Ensure getAllActiveThreads returns correct active thread list

- Test that legacy global state methods continue to work unchanged
- Verify new per-thread methods don't interfere with existing functionality
- Test state transitions between legacy and new systems
- Ensure backward compatibility is maintained for existing components
- Validate that both systems can coexist during the transition period

- Test interaction between useAppState and useThreadState hooks
- Validate thread state convenience hooks work with app state
- Test end-to-end thread management workflows
- Verify proper state synchronization across different hooks
- Ensure consistent behavior when using multiple thread-aware hooks together

- Test that legacy and new methods are properly separated
- Validate method organization and grouping structure
- Test that related functionality is logically grouped
- Ensure clear separation between backward compatibility and new features
- Verify method signatures and behavior match their intended purpose
…mance
- Replace useShallow approach with individual hooks for optimal performance
- Each hook only re-renders when its specific value changes
- Eliminates unnecessary object creation on every render
- Maintains useThreadState for backward compatibility when all properties are needed
- Add comprehensive performance tests demonstrating the optimization benefits
- Individual hooks provide stable references and better memory efficiency
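To illustrate the pattern this commit describes — one hook per value instead of one object-returning hook — here is a hedged sketch. The selector-style useAppState call, the import path, and the field names are assumptions, not the PR's exact code:

```typescript
import { useAppState } from '@/hooks/useAppState' // hypothetical import path

// Each hook selects a single primitive, so a component re-renders only when
// that value changes and no new object is allocated on every render.
export const useStreamingContent = (threadId: string): string =>
  useAppState((s) => s.streamingContentByThread[threadId] ?? '')

export const useThreadQueueLength = (threadId: string): number =>
  useAppState((s) => s.queuedMessagesByThread[threadId]?.length ?? 0)
```

By contrast, a hook that returns `{ content, queueLength, ... }` builds a fresh object per render, which defeats reference-equality checks unless shallow comparison is applied.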
- Test object creation frequency between individual hooks and useThreadState
- Verify reference stability and memory efficiency improvements
- Demonstrate when individual hooks provide better performance
- Test real-world usage patterns and frequent re-render scenarios
- Provide performance recommendations for optimal hook usage

- Benchmark individual hooks vs useThreadState performance characteristics
- Test memory efficiency with 1000+ re-render scenarios
- Demonstrate real-world chat component performance patterns
- Show object creation frequency and reference stability differences
- Provide concrete performance recommendations for developers

- Add queuedMessagesByThread state for per-thread message queues
- Implement addToThreadQueue(), removeFromThreadQueue(), getThreadQueueLength()
- Add thread processing state management (processingThreads)
- Remove obsolete useThreadState.ts functionality (consolidated into useAppState)
- Add maxConcurrency and fallbackMode configuration

Provides the foundation for simultaneous inference with thread-safe queue operations.
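For concreteness, the queue operations this commit names might look like the following pure-function sketch; the QueuedMessage type and map shape are illustrative assumptions:

```typescript
interface QueuedMessage { id: string; content: string }
type QueueMap = Record<string, QueuedMessage[]>

// Append a message to one thread's queue without touching other threads.
export function addToThreadQueue(queues: QueueMap, threadId: string, msg: QueuedMessage): QueueMap {
  return { ...queues, [threadId]: [...(queues[threadId] ?? []), msg] }
}

// Remove the oldest message (FIFO order) from one thread's queue.
export function removeFromThreadQueue(queues: QueueMap, threadId: string): QueueMap {
  const [, ...rest] = queues[threadId] ?? []
  return { ...queues, [threadId]: rest }
}

export function getThreadQueueLength(queues: QueueMap, threadId: string): number {
  return queues[threadId]?.length ?? 0
}
```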
- Implement useInferenceScheduler hook for thread processing coordination
- Add useAutoScheduler for automatic queue processing triggers
- Support both single-thread fallback and future parallel modes
- Include race condition prevention and proper thread locking
- Add scheduling status monitoring and debugging capabilities

Ready for the llama.cpp parallel flag when available; currently uses fallback mode.
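The threadsToStart decision flagged in draft comment 2 can be pictured as follows — a hedged sketch of the scheduling logic, with illustrative names rather than the hook's actual API:

```typescript
// Pick threads that have queued work and are not already processing,
// up to the remaining capacity allowed by maxConcurrency.
export function pickThreadsToStart(
  queues: Record<string, unknown[]>,
  processingThreads: ReadonlySet<string>,
  maxConcurrency: number // 1 = single-thread fallback mode
): string[] {
  const capacity = maxConcurrency - processingThreads.size
  if (capacity <= 0) return []
  return Object.keys(queues)
    .filter((id) => (queues[id]?.length ?? 0) > 0 && !processingThreads.has(id))
    .slice(0, capacity)
}
```

With maxConcurrency = 1 this degenerates to the current fallback behavior: at most one thread runs, and the next queued thread starts only after it finishes.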
Force-pushed from bc2fe7b to f610a66
Describe Your Changes
This PR implements a complete simultaneous inference system that enables users to queue multiple messages while AI models are processing, dramatically improving user experience and workflow efficiency.
🚀 Key Features:
Core Infrastructure:
User Experience Improvements:
Performance & Testing:
🔧 Technical Implementation:
📈 Impact:
Fixes Issues
Self Checklist
📊 Code Statistics
Overall Changes:
Core Functionality:
Test Coverage:
Important
Introduces a simultaneous inference system with per-thread state management, queue handling, and scheduling, including extensive testing for backward compatibility and performance.
- Per-thread state management in useAppState for prompts, queued messages, and errors.
- useInferenceScheduler coordinates thread processing with race condition prevention.
- useAutoScheduler automatically triggers scheduling on state changes (see the sketch below).
- Tests for useAppState and useInferenceScheduler.
- Updates useChat to integrate with the new scheduling system.
- Adds simultaneous-inference-utils.ts for testing.

This description was created by Ellipsis for c10c6d3. You can customize this summary. It will automatically update as commits are pushed.
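As a rough picture of the auto-scheduling behavior summarized above: the hook can simply re-run the scheduler whenever queue or processing state changes. The parameters and wiring here are assumptions, not the PR's actual useAutoScheduler signature:

```typescript
import { useEffect } from 'react'

export function useAutoScheduler(
  queueVersion: number,    // changes whenever any thread queue changes
  processingCount: number, // number of threads currently processing
  schedule: () => void     // starts eligible threads (see the scheduler sketch)
): void {
  useEffect(() => {
    // Re-evaluate scheduling whenever queues or processing state change.
    schedule()
  }, [queueVersion, processingCount, schedule])
}
```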