
Conversation


@hermit46 hermit46 commented Aug 18, 2025

Describe Your Changes

This PR implements a simultaneous inference system that lets users queue multiple messages while AI models are processing, so input is no longer blocked during a response and multi-question workflows are not interrupted.

🚀 Key Features:

Core Infrastructure:

  • Message Queueing System: Per-thread FIFO queues allowing unlimited message queuing during AI processing
  • Inference Scheduler: Coordinates thread processing with race condition prevention and automatic queue processing
  • Thread State Management: Enhanced state management with processing locks and concurrency controls
  • Fallback Mode: Single-thread processing mode for current LLaMA.cpp limitations, ready for future parallel support

User Experience Improvements:

  • Non-blocking Input: Users can continue typing and queuing messages while AI responds
  • Queue Visibility: Real-time display of queued message count per thread
  • Automatic Processing: FIFO queue processing when AI becomes available
  • Seamless UX: Maintains existing chat interface while adding powerful background capabilities

Performance & Testing:

  • Comprehensive Test Suite: 5,900+ lines of tests covering integration, performance, concurrency, and edge cases
  • Performance Optimizations: Memoized hooks and optimized state management
  • Benchmark Infrastructure: Performance regression testing and baseline measurement tools

🔧 Technical Implementation:

  • Queue Infrastructure (useAppState.ts): Thread-safe message queuing with processing state management (see the sketch after this list)
  • Scheduler Engine (useInferenceScheduler.ts): Coordinates inference requests and prevents race conditions
  • Chat Integration: Enhanced ChatInput.tsx and useChat.ts with queue functionality
  • Testing Framework: Comprehensive test coverage for concurrent operations and edge cases
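
A minimal sketch of the per-thread queue slice described above, assuming a Zustand store (which the useAppState.getState() calls later in this PR suggest); the field and method names follow the PR description, while the message type and exact store shape are assumptions:

import { create } from 'zustand'

// Sketch only: the message shape is an assumption, not the PR's actual type.
type QueuedMessage = { id: string; text: string }

interface QueueSlice {
  queuedMessagesByThread: Record<string, QueuedMessage[]>
  processingThreads: Record<string, boolean>
  addToThreadQueue: (threadId: string, message: QueuedMessage) => void
  removeFromThreadQueue: (threadId: string) => QueuedMessage | undefined
  getThreadQueueLength: (threadId: string) => number
  setThreadProcessing: (threadId: string, processing: boolean) => void
}

export const useAppState = create<QueueSlice>((set, get) => ({
  queuedMessagesByThread: {},
  processingThreads: {},
  // Append to the end of the thread's queue (FIFO order is preserved)
  addToThreadQueue: (threadId, message) =>
    set((s) => ({
      queuedMessagesByThread: {
        ...s.queuedMessagesByThread,
        [threadId]: [...(s.queuedMessagesByThread[threadId] ?? []), message],
      },
    })),
  // Pop the oldest queued message for the thread, or undefined if empty
  removeFromThreadQueue: (threadId) => {
    const [next, ...rest] = get().queuedMessagesByThread[threadId] ?? []
    set((s) => ({
      queuedMessagesByThread: { ...s.queuedMessagesByThread, [threadId]: rest },
    }))
    return next
  },
  getThreadQueueLength: (threadId) =>
    (get().queuedMessagesByThread[threadId] ?? []).length,
  // Per-thread processing lock used by the scheduler to avoid double-processing
  setThreadProcessing: (threadId, processing) =>
    set((s) => ({
      processingThreads: { ...s.processingThreads, [threadId]: processing },
    })),
}))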

📈 Impact:

  • Zero Breaking Changes: Fully backward compatible with existing functionality
  • Future-Ready: Infrastructure prepared for LLaMA.cpp parallel processing when available
  • Performance Optimized: Improved state management with memoization and selective re-rendering

Fixes Issues

  • Addresses user frustration with blocked input during AI responses
  • Resolves workflow interruption when users need to queue multiple questions
  • Implements foundation for true simultaneous inference capabilities
  • Fixes React testing act() warnings in ChatInput tests

Self Checklist

  • Added relevant comments, esp in complex areas
    • Comprehensive JSDoc documentation for all hooks and functions
    • Detailed inline comments for race condition prevention logic
    • Architecture explanation comments for scheduler coordination
  • Updated docs (for bug fixes / features)
    • Created detailed MVP implementation documentation
    • Added performance benchmarking and testing guides
    • Documented state management patterns and best practices
  • Created issues for follow-up changes or refactoring needed
    • TODO: Integration with LLaMA.cpp parallel flag when available
    • TODO: Advanced scheduling algorithms for priority-based processing
    • TODO: Queue persistence across app restarts
    • TODO: UI migration (~200 LOC) before enabling parallel

📊 Code Statistics

Overall Changes:

  • Files Modified/Added: 22 files
  • Lines Added: 6,289 lines
  • Lines Removed: 88 lines
  • Net Addition: +6,201 lines

Core Functionality:

  • Production Code: ~500 lines (queue system + scheduler + integration)
  • Test Infrastructure: ~5,700 lines (comprehensive test coverage)
  • Documentation: ~500 lines (comments, JSDoc, architecture docs)

Test Coverage:

  • Integration Tests: End-to-end queue and processing workflows
  • Performance Tests: Regression testing and benchmarking
  • Concurrency Tests: Race condition and thread safety verification
  • Migration Tests: Backward compatibility validation
  • Unit Tests: Individual hook and component testing (an example follows this list)
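
As one concrete example of the unit-test layer, a small Vitest-style test exercising the queue API sketched earlier (the test runner and import path are assumptions; the method names come from the PR):

import { describe, it, expect } from 'vitest'
import { useAppState } from '@/hooks/useAppState' // path is an assumption

describe('per-thread FIFO queues', () => {
  it('isolates queues per thread and preserves FIFO order', () => {
    const { addToThreadQueue } = useAppState.getState()

    addToThreadQueue('thread-a', { id: '1', text: 'first' })
    addToThreadQueue('thread-a', { id: '2', text: 'second' })
    addToThreadQueue('thread-b', { id: '3', text: 'other thread' })

    // Queues are isolated: thread-a holds two messages, thread-b holds one
    expect(useAppState.getState().getThreadQueueLength('thread-a')).toBe(2)
    expect(useAppState.getState().getThreadQueueLength('thread-b')).toBe(1)

    // FIFO within a thread: the earliest queued message is dequeued first
    expect(useAppState.getState().removeFromThreadQueue('thread-a')?.id).toBe('1')
    expect(useAppState.getState().removeFromThreadQueue('thread-a')?.id).toBe('2')
  })
})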

Important

Introduces a simultaneous inference system with per-thread state management, queue handling, and scheduling, including extensive testing for backward compatibility and performance.

  • Behavior:
    • Implements simultaneous inference system with per-thread FIFO queues and inference scheduler.
    • Supports non-blocking input and real-time queue visibility.
    • Fallback to single-thread mode for LLaMA.cpp limitations.
  • State Management:
    • Adds per-thread state management in useAppState for prompts, queued messages, and errors.
    • Introduces concurrent processing state with methods for managing thread processing.
  • Scheduler:
    • useInferenceScheduler coordinates thread processing with race condition prevention.
    • useAutoScheduler automatically triggers scheduling on state changes (a sketch follows this summary).
  • Testing:
    • Extensive tests for backward compatibility, performance, and edge cases in useAppState and useInferenceScheduler.
    • Includes tests for migration, error recovery, and memory management.
  • Misc:
    • Updates useChat to integrate with new scheduling system.
    • Adds utility functions in simultaneous-inference-utils.ts for testing.
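
A rough sketch of the auto-scheduler idea summarized above, assuming useInferenceScheduler exposes a schedule() callback and useAppState is a Zustand store; only the hook names come from the PR, the exact signatures are assumptions:

import { useEffect } from 'react'
import { useAppState } from '@/hooks/useAppState'
import { useInferenceScheduler } from '@/hooks/useInferenceScheduler'

// Whenever queue contents or processing flags change, ask the scheduler to run.
export function useAutoScheduler() {
  const { schedule } = useInferenceScheduler()
  const queuedMessagesByThread = useAppState((s) => s.queuedMessagesByThread)
  const processingThreads = useAppState((s) => s.processingThreads)

  useEffect(() => {
    schedule()
  }, [queuedMessagesByThread, processingThreads, schedule])
}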

This description was created by Ellipsis for c10c6d3.


@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed everything up to c10c6d3 in 2 minutes and 21 seconds.
  • Reviewed 6892 lines of code in 22 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 draft comments. View those below.
1. web-app/src/hooks/useAppState.ts:177
  • Draft comment:
    There is no implementation for a method called setThreadQueuedMessage, yet migration tests and some new APIs call it. Consider adding it or updating tests to use the existing addToThreadQueue method.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. web-app/src/hooks/useInferenceScheduler.ts:43
  • Draft comment:
    Consider documenting the expectation for single-thread fallback mode and future extension for parallel scheduling. The logic in determining threadsToStart might need clarity for future developers.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
3. web-app/src/hooks/useChat.ts:245
  • Draft comment:
    After sending a message, the code calls clearThreadPrompt by retrieving it from the global store. Make sure that this approach is safe and does not interfere with other thread prompt states.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
4. web-app/src/test/simultaneous-inference-utils.ts:142
  • Draft comment:
    Utility functions and mocks look comprehensive; consider adding comments for edge-case behavior in createMockSendMessage (e.g., behavior on failureRate conditions) to aid future maintenance.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50% None
5. web-app/src/routes/__root.tsx:49
  • Draft comment:
    The integration of useAutoScheduler in the Root layout is clear. Ensure that auto-scheduler’s side effects do not conflict with other router/panel state changes.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50% None
6. web-app/src/containers/ChatInput.tsx:117
  • Draft comment:
    Typo detected: the function name 'handleSendMesage' appears to be misspelled. Consider renaming it to 'handleSendMessage' for clarity and consistency.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_fP7TWoM9zQ6k1K60


@hermit46

🔧 Concurrency Activation Implementation Guide

Current Status

The simultaneous inference MVP is implemented with infrastructure ready for parallel processing, but currently operates in single-thread mode. To enable actual concurrent processing when llama.cpp supports it, follow these steps:

Step 1: Add Parallel Setting to llamacpp-extension

File: extensions/llamacpp-extension/settings.json
Location: After line 349 (after json_schema_file setting)

{
  "key": "parallel",
  "title": "Parallel Processing",
  "description": "Number of parallel inference requests the model can handle simultaneously. Set to 1 for single-thread mode, higher values for concurrent processing.",
  "controllerType": "input",
  "controllerProps": {
    "value": 1,
    "placeholder": "1",
    "type": "number",
    "min": 1,
    "max": 8,
    "step": 1,
    "textAlign": "right"
  }
}

Step 2: Update LlamacppConfig Type

File: extensions/llamacpp-extension/src/index.ts
Location: Line 64, after ctx_shift: boolean

type LlamacppConfig = {
  // ... existing fields ...
  ctx_shift: boolean
  parallel: number  // Add this line
}

Step 3: Connect Extension Config to Web App State

File: web-app/src/hooks/useModelProvider.ts (or similar config loading hook)

// Detect and apply parallel processing configuration
const applyLlamacppConfig = (config: LlamacppConfig) => {
  const { setMaxConcurrency, setParallelProcessingEnabled, setFallbackMode } = useAppState.getState()
  
  if (config.parallel && config.parallel > 1) {
    setMaxConcurrency(config.parallel)
    setParallelProcessingEnabled(true)
    setFallbackMode('user-configured')
  } else {
    setMaxConcurrency(1)
    setParallelProcessingEnabled(false)
    setFallbackMode('single-thread')
  }
}

Step 4: Optional - Auto-Detection from llama.cpp Server

File: web-app/src/services/llamacpp.ts (or create new service)

// Detect parallel capabilities from llama.cpp server
const detectParallelCapabilities = async () => {
  try {
    const response = await fetch('/v1/status')
    const status = await response.json()
    
    if (status.parallel_slots || status.slots) {
      const detectedSlots = status.parallel_slots || status.slots
      const { setMaxConcurrency, setParallelProcessingEnabled, setFallbackMode } = useAppState.getState()
      
      setMaxConcurrency(detectedSlots)
      setParallelProcessingEnabled(detectedSlots > 1)
      setFallbackMode('detected')
      
      return detectedSlots
    }
  } catch (error) {
    console.warn('Could not detect llama.cpp parallel capabilities:', error)
  }
  
  return null
}

Step 5: Update Settings UI (Optional)

The parallel setting will automatically appear in the llamacpp provider settings UI once added to settings.json. No additional UI changes needed.

Activation Flow

When llama.cpp parallel support is available:

  • User configures: Sets parallel > 1 in llamacpp settings
  • Extension loads: LlamacppConfig.parallel value loaded
  • Web app detects: Config loading triggers setMaxConcurrency(parallel)
  • Scheduler activates: parallelProcessingEnabled = true enables multi-thread processing
  • Concurrent processing: Multiple threads process simultaneously

Fallback Behavior

  • No setting: Defaults to single-thread mode (parallel = 1)
  • Detection fails: Falls back to single-thread mode
  • User override: Always respects user configuration over auto-detection (a scheduling sketch follows this list)
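
A minimal sketch of a scheduling pass that honors the activation flow and fallback behavior above: with maxConcurrency = 1 at most one thread runs at a time, and higher values start additional queued threads. The state field names follow the PR (maxConcurrency comes from its configuration state, not the queue sketch shown earlier); startThread is a hypothetical callback that kicks off inference for a thread:

import { useAppState } from '@/hooks/useAppState'

export function scheduleNextThreads(startThread: (threadId: string) => void) {
  const {
    queuedMessagesByThread,
    processingThreads,
    maxConcurrency,
    setThreadProcessing,
  } = useAppState.getState()

  // How many more threads may start right now
  const active = Object.values(processingThreads).filter(Boolean).length
  const capacity = Math.max(0, (maxConcurrency ?? 1) - active)
  if (capacity === 0) return // single-thread fallback: wait until the busy thread finishes

  // Object key order ≈ thread creation order, matching the prioritization noted in this PR
  const waiting = Object.keys(queuedMessagesByThread).filter(
    (id) => (queuedMessagesByThread[id]?.length ?? 0) > 0 && !processingThreads[id]
  )

  for (const threadId of waiting.slice(0, capacity)) {
    setThreadProcessing(threadId, true) // lock first to prevent double-processing
    startThread(threadId)
  }
}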

Testing Activation

// Verify concurrency is working:
const { maxConcurrency, parallelProcessingEnabled } = useAppState.getState()
console.log(`Max concurrency: ${maxConcurrency}`)
console.log(`Parallel enabled: ${parallelProcessingEnabled}`)

Test with multiple threads:

  1. Queue messages in multiple threads
  2. Verify multiple threads process simultaneously
  3. Check processing state shows multiple active threads

This implementation maintains backward compatibility while providing a clear upgrade path for concurrent processing when llama.cpp supports it.

@hermit46 hermit46 commented Aug 18, 2025

Expected behaviors (MVP):

  • Able to queue multiple messages across threads ✅
  • Inference prioritization is object iteration order (approximately thread creation order)
    • Each thread has its own FIFO queue ✅
Screen.Recording.2025-08-18.at.11.57.07.PM.mov

When parallel support for llama.cpp is up, inference can be handled concurrently using the setup guide above.

Our implementation will need to expand to handle more race conditions (see Appendix), but this should be a good stopping point for a code review.

Appendix

❌ Race Conditions That WILL Break with Concurrency > 1:

1. Global State Conflicts

// When 3 threads process simultaneously:
Thread A: updateStreamingContent(contentA)  // ← Overwrites global
Thread B: updateStreamingContent(contentB)  // ← Overwrites Thread A  
Thread C: updateStreamingContent(contentC)  // ← Overwrites Thread B

// Result: UI only shows Thread C's content, A & B lost

2. Shared Resource Conflicts

// Multiple threads calling:
updateTokenSpeed(messageA)  // ← Global token calculation
updateTokenSpeed(messageB)  // ← Overwrites Thread A's speed
updateTokenSpeed(messageC)  // ← Overwrites Thread B's speed

// Result: Token speed calculation is corrupted

3. React State Batching Issues

// Rapid concurrent state updates:
setThreadProcessing("A", false)  // ← Batched
setThreadProcessing("B", false)  // ← Batched  
setThreadProcessing("C", false)  // ← Batched

// React batches these updates → scheduler sees stale state
// Could trigger multiple schedule() calls simultaneously

4. AbortController Conflicts

// Global abort handling:
setAbortController(threadId, controller)  // ← Per-thread (OK)
// But if UI shows global streamingContent, abort might affect wrong thread

🎯 HONEST ASSESSMENT: Current State

✅ What WORKS with Concurrency:

  • Queue management: Per-thread queues handle multiple threads correctly
  • Thread locking: setThreadProcessing() prevents double-processing
  • Message routing: sendMessage(threadId) goes to correct threads
  • FIFO within threads: Guaranteed by queue structure

❌ What BREAKS with Concurrency:

  • Streaming display: Global streamingContent will show wrong thread's content
  • Token speed: Global tokenSpeed gets corrupted by concurrent updates
  • UI state: Global state overwrites cause display issues
  • User experience: Complete breakdown of which response belongs where

What We Actually Built:

  • ✅ Perfect infrastructure for concurrent processing
  • ✅ Thread isolation in message routing
  • ✅ Queue management that supports parallelism
  • ❌ UI layer still assumes single-thread global state

What Would Happen if We Set maxConcurrency = 3:

Thread A, B, C all start processing simultaneously
→ Global streamingContent gets overwritten by each thread
→ UI shows garbled mix of responses from different threads  
→ Token speed calculations become meaningless
→ User sees broken, confusing interface

🛠️ To Actually Support Concurrency, We Need:

  1. Thread-Aware UI State (Required)
  • Fix the streaming display to use per-thread state (a sketch follows this list)
  • Fix token speed to be per-thread
  • Update all UI components to thread-aware selectors
  2. Concurrent State Update Handling (Required)
  • Proper React state update batching
  • Thread-safe state transitions
  • Cleanup coordination across multiple threads
  3. UI Thread Association (Required)
  • Clear visual indication of which thread is streaming
  • Prevent UI confusion with multiple simultaneous responses
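
For the thread-aware UI state called out in item 1 above, a sketch of per-thread selector hooks. The hook names useStreamingContent and useTokenSpeed appear in this PR's follow-up commits; the store fields streamingContentByThread and tokenSpeedByThread are assumptions:

import { useAppState } from '@/hooks/useAppState'

// Each hook subscribes only to its own thread's slice, so a chat view
// re-renders for its own stream rather than a single global streamingContent.
export const useStreamingContent = (threadId: string) =>
  useAppState((s) => s.streamingContentByThread?.[threadId] ?? '')

export const useTokenSpeed = (threadId: string) =>
  useAppState((s) => s.tokenSpeedByThread?.[threadId] ?? 0)

// Usage in a thread view (illustrative):
//   const content = useStreamingContent(thread.id)
//   const speed = useTokenSpeed(thread.id)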

📊 LOC Estimate for True Concurrency Support:

  • UI Migration: ~50-80 lines (6 components × 8-12 lines each)
  • State Fixes: ~30-50 lines (proper concurrent state handling)
  • Testing: ~40-60 lines (concurrent scenarios)

Total: ~120-190 additional lines

🎯 Recommendation:

Our current MVP is a solid stopping point as-is because:

  • ✅ Infrastructure is concurrency-ready
  • ✅ Single-thread mode works flawlessly
  • ✅ Clear upgrade path when needed
  • ❌ UI layer not ready for actual concurrency

When llama.cpp adds parallel support:

  • Don't immediately enable it - keep single-thread mode
  • Complete the UI migration first (thread-aware state)
  • Then enable concurrency with proper testing

Our per-thread queue design is correct for concurrency - we just need to finish the UI layer migration before enabling it.

@louis-menlo louis-menlo self-assigned this Aug 19, 2025
@louis-menlo louis-menlo changed the title from "Feat/simultaneous inference" to "feat: simultaneous inference" Aug 21, 2025
@louis-menlo louis-menlo moved this to Eng Planning in Jan Aug 21, 2025
@louis-menlo louis-menlo moved this from Eng Planning to Todo in Jan Aug 21, 2025
@louis-menlo

Hi @hermit46, can you help us rebase to resolve the conflict?

- Create useStreamingContent, useThreadError, useTokenSpeed hooks
- Add useQueuedMessages and useThreadQueueLength for queue management
- Implement useIsThreadActive and useThreadState for comprehensive thread info
- Provide useActiveThreads to get all threads with active state
- Use shallow comparison for performance optimization in useThreadState
- Replace global setQueuedMessage with thread-aware removeFromThreadQueue
- Update message processing to handle per-thread message queues
- Maintain existing functionality while supporting multiple queued messages per thread
- Remove dependency on legacy global queue state
- Test multi-message queue per thread functionality
- Verify thread isolation and queue persistence across thread switches
- Test queue management operations (add, remove, clear)
- Validate FIFO processing order and edge case handling
- Test integration with convenience hooks for queue management
- Ensure performance with large queues and rapid operations
- Test useStreamingContent, useThreadError, useTokenSpeed hooks
- Validate useQueuedMessages and useThreadQueueLength functionality
- Test useIsThreadActive and useThreadState comprehensive state access
- Verify useActiveThreads returns correct active thread list
- Ensure hooks properly integrate with useAppState per-thread system
- Test all new per-thread state management methods
- Validate thread isolation and state separation
- Test streaming content, token speed, and error handling per thread
- Verify queue operations (add, remove, clear, length) work correctly
- Test thread cleanup and bulk operations
- Ensure getAllActiveThreads returns correct active thread list
- Test that legacy global state methods continue to work unchanged
- Verify new per-thread methods don't interfere with existing functionality
- Test state transitions between legacy and new systems
- Ensure backward compatibility is maintained for existing components
- Validate that both systems can coexist during transition period
- Test interaction between useAppState and useThreadState hooks
- Validate thread state convenience hooks work with app state
- Test end-to-end thread management workflows
- Verify proper state synchronization across different hooks
- Ensure consistent behavior when using multiple thread-aware hooks together
- Test that legacy and new methods are properly separated
- Validate method organization and grouping structure
- Test that related functionality is logically grouped
- Ensure clear separation between backward compatibility and new features
- Verify method signatures and behavior match their intended purpose
…mance

- Replace useShallow approach with individual hooks for optimal performance
- Each hook only re-renders when its specific value changes
- Eliminates unnecessary object creation on every render
- Maintains useThreadState for backward compatibility when all properties needed
- Add comprehensive performance tests demonstrating the optimization benefits
- Individual hooks provide stable references and better memory efficiency
- Test object creation frequency between individual hooks and useThreadState
- Verify reference stability and memory efficiency improvements
- Demonstrate when individual hooks provide better performance
- Test real-world usage patterns and frequent re-render scenarios
- Provide performance recommendations for optimal hook usage
- Benchmark individual hooks vs useThreadState performance characteristics
- Test memory efficiency with 1000+ re-render scenarios
- Demonstrate real-world chat component performance patterns
- Show object creation frequency and reference stability differences
- Provide concrete performance recommendations for developers
- Add queuedMessagesByThread state for per-thread message queues
- Implement addToThreadQueue(), removeFromThreadQueue(), getThreadQueueLength()
- Add thread processing state management (processingThreads)
- Remove obsolete useThreadState.ts functionality (consolidated into useAppState)
- Add maxConcurrency and fallbackMode configuration

Provides foundation for simultaneous inference with thread-safe queue operations.
- Implement useInferenceScheduler hook for thread processing coordination
- Add useAutoScheduler for automatic queue processing triggers
- Support both single-thread fallback and future parallel modes
- Include race condition prevention and proper thread locking
- Add scheduling status monitoring and debugging capabilities

Ready for LLaMA.cpp parallel flag when available, currently uses fallback mode.
@hermit46 hermit46 force-pushed the feat/simultaneous-inference branch from bc2fe7b to f610a66 on September 6, 2025 at 08:05