When using Ollama as your LLM provider (instead of OpenAI), you need to configure system-wide environment variables before starting the Ollama service. These settings optimize performance, enable parallel processing, and help manage resource constraints.
Configure these environment variables on your system (not in the Flexible GraphRAG .env file):
OLLAMA_CONTEXT_LENGTH=8192

Configuration Options:
- 4096: Minimum for limited resources
- 8192: Recommended default
- 16384: For improved speed and extraction quality
Important Notes:
- The full 128K context window supported by llama3.2:3b requires 16.4GB of RAM for the key-value (KV) cache alone, plus ~3GB for model weights
- The 128K token context window allows processing ~96,240 words of text in a single interaction
- By default, inference engines (llama.cpp, transformers, Ollama) store both model weights and KV cache in GPU VRAM when available (fastest)
- If GPU VRAM is insufficient, the KV cache falls back to system RAM with potential speed penalty
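The memory figures above can be sanity-checked with a back-of-the-envelope estimate. The sketch below uses the standard KV-cache sizing formula; the layer/head counts in the example are assumptions chosen for illustration (small Llama-family models commonly use 28 layers, 8 KV heads, and a head dimension of 128), and the exact footprint in Ollama also depends on cache quantization and runtime overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: 2 tensors (keys + values), per layer,
    per KV head, per head dimension, per token, at fp16 (2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative parameters (assumed, not read from Ollama) for a small
# Llama-family model at a full 128K-token (131072) context:
est = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, context_len=131072)
print(f"~{est / 1e9:.1f} GB for the KV cache alone")  # roughly 15 GB
```

Halving the context length halves the KV cache, which is why dropping from 16384 to 4096 is the first lever to pull when VRAM is tight.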
OLLAMA_DEBUG=1

Values:
- 1: Enable debug logging
- 0: Disable debug logging
Log Locations:
- Windows: C:\Users\<username>\AppData\Local\Ollama\server.log
- Linux/macOS: Check the Ollama documentation for your platform
Use Cases:
- Checking GPU memory availability
- Identifying CPU fallback behavior
- Troubleshooting performance issues
OLLAMA_KEEP_ALIVE=30m

Keeps models loaded in memory for faster subsequent requests. Adjust the time based on your usage patterns and available memory.
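The value uses a duration string ("30m", "1h", "45s") or a bare number of seconds. A minimal sketch of how such values map to seconds, assuming single-unit durations only (Ollama itself also accepts 0 to unload immediately and a negative value to keep a model loaded indefinitely):

```python
def keep_alive_seconds(value: str) -> float:
    """Convert an OLLAMA_KEEP_ALIVE-style duration ('30m', '1h', '45s',
    or a bare number of seconds) to seconds. Illustrative sketch:
    handles single-unit values only."""
    units = {"s": 1, "m": 60, "h": 3600}
    if value and value[-1] in units:
        return float(value[:-1]) * units[value[-1]]
    return float(value)

print(keep_alive_seconds("30m"))  # 1800.0
```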
OLLAMA_MAX_LOADED_MODELS=4

Values:
- 0: No limit (loads as many as needed)
- 4: Recommended for most systems
- Adjust based on your available memory
# Windows example
OLLAMA_MODELS=C:\Users\<username>\.ollama\models
# Linux/macOS example
OLLAMA_MODELS=/home/<username>/.ollama/models

Usually set automatically by Ollama, but can be customized for specific storage locations.
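Tooling that needs to locate the models directory can apply the same precedence Ollama does — honor the override if set, otherwise fall back to the default. A small sketch (the fallback path assumes the usual `~/.ollama/models` default shown above):

```python
import os
from pathlib import Path

def default_models_dir() -> Path:
    """Resolve the models directory: honor OLLAMA_MODELS if set,
    otherwise fall back to ~/.ollama/models (the usual default)."""
    override = os.environ.get("OLLAMA_MODELS")
    return Path(override) if override else Path.home() / ".ollama" / "models"
```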
OLLAMA_NUM_PARALLEL=4

- Required for Flexible GraphRAG parallel file processing
- Prevents processing errors during parallel document ingestion
- Allows Ollama to handle multiple concurrent requests
- Must match or exceed the number of worker threads used by the system
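The "must match or exceed" point can be sketched from the client side: if the ingestion code fans out over a thread pool, its worker count should stay at or below `OLLAMA_NUM_PARALLEL` so requests are served concurrently instead of queuing or erroring. The `process_document` function below is a placeholder, not Flexible GraphRAG's actual API — a real worker would POST to the local Ollama server.

```python
from concurrent.futures import ThreadPoolExecutor

OLLAMA_NUM_PARALLEL = 4  # server-side concurrency limit

def process_document(doc: str) -> str:
    # Placeholder for an extraction request to Ollama; a real client
    # would call the local Ollama server here.
    return f"processed {doc}"

docs = [f"doc{i}" for i in range(8)]
# Keep client-side concurrency <= OLLAMA_NUM_PARALLEL so each in-flight
# request maps to a slot the server can actually serve in parallel.
with ThreadPoolExecutor(max_workers=OLLAMA_NUM_PARALLEL) as pool:
    results = list(pool.map(process_document, docs))
```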
- Open System Properties → Advanced → Environment Variables
- Under System variables (not User variables), click New
- Add each variable name and value
- Click OK to save
- Restart the Ollama service:
net stop Ollama
net start Ollama
- Add to your shell profile (~/.bashrc, ~/.zshrc, etc.):
export OLLAMA_CONTEXT_LENGTH=8192
export OLLAMA_DEBUG=1
export OLLAMA_KEEP_ALIVE=30m
export OLLAMA_MAX_LOADED_MODELS=4
export OLLAMA_NUM_PARALLEL=4
- Reload your shell configuration:
source ~/.bashrc  # or ~/.zshrc
- Restart the Ollama service:
systemctl restart ollama   # Linux with systemd
brew services restart ollama   # macOS with Homebrew
After configuration, verify the settings are active:
- Check Ollama is running:
ollama list
- Test with a simple request:
ollama run llama3.2:3b "Hello"
- Check debug logs (if OLLAMA_DEBUG=1):
  - Windows: C:\Users\<username>\AppData\Local\Ollama\server.log
  - Look for configuration values and GPU/CPU usage information
Symptom: Errors when processing multiple documents simultaneously
Solution: Ensure OLLAMA_NUM_PARALLEL=4 is set system-wide and Ollama service has been restarted
Symptoms:
- Document processing takes much longer than expected
- High CPU usage but low GPU usage
Possible Causes:
- GPU VRAM exhausted: Context window too large for available VRAM
- CPU fallback: Model running on CPU instead of GPU
Solutions:
- Reduce OLLAMA_CONTEXT_LENGTH to 4096
- Check debug logs for GPU memory issues
- Close other GPU-intensive applications
- Consider using a smaller model (e.g., llama3.2:3b instead of gpt-oss:20b)
Solution:
- Reduce OLLAMA_CONTEXT_LENGTH
- Reduce OLLAMA_MAX_LOADED_MODELS
- Ensure adequate system RAM (16GB+ recommended)
- llama3.2:3b: Lightweight, fast, good for testing
- llama3.1:8b: Balanced performance and quality
- gpt-oss:20b: Higher quality, requires more resources
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| System RAM | 8GB | 16GB | 32GB+ |
| GPU VRAM | 4GB | 8GB | 12GB+ |
| Context Length | 4096 | 8192 | 16384 |
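The table's tiers can be folded into a small helper for scripts that pick a context length from detected VRAM. This is a hypothetical heuristic mirroring the thresholds above, not an official Ollama recommendation:

```python
def suggested_context_length(vram_gb: float) -> int:
    """Map available GPU VRAM to the context-length tiers in the table
    (illustrative heuristic only)."""
    if vram_gb >= 12:
        return 16384
    if vram_gb >= 8:
        return 8192
    return 4096

print(suggested_context_length(8))  # 8192
```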
- OLLAMA_NUM_PARALLEL=4 enables 4 concurrent requests
- Higher values require more memory but improve throughput
- Match this value to your available resources
Key Points:
- ✓ Set environment variables system-wide (not in the Flexible GraphRAG .env)
- ✓ OLLAMA_NUM_PARALLEL=4 is critical for parallel processing
- ✓ Always restart the Ollama service after changing environment variables
- ✓ Use OLLAMA_DEBUG=1 to troubleshoot performance issues
- ✓ Adjust OLLAMA_CONTEXT_LENGTH based on available resources
These settings ensure optimal Ollama performance with Flexible GraphRAG's parallel document processing capabilities.