Description
The official OpenCTI documentation does not currently warn administrators about the destructive impact of running FLUSHDB or FLUSHALL on a live OpenCTI Redis instance, nor does it provide recovery procedures when this occurs.
This is a recurring operational scenario: when Redis memory grows unexpectedly (often due to connector queue backlogs), administrators may attempt to resolve it by flushing Redis. Doing so destroys critical platform state and triggers a cascading failure that is difficult to diagnose without understanding OpenCTI's internal architecture.
Problem
OpenCTI uses Redis for:
- Work tracking — each connector ingestion job is tracked via work IDs stored in Redis
- Distributed locks — preventing duplicate entity creation during concurrent ingestion
- Stream coordination — live stream and TAXII data sharing state
- Caching — API response caching and session data
Running FLUSHDB or FLUSHALL destroys all of this state. However, RabbitMQ queues survive (they're in a separate system), creating orphaned bundles that reference work IDs that no longer exist.
The failure chain
- Redis is flushed → all work-tracking state is destroyed
- RabbitMQ still has queued bundles referencing now-dead work IDs
- Workers dequeue bundles → attempt to update work status → Redis returns "work doesn't exist"
- Platform throws `WORK_NOT_ALIVE` errors ("Work is no longer alive, no request can be done within the context of this work")
- Workers cannot complete bundles → retry or stall
- Result: CPU burn on ingest nodes with zero Elasticsearch writes, and a massive queue backlog that never drains
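A quick way to confirm this failure mode from the command line (a sketch; the worker pod label and the Redis key pattern are assumptions, not documented OpenCTI internals, so adjust them to your deployment):

```shell
# Count recent occurrences of the error signature in worker logs
# (the app=opencti-worker label is an assumption)
kubectl logs -l app=opencti-worker --tail=500 | grep -c "WORK_NOT_ALIVE"

# Check whether any work-tracking keys survive in Redis
# (the work_* pattern is an assumed naming convention)
redis-cli --scan --pattern 'work_*' | head -n 20

# Compare against RabbitMQ queue depth: a large backlog combined with
# no surviving work keys points at orphaned bundles
rabbitmqctl list_queues name messages
```

If the backlog is large while the key scan returns nothing, the queued bundles almost certainly reference work IDs that no longer exist.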
Symptoms
- `WORK_NOT_ALIVE` errors in platform/worker logs
- Queue backlog growing or not draining despite healthy infrastructure
- Elasticsearch idle (zero write rejections, zero active merges) despite large queue
- Ingest node CPU imbalance — some pods hot (retry loops), others idle
- Works stuck "In Progress" with zero completed operations
Requested Documentation
A section in the Troubleshooting page (or a dedicated page) covering:
1. Warning: Never run FLUSHDB/FLUSHALL on a live OpenCTI Redis
- What state is stored in Redis and why it's critical
- What happens when it's destroyed (the failure chain above)
2. Recovery Procedure
When FLUSHDB has already been run:
- Purge stale connector queues in RabbitMQ (bundles referencing dead work IDs)
- Reset affected connector state in OpenCTI
- Restart ingest/worker pods
- Restart platform pods
- Restart connectors (they will create new work IDs)
- Monitor for `WORK_NOT_ALIVE` errors clearing
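The recovery steps above might be sketched as follows (the queue and deployment names are hypothetical placeholders; connector state reset has no assumed CLI and is done in the UI or via the API):

```shell
# 1. Purge stale connector queues in RabbitMQ (queue name is hypothetical)
rabbitmqctl purge_queue push_connector-example

# 2. Reset affected connector state from the OpenCTI UI or GraphQL API;
#    no CLI equivalent is assumed here.

# 3-4. Restart ingest/worker pods, then platform pods
#      (deployment names are hypothetical)
kubectl rollout restart deployment/opencti-worker
kubectl rollout restart deployment/opencti-platform

# 5. Restart connectors so they register fresh work IDs
kubectl rollout restart deployment/connector-example

# 6. Watch for WORK_NOT_ALIVE errors clearing from worker logs
kubectl logs -l app=opencti-worker -f | grep WORK_NOT_ALIVE
```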
3. Safe Alternatives When Redis Memory Is High
- Set `maxmemory` with a `noeviction` policy to prevent unbounded growth
- Use `redis-cli --bigkeys` to identify what's consuming memory
- Use stream trimming for event stream growth
- Purge specific RabbitMQ queues (not Redis) for connector backlogs
- Surgical key deletion for specific stuck locks
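Some of the alternatives above, as commands (a sketch: the stream name, queue name, and lock-key name are assumptions to verify against your own deployment before running anything):

```shell
# Identify what is actually consuming memory before deleting anything
redis-cli --bigkeys

# Trim the event stream instead of flushing the database
# (the stream name is an assumption)
redis-cli XTRIM stream.opencti MAXLEN '~' 2000000

# Drain a connector backlog at the queue, not in Redis
# (queue name is hypothetical)
rabbitmqctl purge_queue push_connector-example

# Delete one specific stuck lock rather than the whole database
# (key name is hypothetical)
redis-cli DEL locks:entity:example
```

Each of these removes only the targeted state, leaving work tracking, locks, and stream coordination intact.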
4. Recommended Redis Configuration
- `maxmemory`: should be set explicitly (do not rely on the container OOM killer)
- `maxmemory-policy`: must be `noeviction`
- Monitoring thresholds for memory, blocked clients, slowlog
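As a starting point, the settings above could be applied and persisted like this (the 8gb cap is illustrative, not a recommendation for every deployment):

```shell
# Explicit memory cap with noeviction, persisted back to redis.conf
redis-cli CONFIG SET maxmemory 8gb
redis-cli CONFIG SET maxmemory-policy noeviction
redis-cli CONFIG REWRITE

# Values worth alerting on: memory usage, blocked clients, slow commands
redis-cli INFO memory | grep used_memory_human
redis-cli INFO clients | grep blocked_clients
redis-cli SLOWLOG GET 10
```

With `noeviction`, memory pressure surfaces as explicit write errors instead of silent key loss, which is far easier to diagnose than the aftermath of a flush.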
Context
This documentation request is based on a real production incident at a customer site with 161M+ documents, 6 ingest nodes, and 12 workers. `FLUSHDB` was run to address Redis memory pressure, causing a 2+ week outage of ingestion processing. The root cause was non-obvious: all infrastructure components (disk I/O, Elasticsearch, Redis itself) appeared healthy, yet the write pipeline was completely stalled by orphaned work IDs.
The existing FAQ entry in internal documentation ("Is it safe to flush the redis cache?") provides a brief warning but lacks the failure chain explanation, symptoms, recovery procedure, and safe alternatives needed for operational use.
Labels
Documentation, Troubleshooting