[Documentation] Add Redis State Management & FLUSHDB/FLUSHALL Warning to Troubleshooting Guide #14881

@CTIBurn0ut

Description

The official OpenCTI documentation does not currently warn administrators about the destructive impact of running FLUSHDB or FLUSHALL on a live OpenCTI Redis instance, nor does it provide recovery procedures when this occurs.

This is a recurring operational scenario — when Redis memory grows unexpectedly (often due to connector queue backlogs), administrators may attempt to resolve it by flushing Redis. This destroys critical platform state and creates a cascading failure that is difficult to diagnose without understanding OpenCTI's internal architecture.

Problem

OpenCTI uses Redis for:

  • Work tracking — each connector ingestion job is tracked via work IDs stored in Redis
  • Distributed locks — preventing duplicate entity creation during concurrent ingestion
  • Stream coordination — live stream and TAXII data sharing state
  • Caching — API response caching and session data

Running FLUSHDB or FLUSHALL destroys all of this state. However, RabbitMQ queues survive (they're in a separate system), creating orphaned bundles that reference work IDs that no longer exist.

The failure chain

  1. Redis is flushed → all work-tracking state is destroyed
  2. RabbitMQ still has queued bundles referencing now-dead work IDs
  3. Workers dequeue bundles → attempt to update work status → Redis returns "work doesn't exist"
  4. Platform throws WORK_NOT_ALIVE errors ("Work is no longer alive, no request can be done within the context of this work")
  5. Workers cannot complete bundles → retry or stall
  6. Result: CPU burn on ingest nodes with zero Elasticsearch writes, massive queue backlog that never drains

Symptoms

  • WORK_NOT_ALIVE errors in platform/worker logs
  • Queue backlog growing or not draining despite healthy infrastructure
  • Elasticsearch idle (zero write rejections, zero active merges) despite large queue
  • Ingest node CPU imbalance — some pods hot (retry loops), others idle
  • Works stuck "In Progress" with zero completed operations
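These symptoms can be confirmed from the command line. The commands below are a sketch; the deployment names (`opencti-worker`), the `push_` queue prefix, and the Elasticsearch host are illustrative and should be adjusted to your environment.

```shell
# Count WORK_NOT_ALIVE occurrences in recent worker logs
# (deployment name is an example)
kubectl logs deploy/opencti-worker --since=1h | grep -c "WORK_NOT_ALIVE"

# Check RabbitMQ queue depth vs. consumers — a large backlog with
# attached consumers that never drains matches this failure mode
rabbitmqctl list_queues name messages consumers

# Confirm Elasticsearch write threads are idle despite the backlog
# (host is an example)
curl -s 'http://elasticsearch:9200/_nodes/stats/thread_pool?filter_path=**.write&pretty'
```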

Requested Documentation

A section in the Troubleshooting page (or a dedicated page) covering:

1. Warning: Never run FLUSHDB/FLUSHALL on a live OpenCTI Redis

  • What state is stored in Redis and why it's critical
  • What happens when it's destroyed (the failure chain above)

2. Recovery Procedure

When FLUSHDB has already been run:

  1. Purge stale connector queues in RabbitMQ (bundles referencing dead work IDs)
  2. Reset affected connector state in OpenCTI
  3. Restart ingest/worker pods
  4. Restart platform pods
  5. Restart connectors (they will create new work IDs)
  6. Monitor for WORK_NOT_ALIVE errors clearing
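The steps above could be sketched roughly as follows. This is illustrative only: queue and deployment names are assumptions (OpenCTI worker queues are typically prefixed `push_`, but verify in your RabbitMQ management UI first), and purging queues is destructive, so confirm each queue maps to an affected connector before running it.

```shell
# 1. Purge stale connector queues (the "push_" prefix is the usual
#    OpenCTI convention — verify before purging)
rabbitmqctl list_queues name | grep '^push_' | while read -r q; do
  rabbitmqctl purge_queue "$q"
done

# 2. Reset affected connector state via the OpenCTI UI or connector config

# 3./4. Restart worker and platform pods (deployment names are examples)
kubectl rollout restart deploy/opencti-worker
kubectl rollout restart deploy/opencti-platform

# 5. Restart connectors so they register new work IDs

# 6. Watch for WORK_NOT_ALIVE errors clearing
kubectl logs -f deploy/opencti-platform | grep WORK_NOT_ALIVE
```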

3. Safe Alternatives When Redis Memory Is High

  • Set maxmemory with noeviction policy to prevent unbounded growth
  • Use redis-cli --bigkeys to identify what's consuming memory
  • Use stream trimming for event stream growth
  • Purge specific RabbitMQ queues (not Redis) for connector backlogs
  • Surgical key deletion for specific stuck locks
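A sketch of these alternatives, assuming the default OpenCTI stream key `stream.opencti` and an example lock-key pattern — verify actual key names with `SCAN` before deleting anything:

```shell
# Identify what is consuming memory before touching anything
redis-cli --bigkeys
redis-cli info memory | grep used_memory_human

# Trim the event stream instead of flushing (retention target is an
# example; the '~' asks Redis to trim approximately, which is cheaper)
redis-cli XLEN stream.opencti
redis-cli XTRIM stream.opencti MAXLEN '~' 2000000

# Surgically delete one stuck lock rather than the whole keyspace
# (the pattern and key below are placeholders)
redis-cli SCAN 0 MATCH 'locks:*' COUNT 100
redis-cli DEL 'locks:<entity-id>'
```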

4. Recommended Redis Configuration

  • maxmemory — should be set explicitly (not rely on container OOM)
  • maxmemory-policy — must be noeviction
  • Monitoring thresholds for memory, blocked clients, slowlog
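A minimal `redis.conf` fragment reflecting these recommendations; the sizes and thresholds are illustrative and should be tuned to actual ingest volume:

```
# Cap memory explicitly rather than relying on container OOM kills
# (8gb is an example — size to your ingest volume)
maxmemory 8gb

# noeviction: OpenCTI state must never be silently evicted; at the
# limit, writes fail loudly instead, which is diagnosable
maxmemory-policy noeviction

# Keep the slowlog useful for monitoring (thresholds are examples)
slowlog-log-slower-than 10000
slowlog-max-len 128
```

Monitoring should alert on `used_memory` approaching `maxmemory`, on `blocked_clients` from `INFO clients`, and on slowlog growth.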

Context

This documentation request is based on a real production incident at a customer site with 161M+ documents, 6 ingest nodes, and 12 workers. The FLUSHDB was run to address Redis memory pressure, which caused a 2+ week outage of ingestion processing. The root cause was non-obvious — all infrastructure components (disk I/O, Elasticsearch, Redis itself) appeared healthy, but the write pipeline was completely stalled due to orphaned work IDs.

The existing FAQ entry in internal documentation ("Is it safe to flush the redis cache?") provides a brief warning but lacks the failure chain explanation, symptoms, recovery procedure, and safe alternatives needed for operational use.

Labels

Documentation, Troubleshooting
