IngestionQueue `failed` queue not getting populated with failed events during ClickHouse downtime #10565

I have a self-hosted Langfuse (v3) setup in which I run ClickHouse on ECS/Fargate with EFS-mounted volumes. It is a single-node setup, so redeployments cause downtime (this is expected). I therefore expected failed events to show up in the ingestion failed queue during the outage, but that did not happen: there were no failed events during the downtime (confirmed via the admin API and CloudWatch metrics), and the traces sent during the downtime are lost.

---

Based on the available sources, the behavior you're experiencing is actually expected in Langfuse v3.

### What Happens During ClickHouse Downtime

When ClickHouse is unavailable and the worker exhausts its retry attempts, failed events are intentionally dropped rather than being queued indefinitely(1). According to the documentation, Langfuse cannot retry indefinitely because "a broken trace or a persistent infrastructure issue would cause ripple effects through the entire platform. Therefore, we stop all retries after a certain number of failures"(1). The logs you're seeing are consistent with this behavior; a minimal sketch of the retry-and-drop pattern follows this reply.

### Monitoring and Recovery Options

The documentation outlines options for monitoring failed events and for recovering data after an outage(1).

### Regarding Your Environment Variable

The issue you're experiencing stems from the fundamental retry/failure-handling design in Langfuse v3, not from this specific configuration change.
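To make the retry-and-drop behavior concrete, here is a minimal sketch using BullMQ. This is an assumption on my part: the queue name, connection details, and option values are placeholders, not Langfuse's actual configuration. A bounded `attempts` setting caps retries, and `removeOnFail` decides whether exhausted jobs remain visible in the queue's failed set:

```typescript
import { Queue } from "bullmq";

async function main() {
  // Queue name and connection are placeholders; adjust to your deployment.
  const queue = new Queue("ingestion-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  await queue.add(
    "ingest-event",
    { traceId: "abc123" },
    {
      attempts: 5, // stop retrying after five failures
      backoff: { type: "exponential", delay: 1000 },
      // false keeps exhausted jobs in the failed set for inspection;
      // true removes them entirely, leaving no trace in Redis.
      removeOnFail: false,
    },
  );

  await queue.close();
}

main().catch(console.error);
```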

---

Based on the available documentation, I don't have specific information explaining why the failed queue isn't being populated in your setup. However, there are a few potential factors to consider:

### Queue Sharding Configuration

If you have Redis cluster mode enabled or have configured queue sharding, this could affect how failed jobs are handled(1). The documentation mentions that queues can be sharded via an environment variable.

### Redis Configuration

The documentation emphasizes that you must set Redis's `maxmemory-policy` to `noeviction`, since other eviction policies can silently discard queued jobs under memory pressure.

### Monitoring the Failed Queue

To investigate further, you could inspect the ingestion queue's failed set directly; see the sketch below.
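A minimal inspection sketch (the queue name and connection details are assumptions; adjust them to match your deployment):

```typescript
import { Queue } from "bullmq";

async function main() {
  const queue = new Queue("ingestion-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  // Per-state job counts: waiting, active, completed, failed, delayed, ...
  console.log(await queue.getJobCounts());

  // The ten most recent failed jobs and their failure reasons.
  for (const job of await queue.getFailed(0, 9)) {
    console.log(job.id, job.failedReason);
  }

  await queue.close();
}

main().catch(console.error);
```

If the failed set stays empty even while the worker logs insert errors, the failures are likely not being recorded in Redis at all.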
Unfortunately, the available documentation doesn't provide detailed troubleshooting steps for this specific scenario. I'd recommend reaching out to Langfuse support with your Redis and worker logs to investigate why failed jobs aren't appearing in the expected failed queue.

---

Given that your configuration is correct (noeviction policy, single shard, no Redis memory issues) and the traces are present in S3 but the failed queue isn't being populated, this points to a potential issue with how the worker handles failures once max retries are exhausted. The error logs you shared are consistent with that.
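For anyone verifying this themselves, the eviction policy can be checked directly. This is a sketch using ioredis; the host and port are placeholders:

```typescript
import Redis from "ioredis";

async function main() {
  const redis = new Redis({ host: "localhost", port: 6379 });

  // CONFIG GET returns a flat [key, value, ...] array.
  const [, policy] = (await redis.config("GET", "maxmemory-policy")) as string[];
  console.log("maxmemory-policy:", policy); // expect "noeviction" for queue workloads

  redis.disconnect();
}

main().catch(console.error);
```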
Unfortunately, the available documentation doesn't provide specific details about this failure path.

### Recommended Next Steps

Since your infrastructure configuration appears correct, I'd recommend raising this with the Langfuse team, including your worker logs and the error messages you shared, so they can determine why exhausted retries aren't producing entries in the failed queue.

---

Hi team, any update on why this could have happened?

---

@rjayaram-godaddy The Redis ingestion queue is used to forward events to the worker, where they are processed further. Once worker processing completes, they are added to the `ClickHouseWriterQueue`. This is an in-memory queue in the worker container that tries to insert events into ClickHouse multiple times, but eventually gives up. Right now, there is no dead-letter queue available for this writer queue.

There is a mechanism to replay specific events from S3 that were sent during a downtime: https://github.com/langfuse/langfuse/tree/main/worker/src/scripts/replayIngestionEvents. You can use that to restore them.

In general, we strongly recommend using a highly available ClickHouse instance.
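This explains the observation above: the final retry loop runs in worker memory, not in Redis, so exhausted batches never appear in any Redis failed set. Below is a minimal sketch of that pattern. It is illustrative only, not the actual `ClickHouseWriterQueue` code; the function name, attempt count, and backoff values are assumptions:

```typescript
// Illustrative sketch of an in-memory writer queue with bounded retries.
type IngestionEvent = { id: string; payload: unknown };

async function writeWithRetries(
  batch: IngestionEvent[],
  insert: (batch: IngestionEvent[]) => Promise<void>, // e.g. a ClickHouse INSERT
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await insert(batch);
      return; // success: the batch is persisted in ClickHouse
    } catch (err) {
      if (attempt === maxAttempts) {
        // No dead-letter queue: the batch is logged and dropped here.
        // The raw events still exist in S3 and can be replayed later.
        console.error(`dropping ${batch.length} events after ${attempt} attempts`, err);
        return;
      }
      // Exponential backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100));
    }
  }
}
```

Because the raw events remain in S3, the replayIngestionEvents script linked above is the recovery path; check the script's source for the exact invocation.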