IngestionQueue `failed` queue not getting populated with failed events during ClickHouse downtime #10565

I have a self-hosted Langfuse (v3) setup in which I run ClickHouse on ECS/Fargate with EFS-mounted volumes. It is a single-node setup, so redeployments cause downtime (this is expected). I therefore expected failed events to show up in the ingestion failed queue during the outage, but that did not happen: there were no failed events during the downtime (confirmed via the admin API and CloudWatch metrics), and the traces sent during the downtime are lost.

---

Based on the available sources, the behavior you're experiencing is actually expected in Langfuse v3.

### What Happens During ClickHouse Downtime

When ClickHouse is unavailable and the worker exhausts its retry attempts, failed events are intentionally dropped rather than being queued indefinitely(1). According to the documentation, Langfuse cannot retry indefinitely because "a broken trace or a persistent infrastructure issue would cause ripple effects through the entire platform. Therefore, we stop all retries after a certain number of failures"(1). The logs you're seeing are consistent with this behavior; a minimal sketch of the retry-and-drop pattern follows this reply.

### Monitoring and Recovery Options

The documentation outlines options for monitoring failed events and for recovering data after an outage(1).

### Regarding Your Environment Variable

The issue you're experiencing stems from the fundamental retry/failure-handling design in Langfuse v3, not from this specific configuration change.
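To make the retry-and-drop behavior concrete, here is a minimal sketch using BullMQ. This is an assumption on my part: the queue name, connection details, and option values are placeholders, not Langfuse's actual configuration. A bounded `attempts` setting caps retries, and `removeOnFail` decides whether exhausted jobs remain visible in the queue's failed set:

```typescript
import { Queue } from "bullmq";

async function main() {
  // Queue name and connection are placeholders; adjust to your deployment.
  const queue = new Queue("ingestion-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  await queue.add(
    "ingest-event",
    { traceId: "abc123" },
    {
      attempts: 5, // stop retrying after five failures
      backoff: { type: "exponential", delay: 1000 },
      // false keeps exhausted jobs in the failed set for inspection;
      // true removes them entirely, leaving no trace in Redis.
      removeOnFail: false,
    },
  );

  await queue.close();
}

main().catch(console.error);
```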

---

Based on the available documentation, I don't have specific information explaining why the failed queue isn't being populated in your setup. However, there are a few potential factors to consider:

### Queue Sharding Configuration

If you have Redis cluster mode enabled or have configured queue sharding, this could affect how failed jobs are handled(1). The documentation mentions that queues can be sharded via an environment variable.

### Redis Configuration

The documentation emphasizes that you must set Redis's `maxmemory-policy` to `noeviction`, since other eviction policies can silently discard queued jobs under memory pressure.

### Monitoring the Failed Queue

To investigate further, you could inspect the ingestion queue's failed set directly; see the sketch below.
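A minimal inspection sketch (the queue name and connection details are assumptions; adjust them to match your deployment):

```typescript
import { Queue } from "bullmq";

async function main() {
  const queue = new Queue("ingestion-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  // Per-state job counts: waiting, active, completed, failed, delayed, ...
  console.log(await queue.getJobCounts());

  // The ten most recent failed jobs and their failure reasons.
  for (const job of await queue.getFailed(0, 9)) {
    console.log(job.id, job.failedReason);
  }

  await queue.close();
}

main().catch(console.error);
```

If the failed set stays empty even while the worker logs insert errors, the failures are likely not being recorded in Redis at all.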
Unfortunately, the available documentation doesn't provide detailed troubleshooting steps for this specific scenario. I'd recommend reaching out to Langfuse support with your Redis and worker logs to investigate why failed jobs aren't appearing in the expected failed queue.

---

Given that your configuration is correct (noeviction policy, single shard, no Redis memory issues) and the traces are present in S3 but the failed queue isn't being populated, this points to a potential issue with how the worker handles failures once max retries are exhausted. The error logs you shared are consistent with that.
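For anyone verifying this themselves, the eviction policy can be checked directly. This is a sketch using ioredis; the host and port are placeholders:

```typescript
import Redis from "ioredis";

async function main() {
  const redis = new Redis({ host: "localhost", port: 6379 });

  // CONFIG GET returns a flat [key, value, ...] array.
  const [, policy] = (await redis.config("GET", "maxmemory-policy")) as string[];
  console.log("maxmemory-policy:", policy); // expect "noeviction" for queue workloads

  redis.disconnect();
}

main().catch(console.error);
```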
Unfortunately, the available documentation doesn't provide specific details about this failure path.

### Recommended Next Steps

Since your infrastructure configuration appears correct, I'd recommend raising this with the Langfuse team, including your worker logs and the error messages you shared, so they can determine why exhausted retries aren't producing entries in the failed queue.

---

Hi team, any update on why this could have happened?

---

@rjayaram-godaddy The Redis ingestion queue is used to forward events to the worker, where they are processed further. Once worker processing completes, they are added to the `ClickHouseWriterQueue`. This is an in-memory queue in the worker container that tries to insert events into ClickHouse multiple times, but eventually gives up. Right now, there is no dead-letter queue available for this writer queue.

There is a mechanism to replay specific events from S3 that were sent during a downtime: https://github.com/langfuse/langfuse/tree/main/worker/src/scripts/replayIngestionEvents. You can use that to restore them.

In general, we strongly recommend using a highly available ClickHouse instance.
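This explains the observation above: the final retry loop runs in worker memory, not in Redis, so exhausted batches never appear in any Redis failed set. Below is a minimal sketch of that pattern. It is illustrative only, not the actual `ClickHouseWriterQueue` code; the function name, attempt count, and backoff values are assumptions:

```typescript
// Illustrative sketch of an in-memory writer queue with bounded retries.
type IngestionEvent = { id: string; payload: unknown };

async function writeWithRetries(
  batch: IngestionEvent[],
  insert: (batch: IngestionEvent[]) => Promise<void>, // e.g. a ClickHouse INSERT
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await insert(batch);
      return; // success: the batch is persisted in ClickHouse
    } catch (err) {
      if (attempt === maxAttempts) {
        // No dead-letter queue: the batch is logged and dropped here.
        // The raw events still exist in S3 and can be replayed later.
        console.error(`dropping ${batch.length} events after ${attempt} attempts`, err);
        return;
      }
      // Exponential backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100));
    }
  }
}
```

Because the raw events remain in S3, the replayIngestionEvents script linked above is the recovery path; check the script's source for the exact invocation.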