Replies: 10 comments 12 replies
-
There is no direct information in the available sources about troubleshooting slow UI latency for viewing traces or single-trace details in a self-hosted Langfuse setup with your described AWS/ClickHouse configuration. The documentation covers how traces are queried and displayed, and notes that the Langfuse UI and APIs rely on aggregating and querying trace data, often involving complex queries, especially for large datasets (1)(2)(3). It also mentions that the query engine is designed for flexibility and performance (2). However, there are no explicit guidelines or troubleshooting steps for performance tuning or diagnosing latency issues in your specific architecture. If you would like, I can provide a general answer based on typical Langfuse and ClickHouse performance practices. Would you like a general answer based on best practices for performance in Langfuse deployments? Have another question? Just tag @inkeep.
-
So I wiped out my ClickHouse and reset it back to stock, and performance is back to normal. I had emitted about 10k traces before this.
-
Asked ChatGPT
-
Hi @dahnny012, we have docs on sizing and scaling Langfuse here: https://langfuse.com/self-hosting/configuration/scaling
-
Reaching out to @maxdeichmann because of https://langfuse.com/changelog/2024-02-29-performance-improvement-ui
-
Running in a SageMaker notebook with the same specs but using EBS storage results in a 100x speed-up. Need to see how this scales over time.
-
After transitioning the infra to EC2 with EBS, it is so much faster.
-
Latency is in the 100-300ms range if you switch the EFS throughput mode from Bursting to Elastic.
-
I also solved this by moving from EFS Bursting to EFS Enhanced -> Elastic.
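For reference, a minimal sketch of that change using boto3; the region and file system ID below are placeholders, not values from this deployment:

```python
# Sketch only: requires boto3 and AWS credentials with EFS permissions.
import boto3

efs = boto3.client("efs", region_name="us-east-1")  # placeholder region

# Switch the file system backing ClickHouse from Bursting to Elastic throughput mode.
efs.update_file_system(
    FileSystemId="fs-0123456789abcdef0",  # placeholder file system ID
    ThroughputMode="elastic",
)

# Confirm the new throughput mode.
resp = efs.describe_file_systems(FileSystemId="fs-0123456789abcdef0")
print(resp["FileSystems"][0]["ThroughputMode"])
```

The same change can be made in the EFS console under the file system's throughput mode settings.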
-
Config:
--max-old-space-size=${memoryLimitMiB} on my Langfuse ECS container; set ALB timeout to +10s.
Problem:
I get abysmal latency in the UI when loading the list of traces for the last 24 hours. Some of the rows won't populate, and loading a single trace takes 15s on average. In AWS I'm not seeing any metric above 17% for RAM or CPU usage.
This issue started happening as we put more traces into the system. We're currently at 6 GB of traces (based on EFS usage).
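For illustration, a minimal sketch of roughly this container setup, assuming AWS CDK in Python; the stack and construct names, image tag, and memory value are placeholders, and the ALB timeout change is omitted:

```python
# Sketch only: assumes aws-cdk-lib and constructs are installed.
from aws_cdk import App, Stack
from aws_cdk import aws_ecs as ecs
from constructs import Construct


class LangfuseStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        memory_limit_mib = 4096  # placeholder container memory limit

        task_def = ecs.FargateTaskDefinition(
            self, "LangfuseTask", cpu=1024, memory_limit_mib=memory_limit_mib
        )
        task_def.add_container(
            "langfuse",
            image=ecs.ContainerImage.from_registry("langfuse/langfuse:latest"),
            memory_limit_mib=memory_limit_mib,
            # Mirror the container memory limit into Node's heap cap, as described above.
            environment={"NODE_OPTIONS": f"--max-old-space-size={memory_limit_mib}"},
        )


app = App()
LangfuseStack(app, "LangfuseStack")
app.synth()
```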