| 1 | +# TSDB Index Lookup Planning |
| 2 | + |
| 3 | +* **Owners:** |
| 4 | + * `@dimitarvdimitrov` |
| 5 | + |
| 6 | +* **Implementation Status:** `Partially implemented` |
| 7 | + |
| 8 | +* **Related Issues and PRs:** |
| 9 | + * [PR #16835: tsdb index: introduce scan matchers](https://github.com/prometheus/prometheus/pull/16835) |
| 10 | + * [Mimir issue #11916: TSDB index lookup planning](https://github.com/grafana/mimir/issues/11916) |
| 11 | + |
| 12 | +* **Other docs or links:** |
| 13 | + * [Store-gateway optimization blog post](https://grafana.com/blog/2023/08/21/less-is-more-how-grafana-mimir-queries-run-faster-and-more-cost-efficiently-with-fewer-indexes/) |
| 14 | + * [Prometheus fast regexp label matcher](https://github.com/grafana/mimir-prometheus/blob/main/model/labels/regexp.go) |
| 15 | + * [Access Path Selection in a Relational Database Management System](https://15799.courses.cs.cmu.edu/spring2025/papers/02-systemr/selinger-sigmod1979.pdf) |
| 16 | + |
| 17 | +> TL;DR: This proposal introduces extension points for TSDB index lookups that allow different execution strategies to address the problem of inefficient index lookup usage. The goal is to provide interfaces that enable downstream projects to implement custom optimization approaches for their specific use cases. |
| 18 | + |
| 19 | +## Why |
| 20 | + |
| 21 | +Prometheus' current index lookup approach creates performance bottlenecks in high-cardinality environments. Two major inefficiencies exist: |
| 22 | + |
| 23 | +1. **Broad matcher inefficiency**: Wide matchers like `namespace != ""` select massive numbers of series, creating significant memory overhead for minimal filtering benefit |
| 24 | +2. **Expensive regex evaluation**: Non-optimizable regex matchers against high-cardinality labels create CPU bottlenecks |
| 25 | + |
| 26 | +Real-world profiling across high-cardinality Mimir deployments shows 34% of CPU time spent on string matching and 20% on posting list iteration. These patterns appear consistently in high-cardinality environments and significantly affect total cost of ownership. |
| 27 | + |
| 28 | +### Pitfalls of the current solution |
| 29 | + |
| 30 | +The current naive approach to index lookups has specific problems: |
| 31 | + |
| 32 | +**Example 1: Broad matcher inefficiency** |
| 33 | +- Query with 5 matchers, including `namespace != ""` |
| 34 | +- Selects union of all series with any namespace value |
| 35 | +- In a 2M series block: 2M series × 8 bytes = 16MB (roughly the equivalent of 16,000 XOR chunks) |
| 36 | +- Other matchers (`job`, `pod`, `container`, metric name) are typically more selective |
| 37 | +- Results in massive memory overhead for minimal filtering benefit |
| 38 | + |
| 39 | +**Example 2: Expensive regex evaluation** |
| 40 | +- Single TSDB block: 1.8M series |
| 41 | +- One label with 220,000 distinct values |
| 42 | +- Non-optimizable regex against high-cardinality label |
| 43 | +- Runs the regex against all 220,000 values to select 2-10 series |
| 44 | +- Shows up as a double-digit percentage of CPU time in profiles, with a correspondingly large allocation footprint |
| 45 | + |
| 46 | +## Goals |
| 47 | + |
| 48 | +* Provide extension points for TSDB index lookups that allow alternative execution strategies |
| 49 | +* Enable downstream projects to implement custom optimization approaches for their specific use cases |
| 50 | +* Support experimentation with different planning algorithms and storage characteristics |
| 51 | +* Allow flexibility in addressing index lookup inefficiencies without changing core TSDB behavior |
| 52 | + |
| 53 | +### Audience |
| 54 | + |
| 55 | +This change primarily targets: |
| 56 | +- High-cardinality Prometheus deployments (>1M series) |
| 57 | +- Downstream projects like Mimir, Thanos, and Cortex that need different optimization strategies |
| 58 | + |
| 59 | +## Non-Goals |
| 60 | + |
| 61 | +* Replace existing regex optimizations |
| 62 | +* Change the core TSDB storage format |
| 63 | +* Provide immediate performance improvements without statistics collection |
| 64 | +* Improve `/api/v1/labels` and `/api/v1/label/{}/values` requests |
| 65 | + |
| 66 | +## How |
| 67 | + |
| 68 | +### Core Approach |
| 69 | + |
| 70 | +Building on the scan matchers foundation from [PR #16835](https://github.com/prometheus/prometheus/pull/16835), this proposal introduces a planning phase that: |
| 71 | + |
| 72 | +1. Allows different execution strategies for each query |
| 73 | +2. Partitions matchers into index-resolved vs series-resolved categories |
| 74 | +3. Executes with lazy evaluation according to the chosen plan |
| 75 | + |
| 76 | +The approach mirrors techniques used by database query planners when choosing between index scans and sequential scans. |
| 77 | + |
| 78 | +### Interface Design |
| 79 | + |
| 80 | +Introduce core planning interfaces that allow downstream projects to implement their own strategies: |
| 81 | + |
| 82 | +```go |
| 83 | +// LookupPlanner plans how to execute index lookups by deciding which matchers |
| 84 | +// to apply during index lookup versus after series retrieval. |
| 85 | +type LookupPlanner interface { |
| 86 | + PlanIndexLookup(ctx context.Context, plan LookupPlan, minT, maxT int64) (LookupPlan, error) |
| 87 | +} |
| 88 | + |
| 89 | +// LookupPlan represents the decision of which matchers to apply during |
| 90 | +// index lookup versus during series scanning. |
| 91 | +type LookupPlan interface { |
| 92 | + // ScanMatchers returns matchers that should be applied during series scanning |
| 93 | + ScanMatchers() []*labels.Matcher |
| 94 | + // IndexMatchers returns matchers that should be applied during index lookup |
| 95 | + IndexMatchers() []*labels.Matcher |
| 96 | +} |
| 97 | +``` |
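To make the split between index matchers and scan matchers concrete, the following is a minimal sketch of how a querier might execute a plan in two phases: resolve the index matchers through postings lists, then check the scan matchers against each candidate series. Everything in it is a simplified stand-in invented for illustration (`Series`, the equality-only `eq` matcher, `buildIndex`, `execute`); none of these names come from Prometheus or PR #16835.

```go
package main

import "fmt"

// Series is a stand-in for a stored series: just its label set.
type Series map[string]string

// eq is a hypothetical equality-only matcher. Real matchers also
// cover !=, =~ and !~.
type eq struct{ Name, Value string }

// Matches treats a missing label as the empty string.
func (m eq) Matches(s Series) bool { return s[m.Name] == m.Value }

// buildIndex maps each label pair to the ascending IDs (slice positions)
// of the series that carry it, i.e. a toy inverted index.
func buildIndex(all []Series) map[eq][]int {
	idx := map[eq][]int{}
	for id, s := range all {
		for n, v := range s {
			k := eq{n, v}
			idx[k] = append(idx[k], id)
		}
	}
	return idx
}

// intersect returns the IDs present in both sorted lists.
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// execute resolves indexMs via postings, then applies scanMs to each
// candidate series. It assumes at least one index matcher is present.
func execute(all []Series, idx map[eq][]int, indexMs, scanMs []eq) []int {
	cand := idx[indexMs[0]]
	for _, m := range indexMs[1:] {
		cand = intersect(cand, idx[m])
	}
	var out []int
	for _, id := range cand {
		keep := true
		for _, m := range scanMs {
			if !m.Matches(all[id]) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	all := []Series{
		{"job": "api", "env": "prod"},
		{"job": "api"},
		{"job": "db"},
	}
	// job="api" is resolved on the index; env="" is checked per series.
	fmt.Println(execute(all, buildIndex(all), []eq{{"job", "api"}}, []eq{{"env", ""}})) // prints "[1]"
}
```

In this toy setup the `env=""` matcher never touches the index: only the two `job="api"` candidates are inspected, rather than computing the set of all series without an `env` label up front.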
| 98 | + |
| 99 | +### Simple Rule-Based Implementation |
| 100 | + |
| 101 | +As a concrete example, [PR #16835](https://github.com/prometheus/prometheus/pull/16835) introduces a `ScanEmptyMatchersLookupPlanner` that implements a simple rule-based approach. This planner identifies matchers that are expensive to apply on the inverted index and usually don't filter any data, deferring them to scan matchers instead. |
| 102 | + |
| 103 | +The rules are: |
| 104 | +- `{label=""}` - converted to scan matcher (expensive index lookup, minimal filtering) |
| 105 | +- `{label=~".+"}` - converted to scan matcher (expensive regex evaluation, broad selection) |
| 106 | +- `{label=~".*"}` - removed entirely (matches everything, including unset values) |
| 107 | + |
| 108 | +This demonstrates how the interface can be used to implement straightforward optimizations without requiring complex cost models or statistics collection. Such simple rule-based planners can provide immediate benefits for well-understood inefficient patterns while serving as building blocks for more sophisticated approaches. |
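The three rules above can be sketched as a single partitioning pass. This is an illustration only, not the code from PR #16835: `MatchType` and `Matcher` are simplified local stand-ins for the types in Prometheus' `model/labels` package so that the example compiles on its own, and `plan` and `planByRules` are hypothetical names.

```go
package main

import "fmt"

// MatchType and Matcher are simplified stand-ins for the corresponding
// types in Prometheus' model/labels package.
type MatchType int

const (
	MatchEqual MatchType = iota
	MatchNotEqual
	MatchRegexp
	MatchNotRegexp
)

type Matcher struct {
	Type  MatchType
	Name  string
	Value string
}

// plan is a hypothetical concrete plan holding the two partitions.
type plan struct {
	index, scan []*Matcher
}

func (p plan) IndexMatchers() []*Matcher { return p.index }
func (p plan) ScanMatchers() []*Matcher  { return p.scan }

// planByRules applies the three rules to a flat list of matchers.
// Not handled here: the case where no index matcher remains, which a
// real planner would have to deal with.
func planByRules(ms []*Matcher) plan {
	var out plan
	for _, m := range ms {
		switch {
		case m.Type == MatchRegexp && m.Value == ".*":
			// Matches every series, including those without the label: drop it.
		case m.Type == MatchEqual && m.Value == "",
			m.Type == MatchRegexp && m.Value == ".+":
			// Expensive on the index and rarely selective: defer to scanning.
			out.scan = append(out.scan, m)
		default:
			out.index = append(out.index, m)
		}
	}
	return out
}

func main() {
	p := planByRules([]*Matcher{
		{MatchEqual, "job", "api"},
		{MatchEqual, "env", ""},
		{MatchRegexp, "pod", ".+"},
		{MatchRegexp, "zone", ".*"},
	})
	fmt.Println(len(p.IndexMatchers()), len(p.ScanMatchers())) // prints "1 2"
}
```

Here only `job="api"` stays on the index, `env=""` and `pod=~".+"` become scan matchers, and `zone=~".*"` disappears from the plan.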
| 109 | + |
| 110 | +## Alternatives |
| 111 | + |
| 112 | +1. **Improve existing regex optimizations**: Continue optimizing the current approach with better regex compilation and caching. This approach has diminishing returns and doesn't address broad matcher inefficiency. |
| 113 | + |
| 114 | +2. **Always use sequential scans**: Skip index lookups entirely and scan all series. This could be simpler but would hurt performance for selective queries. |
| 115 | + |
| 116 | +3. **Static rule-based approach**: Use fixed rules instead of cost-based planning. This would be simpler to implement but cannot capture the nuances that a cost model with cardinality estimates can. However, the current `PostingsForMatchers` implementation already contains some heuristics of this kind, namely those that help regardless of the data. |
| 117 | + |
| 118 | +The proposed approach provides the flexibility to adapt to different workload characteristics while maintaining compatibility with existing optimizations. |
| 119 | + |
| 120 | +## Action Plan |
| 121 | + |
| 122 | +* [ ] Add scan matchers to querier code |
| 123 | +* [ ] Implement basic `LookupPlanner` interface with simple heuristics |
| 124 | +* [ ] Validate approach with real-world high-cardinality workloads |