
API/BUG: .loc tuple ambiguity with MultiIndex when nlevels == ndim #65326

@jbrockmendel

Description


I [the human] asked Claude to post the following:

Summary

When a DataFrame has a MultiIndex with nlevels == ndim (the most common case: a 2-level MI on a 2D DataFrame), tuple keys passed to .loc are fundamentally ambiguous between two interpretations:

  1. MI row key: df.loc[(a, b)] means "the row with MultiIndex key (a, b)"
  2. Multi-axis indexing: df.loc[a, b] means "row a, column b"

Python produces the same tuple (a, b) for both spellings. The current code resolves this ambiguity differently depending on context — getitem vs setitem, existing vs missing key, scalar vs slice — leading to a cluster of related bugs where the same syntax silently produces different results.
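A minimal demonstration of why the receiver cannot tell the spellings apart — `KeyEcho` is a throwaway class for illustration, not anything in pandas:

```python
# Both spellings deliver the identical tuple to __getitem__, so pandas
# cannot distinguish them at the syntax level.
class KeyEcho:
    """Echoes back whatever key __getitem__ receives."""
    def __getitem__(self, key):
        return key

echo = KeyEcho()
assert echo[(1, 2)] == echo[1, 2] == (1, 2)
```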

Related issues

Primary — same root ambiguity

| Issue | Status | Problem |
| --- | --- | --- |
| #14969 | Open | `df.loc[0, 0]` returns a DataFrame or a Series depending on the dtype of the inner MI level — identical syntax, different interpretation |
| #16396 | Open | `df.loc[1, 2]` uses the MI interpretation, but `df.loc[:1, 2]` and `df.loc[:, 2]` switch to multi-axis (column 2) — incoherent |
| #27248 | Open | `df.loc[(1, 2019)] = [3, 4]` fails on setitem-with-expansion because the MI key `(1, 2019)` is misinterpreted as `(row=1, col=2019)` |
| #42603 | Open | `df.loc["foo", 0]` silently returns different things depending on whether `0` exists in MI level 1 — proposes an ambiguity error |
| #19110 | Open | `df.loc[existing_row, new_col] = val` adds a column instead of a (partial) row — missing-label priority is inconsistent with present-label priority |
| #17024 | Open | `df.loc['all'] = [5, 6]` on a MI DataFrame flattens the MI to tupled strings |

See also

| Issue | Status | Notes |
| --- | --- | --- |
| #39775 | Open | KeyError semantics for partially-missing MI keys in slices — adjacent problem about what should happen when MI keys don't exist |
| #16018 | Closed | "cannot reindex from duplicate axis" on MI expansion (fixed in 1.3) |
| #22247 | Closed | MI expansion with NaN level copies wrong values (fixed) |

Current behavior

The resolution rule depends on context

The current code applies different heuristics depending on the operation and key type:

getitem with scalar tuple (_getitem_lowerdim_handle_lowerdim_multi_index_axis0):

  • Try MI key via obj.xs(tup). If found → return row.
  • If KeyError and ndim < len(tup) <= nlevels → re-raise (MI interpretation).
  • If KeyError and len(tup) == nlevels == ndim → raise IndexingError, which is suppressed by the caller so the per-axis loop handles it as multi-axis.

This means getitem silently switches interpretation when a MI key is missing:

```python
mi = pd.MultiIndex.from_tuples([(1, 2), (3, 4)])
df = pd.DataFrame([[10, 20], [30, 40]], index=mi, columns=[100, 200])

df.loc[(1, 2)]    # → Series (MI row key found)
df.loc[(1, 100)]  # → Series (MI key not found → silently becomes row=1, col=100)
```

getitem with slices (_getitem_tuple_same_dim):

  • A tuple containing a slice is handled per-axis rather than as an MI key. This is why df.loc[1, 2] uses the MI interpretation while df.loc[:1, 2] and df.loc[:, 2] switch to multi-axis and select column 2 (#16396).

setitem with scalar tuple (_get_setitem_indexer):

  • The MI interpretation is tried first here too, but when the key is missing the indexer falls back to multi-axis, so setitem-with-expansion misreads an MI key like (1, 2019) as (row=1, col=2019) (#27248, #19110).

The result depends on dtype

Because the fallback to multi-axis only triggers when the MI lookup fails, and whether it finds a match depends on what values are in the MI levels, the same syntax gives different results depending on dtype (#14969):

```python
# String inner level → (0, 0) is NOT in the MI (level 1 is strings),
# so MI lookup fails and multi-axis takes over: partial level-0 key
# on both axes yields a DataFrame.
ind = pd.MultiIndex.from_product([[0, 1], ['A', 'B']])
df = pd.DataFrame(0, index=ind, columns=ind)  # arbitrary fill value
df.loc[0, 0]  # → DataFrame

# Int inner level → (0, 0) IS in the MI, so MI row lookup succeeds
# and returns that single row as a Series (indexed by the column MI).
ind = pd.MultiIndex.from_product([[0, 1], [0, 1]])
df = pd.DataFrame(0, index=ind, columns=ind)  # arbitrary fill value
df.loc[0, 0]  # → Series
```

Cases that already work

The ambiguity only exists when nlevels == ndim. These cases work correctly:

  • Series with MI (ndim=1): any tuple with len > 1 exceeds ndim, so _convert_tuple raises IndexingError("Too many indexers") and the fallthrough correctly handles MI keys.
  • 3+-level MI on DataFrame (nlevels > ndim=2): same mechanism — _convert_tuple rejects the tuple and the fallthrough handles it.
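For contrast, a quick sketch of the unambiguous Series case described above:

```python
import pandas as pd

# On a Series (ndim=1), a tuple key of length 2 exceeds ndim, so it cannot
# be multi-axis indexing — it is unambiguously an MI row key.
mi = pd.MultiIndex.from_tuples([(1, 2), (3, 4)])
s = pd.Series([10, 30], index=mi)
assert s.loc[(1, 2)] == 10
```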

Why heuristics don't work

I explored several heuristic approaches to disambiguate setitem when nlevels == ndim:

  1. Check if last element is an existing column: Fails for df.loc[partial_key, new_col] = val (simultaneous row + column expansion).
  2. Check type compatibility of each element with its MI level: Fails when MI levels and columns share the same dtype (e.g., int/int MI + int columns).
  3. Combined type + column-dtype check: Handles many cases but is fundamentally a guess — there exist DataFrames where both interpretations are type-valid.
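A constructed example (not from any of the linked issues) of why the type-based heuristics fail: with integer MI levels and integer columns, both readings of the same tuple are type-valid, so no per-element check can choose between them.

```python
import pandas as pd

# Integer MI levels AND integer column labels.
mi = pd.MultiIndex.from_product([[1, 2], [100, 200]])
df = pd.DataFrame(0, index=mi, columns=[100, 200])

# (1, 100) is simultaneously a valid MI row key and a valid
# (row=1, col=100) multi-axis pair — the heuristic is a guess.
assert (1, 100) in df.index
assert 1 in df.index.get_level_values(0) and 100 in df.columns
```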

The ambiguity is inherent to the API, not a deficiency in the implementation.

The right API

I think the right long-term answer is: .loc should not guess. When nlevels == ndim, there should be one canonical interpretation, and the user should use explicit syntax for the other.

Option A: MI key always wins

.loc[(a, b)] on a 2-level MI always means MI row key. Multi-axis indexing requires the explicit df.loc[(a, b), :] form (or partial keys like df.loc[a, col] when a is clearly a single level-0 value).

Pros: Consistent with getitem priority (MI key is tried first). Consistent with Series. The multi-axis form df.loc[..., :] is explicit and readable.

Cons: Behavior change for df.loc[scalar, col] when the scalar matches level 0 and col matches a column.

Option B: Multi-axis always wins

.loc[(a, b)] on a 2-level MI always means (row=a, col=b). MI row key indexing requires df.loc[pd.IndexSlice[a, b]] or df.loc[(a, b), :].

Pros: No behavior change for existing multi-axis code. Consistent with how _convert_tuple already works.

Cons: Breaks df.loc[(a, b)] for MI row access, which is arguably the more natural reading. Inconsistent with Series and with the nlevels > ndim case.

Option C: Raise on ambiguity

When nlevels == ndim and the key is a plain tuple, raise an error (as suggested in #42603) requiring the user to be explicit. Problem: Python offers no way to distinguish df.loc[(a, b)] from df.loc[a, b], so we'd have to reject all plain tuple keys when nlevels == ndim, which is extremely disruptive.

My recommendation: Option A

The key observation is that the MI-key interpretation is already the first thing tried in both getitem and setitem — the multi-axis path is always a fallback. Making the MI-key interpretation authoritative (rather than a try/except heuristic) would make the behavior predictable and consistent across getitem, setitem, scalars, and slices.

Users who need multi-axis indexing on a 2-level MI DataFrame can write df.loc[(a, b), :] or use pd.IndexSlice.
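Both explicit spellings already work unambiguously in current pandas:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 2), (3, 4)])
df = pd.DataFrame([[10, 20], [30, 40]], index=mi, columns=[100, 200])

# Explicit multi-axis form: MI row key on axis 0, all columns.
row = df.loc[(1, 2), :]                 # Series indexed by [100, 200]
same = df.loc[pd.IndexSlice[1, 2], :]   # equivalent via IndexSlice
assert list(row) == [10, 20]
assert row.equals(same)
```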

How to get there

  1. Short term: Document the ambiguity and the df.loc[(a, b), :] workaround in the MultiIndex docs.
  2. Medium term: Add a FutureWarning when the ambiguous case is hit — when nlevels == ndim, MI key lookup fails, and the code is about to fall back to multi-axis. The warning should suggest explicit syntax.
  3. Long term: Change the default to MI key interpretation when nlevels == ndim, removing the multi-axis fallback.
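A rough sketch of the medium-term warning condition — the helper name, message text, and fallback logic are illustrative assumptions, not actual pandas internals:

```python
import warnings
import pandas as pd

def resolve_loc_tuple(df, key):
    """Hypothetical helper mimicking the proposed resolution order."""
    if isinstance(key, tuple) and df.index.nlevels == df.ndim == len(key):
        try:
            return df.xs(key)  # MI-key interpretation is tried first
        except KeyError:
            warnings.warn(
                "Ambiguous tuple key fell back to multi-axis indexing; "
                "spell the MI key as df.loc[key, :] to be explicit.",
                FutureWarning,
            )
            return df.loc[key[0]].loc[:, key[1]]  # multi-axis fallback
    return df.loc[key]

mi = pd.MultiIndex.from_tuples([(1, 2), (3, 4)])
df = pd.DataFrame([[10, 20], [30, 40]], index=mi, columns=[100, 200])
resolve_loc_tuple(df, (1, 2))    # MI key found: no warning
resolve_loc_tuple(df, (1, 100))  # warns, then falls back to multi-axis
```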

This would resolve #27248, #19110, #16396, #42603, and #14969. #17024 (MI flattening on scalar expansion) is a separate sub-bug in the expansion machinery.

Metadata



    Labels

    API Design, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
