[record_use] Implementation: A pool of References #3157

@dcharkes

Current State and Limitations

Currently, the record_use package models recorded usages by mapping a Definition to a list of Reference objects (e.g., CallWithArguments, InstanceCreationReference). Each Reference object contains both the semantic details of the usage (the receiver, the arguments) and the physical locations where that usage was observed (a list of LoadingUnits).

// Current Model (Simplified)
class Recordings {
  final Map<Definition, List<CallReference>> calls;
  final Map<Definition, List<InstanceReference>> instances;
}

sealed class CallReference {
  final List<LoadingUnit> loadingUnits; // The "Where"
  final MaybeConstant? receiver;        // The "What"
  // arguments, etc.
}

This architecture has several limitations:

  1. Inefficient Canonicalization and Merging: Because loadingUnits are part of the Reference object's identity (its operator == and hashCode), two semantically identical calls made from different loading units are treated as distinct, unequal objects. This makes it difficult to deduplicate identical calls and merge their loading unit lists.
  2. Redundant JSON Serialization: If a common function like print('hello') is called from 50 different loading units, the JSON output will repeat the call signature 50 times under the definition's uses list, each with a different loading unit.
  3. Inconsistent Nesting: InstanceConstantReference currently holds a pointer to its Definition (because it needs it to be self-describing when nested inside arguments), whereas top-level CallWithArguments objects do not hold their Definition (they rely on being values in the Recordings.calls map). This makes the Reference hierarchy irregular.
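
Limitation 1 can be reproduced in a few lines. This is a minimal sketch, not the package's code: the `CallReference` below is a hypothetical stand-in in which plain strings replace `MaybeConstant` and `LoadingUnit`, and `_listEquals` is a local helper.

```dart
// Hypothetical, simplified stand-in for the current model.
// Strings replace the real MaybeConstant and LoadingUnit types.
bool _listEquals(List<String> a, List<String> b) {
  if (a.length != b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] != b[i]) return false;
  }
  return true;
}

class CallReference {
  final String call; // The "What"
  final List<String> loadingUnits; // The "Where"

  CallReference(this.call, this.loadingUnits);

  // Locations participate in identity, as described in limitation 1.
  @override
  bool operator ==(Object other) =>
      other is CallReference &&
      call == other.call &&
      _listEquals(loadingUnits, other.loadingUnits);

  @override
  int get hashCode => Object.hash(call, Object.hashAll(loadingUnits));
}

void main() {
  final fromLib1 = CallReference("print('hello')", ['lib1']);
  final fromLib2 = CallReference("print('hello')", ['lib2']);

  // Semantically the same call, yet unequal, so a Set keeps both.
  print(fromLib1 == fromLib2); // false
  print({fromLib1, fromLib2}.length); // 2
}
```

Deduplicating such objects needs a second, location-free comparison, which is the extra work the proposal aims to remove.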

Proposed Architecture

To solve these issues, we propose fully normalizing the data model by decoupling the semantic content of a reference from its occurrence locations, and making references fully self-describing.

The new model introduces three distinct layers:

  1. Definition: The target of the reference (e.g., a specific method or class).
  2. Reference (The "What"): A purely semantic, value-typed object. It contains the Definition being referenced, the receiver, and the arguments. It knows nothing about where it was called. Because it is purely content-based, identical calls will hash to the same value and can be canonicalized into a single instance.
  3. ReferenceOccurrence (The "Where"): A new container object that pairs a canonicalized Reference with the set of LoadingUnits where it was observed.

// Proposed Model (Simplified)

// 1. The "What" (Fully self-contained, no locations)
sealed class Reference {
  final Definition definition;
}

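// Intermediate sealed type, so that Recordings.calls below is well-typed.
// InstanceReference follows the same pattern.
sealed class CallReference extends Reference {}

final class CallWithArguments extends CallReference {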
  final MaybeConstant? receiver;
  final List<MaybeConstant> positionalArguments;
  // ...
}

// 2. The "Where"
class ReferenceOccurrence<T extends Reference> {
  final T reference;
  final Set<LoadingUnit> loadingUnits;
}

// 3. The Top-Level Container
class Recordings {
  final List<ReferenceOccurrence<CallReference>> calls;
  final List<ReferenceOccurrence<InstanceReference>> instances;
}

JSON Representation (Normalized Pools)

This architectural shift allows us to introduce a references pool in the JSON format, analogous to the existing constants and definitions pools.

{
  "loading_units": [ {"name": "lib1"}, {"name": "lib2"} ],
  "definitions": [ {"uri": "...", "path": ["print"]} ],
  "constants": [ {"type": "string", "value": "hello"} ],
  
  // NEW: A deduplicated pool of all distinct reference signatures
  "references": [
    {
      "type": "call_with_arguments",
      "definition_index": 0,
      "positional": [0] // points to "hello" constant
    }
  ],
  
  // UPDATED: 'uses' simply maps a canonical reference to its locations
  "uses": {
    "static_calls": [
      {
        "reference_index": 0,
        "loading_unit_indices": [0, 1] // Called from lib1 and lib2
      }
    ]
  }
}
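
To illustrate how a reader of this format would follow the indices, here is a rough sketch that resolves each entry in `uses.static_calls` back to a readable line. `describeStaticCalls` is a hypothetical helper, not part of the package, and the input is the example above with the comments removed so that it is valid JSON.

```dart
import 'dart:convert';

// The example above, with comments stripped so it parses as JSON.
const exampleJson = '''
{
  "loading_units": [{"name": "lib1"}, {"name": "lib2"}],
  "definitions": [{"uri": "...", "path": ["print"]}],
  "constants": [{"type": "string", "value": "hello"}],
  "references": [
    {"type": "call_with_arguments", "definition_index": 0, "positional": [0]}
  ],
  "uses": {
    "static_calls": [
      {"reference_index": 0, "loading_unit_indices": [0, 1]}
    ]
  }
}
''';

/// Hypothetical helper: follows the pool indices of every static call.
List<String> describeStaticCalls(String source) {
  final json = jsonDecode(source) as Map<String, dynamic>;
  final units = json['loading_units'] as List;
  final definitions = json['definitions'] as List;
  final constants = json['constants'] as List;
  final references = json['references'] as List;
  final uses = json['uses'] as Map<String, dynamic>;

  final lines = <String>[];
  for (final use in uses['static_calls'] as List) {
    // Occurrence -> Reference -> Definition and constants.
    final reference = references[use['reference_index'] as int];
    final definition = definitions[reference['definition_index'] as int];
    final name = (definition['path'] as List).join('.');
    final args = [
      for (final i in reference['positional'] as List)
        constants[i as int]['value'],
    ].join(', ');
    final where = [
      for (final i in use['loading_unit_indices'] as List)
        units[i as int]['name'],
    ].join(', ');
    lines.add('$name($args) in [$where]');
  }
  return lines;
}

void main() {
  describeStaticCalls(exampleJson).forEach(print);
  // print(hello) in [lib1, lib2]
}
```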

Impact Analysis

Pros

  1. Trivial Deduplication: Canonicalizing the Recordings object becomes much simpler. You index Reference objects by their content. If a new occurrence has an identical Reference, you simply merge its loadingUnits into the existing ReferenceOccurrence.
  2. Consistent Domain Model: Every Reference is now self-contained (it knows its Definition) and acts as a true value type, resolving the inconsistency between top-level calls and nested instance constants.
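
The merge step described in the first point could look roughly like the sketch below. The types are simplified assumptions: strings stand in for `Definition`, `MaybeConstant`, and `LoadingUnit`, and `canonicalize` and `_listEquals` are hypothetical names.

```dart
// Hypothetical, simplified types for illustration only.
bool _listEquals(List<String> a, List<String> b) {
  if (a.length != b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] != b[i]) return false;
  }
  return true;
}

class CallWithArguments {
  final String definition;
  final List<String> positionalArguments;

  CallWithArguments(this.definition, this.positionalArguments);

  // Content-based identity: loading units play no part.
  @override
  bool operator ==(Object other) =>
      other is CallWithArguments &&
      definition == other.definition &&
      _listEquals(positionalArguments, other.positionalArguments);

  @override
  int get hashCode =>
      Object.hash(definition, Object.hashAll(positionalArguments));
}

class ReferenceOccurrence {
  final CallWithArguments reference;
  final Set<String> loadingUnits;

  ReferenceOccurrence(this.reference, this.loadingUnits);
}

/// Merges occurrences whose references are equal by content.
List<ReferenceOccurrence> canonicalize(Iterable<ReferenceOccurrence> input) {
  final byReference = <CallWithArguments, ReferenceOccurrence>{};
  for (final occurrence in input) {
    byReference
        .putIfAbsent(occurrence.reference,
            () => ReferenceOccurrence(occurrence.reference, <String>{}))
        .loadingUnits
        .addAll(occurrence.loadingUnits);
  }
  return byReference.values.toList();
}

void main() {
  final merged = canonicalize([
    ReferenceOccurrence(CallWithArguments('print', ['hello']), {'lib1'}),
    ReferenceOccurrence(CallWithArguments('print', ['hello']), {'lib2'}),
  ]);
  print(merged.length); // 1
  print(merged.single.loadingUnits); // {lib1, lib2}
}
```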

I think the consistent domain model is the strongest argument here. It also matches how the various backends work: we first process all definitions, then all references, and only at the very end do we figure out which loading units those references end up in.

Cons

  1. Increased Indirection: Navigating the object graph requires moving through an additional layer (Occurrence -> Reference -> Arguments).

We will definitely want a different user-facing API: at least some operator [] methods on Recordings, or a completely separate API. Its implementation will need some lookup maps.
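
One possible shape for such an accessor is sketched below. Nothing here is a decided API: the `operator []` signature, the `_callsByDefinition` map, and the string stand-ins for `Definition` and `LoadingUnit` are all assumptions.

```dart
// Hypothetical sketch; Strings stand in for Definition and LoadingUnit.
class Reference {
  final String definition;
  final List<String> positionalArguments;

  const Reference(this.definition, this.positionalArguments);
}

class ReferenceOccurrence {
  final Reference reference;
  final Set<String> loadingUnits;

  const ReferenceOccurrence(this.reference, this.loadingUnits);
}

class Recordings {
  final List<ReferenceOccurrence> calls;

  Recordings(this.calls);

  // Lookup map, built lazily on first access from the flat list.
  late final Map<String, List<ReferenceOccurrence>> _callsByDefinition = () {
    final result = <String, List<ReferenceOccurrence>>{};
    for (final occurrence in calls) {
      (result[occurrence.reference.definition] ??= []).add(occurrence);
    }
    return result;
  }();

  /// All recorded calls to [definition], or an empty list if none exist.
  List<ReferenceOccurrence> operator [](String definition) =>
      _callsByDefinition[definition] ?? const [];
}

void main() {
  final recordings = Recordings([
    ReferenceOccurrence(Reference('print', ['hello']), {'lib1', 'lib2'}),
    ReferenceOccurrence(Reference('print', ['bye']), {'lib2'}),
    ReferenceOccurrence(Reference('log', ['hello']), {'lib1'}),
  ]);
  print(recordings['print'].length); // 2
  print(recordings['missing'].isEmpty); // true
}
```

This hides the Occurrence -> Reference hop for the common lookup of uses by definition, while the stored model stays a flat list.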

I'm not immediately planning to do this refactoring. I'm noting this as a domain model cleanup that would make other things easier. cc @goderbauer @biggs0125
