@eyalleshem eyalleshem commented Dec 1, 2025

This PR adds a lifetime parameter to the Token enum to enable zero-copy tokenization in the future.

Backward Compatibility:
To maintain backward compatibility, we renamed Token to BorrowedToken<'a> and introduced Token as a type alias for BorrowedToken<'static>. This allows existing consumers of the library to continue using Token without needing to handle lifetimes throughout their code.
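
As a rough illustration of the alias approach, here is a minimal sketch. The variants `Word` and `Comma` and the `describe` function are hypothetical stand-ins, not the library's real definitions; the point is only that code written against `Token` compiles without mentioning a lifetime.

```rust
use std::borrow::Cow;

/// Simplified stand-in for the tokenizer's token type.
/// The variants here are hypothetical, chosen only for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum BorrowedToken<'a> {
    Word(Cow<'a, str>),
    Comma,
}

/// Existing code keeps using the name `Token` and never sees a lifetime.
pub type Token = BorrowedToken<'static>;

/// A consumer function written in the pre-lifetime style.
pub fn describe(token: &Token) -> String {
    // Variants are reachable through the alias as well as the enum name.
    match token {
        Token::Word(w) => format!("word:{w}"),
        Token::Comma => "comma".to_string(),
    }
}
```

Because the alias pins the lifetime to `'static`, signatures such as `fn describe(token: &Token)` need no changes.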

New API:
This PR also adds a tokenized_owned() method for use cases where consumers prefer to pay the cost of copying in exchange for owned tokens.
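
One way an owned variant of tokenization could be layered on top of a borrowing one is sketched below. Everything here is an assumption made for illustration: `tokenize_words`, `tokenize_words_owned`, and `into_owned` are hypothetical names, and the whitespace splitting is a toy, not the real tokenizer.

```rust
use std::borrow::Cow;

/// Simplified stand-in token type (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
pub enum BorrowedToken<'a> {
    Word(Cow<'a, str>),
    Comma,
}

impl<'a> BorrowedToken<'a> {
    /// Copy any borrowed text so the token no longer depends on the input.
    pub fn into_owned(self) -> BorrowedToken<'static> {
        match self {
            BorrowedToken::Word(w) => BorrowedToken::Word(Cow::Owned(w.into_owned())),
            BorrowedToken::Comma => BorrowedToken::Comma,
        }
    }
}

/// Toy borrowing tokenizer: every token points into `sql`.
pub fn tokenize_words(sql: &str) -> Vec<BorrowedToken<'_>> {
    sql.split_whitespace()
        .map(|w| BorrowedToken::Word(Cow::Borrowed(w)))
        .collect()
}

/// Toy owned tokenizer: pays for one copy per token, outlives `sql`.
pub fn tokenize_words_owned(sql: &str) -> Vec<BorrowedToken<'static>> {
    tokenize_words(sql)
        .into_iter()
        .map(|t| t.into_owned())
        .collect()
}
```

The owned form is the convenient choice when tokens must outlive the query string, for example when the query is a temporary `String`.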

Current State:
This commit does not yet change the tokenizer's behavior: all string allocations remain in place. The goal of the following commits is to replace String with Cow<'a, str> in as many places as possible, leveraging the Borrowed variant to achieve zero-copy tokenization where feasible.

eyalleshem and others added 4 commits November 29, 2025 23:50
  Add internal _borrowed() functions that return Cow<'a, str> to prepare for
  zero-copy tokenization. When the source string needs no transformation
  (no escaping), return Cow::Borrowed. When transformation is required,
  return Cow::Owned.

  The Token enum still uses String, so borrowed values are converted via
  to_owned() for now. This maintains API compatibility while preparing the
  codebase for a future refactor where Token can hold borrowed strings.

  Optimized: comments, quoted strings, dollar-quoted strings, quoted identifiers.
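
The borrowed/owned split described in this commit can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, and the only escape handled is the SQL convention that `''` inside a single-quoted string stands for one literal quote.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode the body of a single-quoted SQL string.
/// Returns a borrowed slice when nothing needs rewriting.
pub fn unescape_quoted_borrowed(body: &str) -> Cow<'_, str> {
    if !body.contains("''") {
        // No escape sequences: hand back the input without allocating.
        return Cow::Borrowed(body);
    }
    // At least one escaped quote: a new String is unavoidable.
    Cow::Owned(body.replace("''", "'"))
}

/// Compatibility wrapper that always returns a String, mirroring the
/// "convert to an owned value for now" step.
pub fn unescape_quoted(body: &str) -> String {
    unescape_quoted_borrowed(body).into_owned()
}
```

Callers that still expect `String` go through the wrapper and see no change, while the `_borrowed` form is ready for a token type that can hold a `Cow`.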

Key points for this commit:
- The Peekable iterator alone isn't sufficient for producing string slices,
  because slicing needs the byte indexes (start/end), so the current
  byte position was added to the State struct
  (Note: in the long term we could potentially remove Peekable and iterate
  using only the current position)
- Created internal functions that create slices from the original query
  instead of allocating strings, then wrapped these functions to return
  String to maintain compatibility (the idea is to make a small, reviewable
  commit without changing the Token enum or the parser)
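
The byte-position idea above can be sketched as follows. This is a reduced, hypothetical `State`: the field and method names are assumptions, and the real struct likely carries more (such as line and column tracking).

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical, reduced tokenizer state.
pub struct State<'a> {
    source: &'a str,
    peekable: Peekable<Chars<'a>>,
    /// Byte offset of the next unread character in `source`.
    byte_pos: usize,
}

impl<'a> State<'a> {
    pub fn new(source: &'a str) -> Self {
        State { source, peekable: source.chars().peekable(), byte_pos: 0 }
    }

    /// Consume one character and keep the byte offset in sync.
    pub fn advance(&mut self) -> Option<char> {
        let ch = self.peekable.next()?;
        // Characters can be 1 to 4 bytes long in UTF-8.
        self.byte_pos += ch.len_utf8();
        Some(ch)
    }

    /// Consume characters while `pred` holds and return them as a slice
    /// of the original input, with no allocation.
    pub fn take_while_slice(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let start = self.byte_pos;
        while let Some(&ch) = self.peekable.peek() {
            if !pred(ch) {
                break;
            }
            self.advance();
        }
        let source = self.source;
        &source[start..self.byte_pos]
    }
}
```

Counting with `len_utf8()` rather than adding 1 per character keeps the offsets on valid boundaries for multi-byte input.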

Changed all parsing methods to take '&self' instead of '&mut self'.
Mutable parser state (token index and parser state) now relies on
interior mutability.

This refactoring is preparation for the borrowed tokenizer work. When
holding borrowed tokens from the parser (with lifetime tied to '&self'),
we cannot call methods requiring '&mut self' due to Rust's borrowing
rules. Using interior mutability resolves this conflict by allowing
state mutations through shared references.
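
To make the borrowing conflict concrete, here is a minimal sketch using `std::cell::Cell` for the token index. The choice of `Cell` is an assumption made for this example (the commit message does not say which type is used), and `Parser` here is a toy with hypothetical fields.

```rust
use std::cell::Cell;

/// Toy parser that owns its tokens and tracks a cursor.
pub struct Parser {
    tokens: Vec<String>,
    /// Cursor stored in a Cell so it can change through `&self`.
    index: Cell<usize>,
}

impl Parser {
    pub fn new(tokens: Vec<String>) -> Self {
        Parser { tokens, index: Cell::new(0) }
    }

    /// Advance the cursor through a shared reference and return a token
    /// borrowed from the parser.
    pub fn next_token(&self) -> Option<&str> {
        let i = self.index.get();
        let tok = self.tokens.get(i)?;
        self.index.set(i + 1);
        Some(tok.as_str())
    }
}
```

With this shape, a caller can hold the result of one `next_token()` call while making another. If `next_token` took `&mut self`, the first returned reference would keep the parser mutably borrowed and the second call would be rejected by the borrow checker.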

  This change introduces a lifetime parameter 'a to the BorrowedToken enum
  to prepare for zero-copy tokenization support. This is a foundational
  step toward reducing memory allocations during SQL parsing.

  Changes:
  - Added lifetime parameter to BorrowedToken<'a> enum
  - Added _Phantom(Cow<'a, str>) variant to carry the lifetime
  - Implemented Visit and VisitMut traits for Cow<'a, str> to support
    the visitor pattern with the new lifetime parameter
  - Fixed lifetime issues in visitor tests by using tokenized_owned()
    instead of tokenize() where owned tokens are required
  - Type alias Token = BorrowedToken<'static> maintains backward
    compatibility
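
The role of the placeholder variant can be sketched as follows. Rust rejects a type whose lifetime parameter is never used (error E0392), so while the real variants still hold `String`, something has to mention `'a`. The variants `Word` and `Comma` and the `into_static` method are hypothetical; only `_Phantom(Cow<'a, str>)` and the alias come from the change list above.

```rust
use std::borrow::Cow;

/// Simplified stand-in: payloads are still owned Strings.
#[derive(Debug, Clone, PartialEq)]
pub enum BorrowedToken<'a> {
    Word(String),
    Comma,
    /// Exists only so that `'a` is used; without it the enum would not
    /// compile, because no other variant mentions the lifetime.
    #[doc(hidden)]
    _Phantom(Cow<'a, str>),
}

pub type Token = BorrowedToken<'static>;

impl<'a> BorrowedToken<'a> {
    /// Hypothetical conversion to the `'static` alias.
    pub fn into_static(self) -> Token {
        match self {
            // Owned payloads move across unchanged.
            BorrowedToken::Word(s) => BorrowedToken::Word(s),
            BorrowedToken::Comma => BorrowedToken::Comma,
            // Only the lifetime-carrying variant needs a copy.
            BorrowedToken::_Phantom(c) => BorrowedToken::_Phantom(Cow::Owned(c.into_owned())),
        }
    }
}
```

Once real variants hold `Cow<'a, str>` themselves, a placeholder of this kind would no longer be needed to carry the lifetime.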