Add Chars::single(&self) -> Option<char> for getting exactly one character #576

Closed
wrzian opened this issue Apr 20, 2025 · 5 comments
Labels: api-change-proposal, T-libs-api

wrzian commented Apr 20, 2025

Proposal

Problem statement

In Typst, we occasionally use the pattern below to determine whether a string contains exactly one character. The task is conceptually simple, but the code is overcomplicated and confusing for beginners, and I think it deserves a solution in the standard library.

My first time reading this example I actually went to ChatGPT for an explanation and subsequently thought, "oh, I bet I can simplify this", but wasted my time as I found that the existing code is effectively optimal if using Rust combinator idioms.

Motivating examples or use cases

Simplified—but morally equivalent—code from the Typst parser:

fn math_class(text: &str) -> Option<MathClass> {
    let mut chars = text.chars();
    chars
        .next()
        .filter(|_| chars.next().is_none())
        .and_then(unicode_math_class)
}

(Note: We can't write .filter(|_| chars.is_empty()) because Chars doesn't implement ExactSizeIterator.)

This function is used to determine a string's unicode math class, which is used by the parser to make a token act as an opening or closing delimiter in math text. For example, the math text $[0, infinity)$ creates a delimiter element for typesetting—which will grow to surround its contents—by matching the square bracket and parenthesis using their unicode class designations.

The Unicode math class is only defined for single codepoints, so the code verifies that the string has at least one character by calling chars.next(), then ensures the string has at most one character by checking that a second call to chars.next() (now the second element in the iterator) returns None.
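To make the pattern concrete, here is a minimal runnable sketch of the example above. The `MathClass` enum and the `unicode_math_class` function are hypothetical stand-ins written only for this illustration; the real ones used by Typst cover far more characters.

```rust
// Hypothetical stand-ins for illustration only; the real `MathClass` and
// `unicode_math_class` used by Typst are much more complete.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MathClass {
    Opening,
    Closing,
}

fn unicode_math_class(c: char) -> Option<MathClass> {
    match c {
        '(' | '[' => Some(MathClass::Opening),
        ')' | ']' => Some(MathClass::Closing),
        _ => None,
    }
}

fn math_class(text: &str) -> Option<MathClass> {
    let mut chars = text.chars();
    chars
        .next() // at least one char
        .filter(|_| chars.next().is_none()) // and no second char
        .and_then(unicode_math_class)
}
```

With these stand-ins, `math_class("[")` is `Some(MathClass::Opening)`, while `math_class("")` and `math_class("[)")` are both `None`.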

Part of the complexity here comes from the need to create the mutable chars variable to call .next() twice, forcing the user to type two Rust statements for something conceptually simple. However, rewriting this as a single method chain gives the following options, and I think we can agree that these are more confusing than before:

// One of `fold` or `reduce` is necessary to reason about two chars together.
// With `reduce`:
fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .take(2) // avoid iterating over more than two items in fold/reduce
        .map(Some)
        .reduce(|_, _| None)
        .flatten()
        .and_then(unicode_math_class)
}
// With `fold`:
fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .take(2)
        .fold(None, |prev, c| prev.xor(Some(c)))
        .and_then(unicode_math_class)
}
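As a sanity check, the three formulations can be compared with the math-class lookup stripped out. This is only a sketch; the function names below are made up for the comparison.

```rust
/// Baseline: call `next()` twice on a mutable iterator.
fn single_next_twice(text: &str) -> Option<char> {
    let mut chars = text.chars();
    chars.next().filter(|_| chars.next().is_none())
}

/// `reduce` variant: any second item collapses the accumulator to `None`.
fn single_reduce(text: &str) -> Option<char> {
    text.chars()
        .take(2)
        .map(Some)
        .reduce(|_, _| None) // yields Option<Option<char>>
        .flatten()
}

/// `fold` variant: `xor` keeps a value only if exactly one side is `Some`.
fn single_fold(text: &str) -> Option<char> {
    text.chars()
        .take(2)
        .fold(None, |prev, c| prev.xor(Some(c)))
}
```

One detail worth noting: in the `fold` variant, `take(2)` is needed for correctness and not just for speed. Without it, a three-character string would toggle the accumulator back to `Some` of the third character. The `reduce` variant stays at `None` once it gets there.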

This is not the only example, and the need for this pattern pops up throughout the compiler when we operate on strings that may have typesetting properties unique to single codepoints.

Also see the existing uses of itertools::exactly_one below.

Solution sketch

My preferred solution would be to add a method: Chars::single(&self) -> Option<char> which would allow the following code:

fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .single()
        .and_then(unicode_math_class)
}

I've written a draft implementation of this and would like to publish it as a PR if the proposal is accepted.

My draft implementation
// library/core/src/str/iter.rs::8
use crate::char::MAX_LEN_UTF8;

// library/core/src/str/iter.rs::138
impl<'a> Chars<'a> {
    /// ...
    pub fn as_str(&self) -> &'a str { ... }

    /// Returns a single [`char`] if exactly one is present in the iterator.
    ///
    /// Returns `None` both when the iterator is empty and when the iterator has
    /// more than one element.
    ///
    /// # Examples
    /// ```
    /// #![feature(chars_single)]
    ///
    /// assert_eq!("".chars().single(), None);
    /// assert_eq!("1".chars().single(), Some('1'));
    /// assert_eq!("12".chars().single(), None);
    /// ```
    #[unstable(feature = "chars_single", issue = "none")] // TODO: Add issue
    #[must_use]
    #[inline]
    pub fn single(&self) -> Option<char> {
        if self.iter.len() > MAX_LEN_UTF8 {
            // I have not measured it, but this early return should be fine to
            // keep and can be removed later if actually slow.
            return None;
        }
        let mut dup = self.clone();
        let first = dup.next();
        if dup.iter.is_empty() { first } else { None }
    }
}

Here is my rationale for some of the minutiae of this method, before discussing the main alternatives:

  • Why a method on Chars as opposed to a method on str, e.g. str::single_char(&self) -> Option<char>?

    1. Chars is already where you go to get individual characters generally. Putting the method on str would make the two compete for attention, while adding it to Chars keeps this functionality in one place and makes it more discoverable.
    2. Chars is inherently plural, so the name single unambiguously signals that the method checks for exactly one char.
  • Why not return Result<char, Option<(char, char)>> (or an isomorphic enum)?

    1. We don't need to call .next() twice to check if there is exactly one character. The internal iterator of Chars implements ExactSizeIterator, so we can check .is_empty() after the first character. See my draft implementation.
    2. The use of the existing pattern in Typst is to distinguish whether a string contains exactly one character for typesetting or for checking a property defined on single codepoints. In typesetting, strings with multiple characters generally need to be handled as grapheme clusters of potentially more than two characters, so returning two characters in the Err variant isn't helpful. And for properties defined on single characters, there's no meaningful value if the string has more than one character, so None is expected.
    3. Making use of a Result isn't relevant for the combinator methods in the motivating example. To actually unpack the values, you would need to match on the result, which would be creating a full statement, leaving the world of combinator methods. This would be similarly ergonomic to let mut chars = ...; match (chars.next(), chars.next()) { ... }.
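The idea of checking for emptiness after the first character can be approximated on stable Rust without access to the private inner iterator. The sketch below assumes that `Chars::as_str` is an acceptable substitute for `self.iter.is_empty()`. The free function `single_char` and the literal `4` (the maximum UTF-8 length of a `char`) are choices made for this illustration, not part of the draft.

```rust
/// Rough stable-Rust analogue of the draft `Chars::single`, written as a
/// free function for illustration.
fn single_char(text: &str) -> Option<char> {
    // A char occupies at most 4 bytes in UTF-8, so any longer string
    // must contain more than one char.
    if text.len() > 4 {
        return None;
    }
    let mut chars = text.chars();
    let first = chars.next()?;
    // `as_str` exposes the undecoded remainder; if it is empty there is
    // no second char, and nothing further had to be decoded to learn that.
    chars.as_str().is_empty().then_some(first)
}
```

As in the draft, only the first character is ever decoded; the second check is a length comparison on the remaining bytes.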

Alternatives

The major alternative would be a method on Iterator. Indeed, Iterator::single was already proposed and turned down because of uncertainty in the exact API guarantees and its need to be present on such a core trait.

Instead, the method was added to itertools as exactly_one(self) -> Result<Self::Item, ExactlyOneError>.

Two notable comments on the PR:

  • From scottmcm:

    I'm really not a fan of making "0 elements" and "more than one element" indistinguishable. I feel like the whole point of single is that it's a coding error for there to be multiple -- if it's going to be handled, one would just use .next() instead.

  • From alexcrichton:

    My personal gut reaction to this method is that it's not worth it to add to the standard library, so I don't really have many opinions about the precise pieces here. If signatures like Result<T, Option<[T; 2]>> are being considered though that sounds like this belongs not in libstd.

I think these API critiques in the issue are fair: In the general iterator case, you will need to call .next() two times to check for there being exactly one item, and doing so may or may not consume resources or cause side-effects. Leaving this choice to user code or external libraries may be for the best when we cannot make guarantees against side-effects. However, we can make these guarantees in the specific case of Chars.
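The difference can be illustrated with a small sketch. The generic `exactly_one` below is a simplified stand-in returning `Option` (not the itertools signature), and `items_pulled` and `peek_single` are hypothetical helpers that exist only to make the cost visible.

```rust
use std::cell::Cell;
use std::str::Chars;

/// Generic "exactly one" check: it has to pull a second item to confirm
/// that the iterator is exhausted.
fn exactly_one<I: Iterator>(mut iter: I) -> Option<I::Item> {
    let first = iter.next()?;
    match iter.next() {
        None => Some(first),
        Some(_) => None,
    }
}

/// Counts how many items the generic check pulls from a source.
fn items_pulled(source: &[u32]) -> usize {
    let pulled = Cell::new(0usize);
    let iter = source.iter().inspect(|_| pulled.set(pulled.get() + 1));
    let _ = exactly_one(iter);
    pulled.get()
}

/// For `Chars`, the check can run on a cheap clone, so the caller's
/// iterator is left untouched.
fn peek_single(chars: &Chars<'_>) -> Option<char> {
    exactly_one(chars.clone())
}
```

For a three-element source, `items_pulled` reports 2: the second item is consumed just to learn that it exists. With `Chars`, cloning copies only a pair of pointers into the string, so the original iterator can still yield its first character afterwards.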

Existing uses of .chars().exactly_one()

While the itertools method already exists, I still think it's worth moving into the standard library.

To show its utility, I did a cursory search of existing uses of .chars().exactly_one() on GitHub and, filtering for unique examples, got the following 25 files.

File links

Advent of Code (16)

Compilers/Interpreters (5)

Misc (4)

Of these 25 files:

  • 11 immediately .unwrap() the result (mostly Advent of Code)
  • 12 map a generic error or otherwise return without using the Err variant at all.
  • 1 uses the Err variant, but has an allow(dead_code) annotation on the resulting error type.

Disclaimers: this is only relevant for Chars, this excludes private and non-github repositories, and I only did a cursory, incomplete search on this one string.

However, I think these examples provide a strong argument for this method's inclusion in the standard library, and in particular for the variant with an Option return-type being available on Chars.

Links and related work

Note that the accepted String::into_chars ACP has some discussion around changing where Chars is defined and will likely conflict in git when the two changes merge.


Unchanged "What happens now?" and "Possible responses" sections

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
wrzian added the api-change-proposal and T-libs-api labels on Apr 20, 2025
@kennytm
Member

kennytm commented Apr 21, 2025

I'd prefer basing this on rust-lang/rust#81615 and doing something like text.chars().collect_array_exact::<1>()?[0]

@programmerjake
Member

one other option that only works for this task is:

    let Some((0, ch)) = s.char_indices().last() else {
        todo!();
    };
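Wrapped in a function, the snippet above might look like the following sketch on stable Rust, with the `todo!()` replaced by an early return. The name `single_char` is chosen here only for illustration.

```rust
fn single_char(s: &str) -> Option<char> {
    // The last char starts at byte index 0 only if it is also the first
    // (and therefore the only) char in the string.
    let Some((0, ch)) = s.char_indices().last() else {
        return None;
    };
    Some(ch)
}
```

An empty string fails the pattern because `last()` returns `None`, and a longer string fails it because its last char starts at a nonzero byte index.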

@wrzian
Author

wrzian commented Apr 21, 2025

These are both better options than we have. And .char_indices().last() is a nice insight to avoid calling .next() twice (collect_array_exact still would). I might propose updating to char_indices in Typst today since it's strictly more efficient. However, I'm still unhappy with how complex these are for the simplicity of the task, especially in the combinator case.

text
    .chars()
    .collect_array_exact::<1>()
    .get(0)
    .and_then(unicode_math_class)
// -----
text
    .char_indices()
    .last()
    .and_then(|(idx, ch)| (idx == 0).then_some(ch))
    .and_then(unicode_math_class)

My problem is that these have too many affordances: there are too many places where specific details matter. You can't assure yourself that the usage is correct without fully reading it.

My heuristic is whether I would expect a beginner to want to add a comment clarifying what the code does for their future self, and I think only .chars().single() passes.

@hanna-kruppe

    My problem is that these have too many affordances: there are too many places where specific details matter. You can't assure yourself that the usage is correct without fully reading it.

    My heuristic is whether I would expect a beginner to want to add a comment clarifying what the code does for their future self, and I think only .chars().single() passes.

These are very good reasons to abstract it into a function with a clear name, but it doesn’t necessarily have to be in the standard library. For example, you can add a single() method to Chars with an extension trait. If you’re worried about the future compatibility risks of that (low in this specific case IMO), a free function like fn single_char(s: &str) -> Option<char> may also be fine.
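A minimal sketch of what that could look like in user code follows. The trait name `CharsExt` is hypothetical, and the body is just one possible implementation.

```rust
use std::str::Chars;

/// Hypothetical extension trait; the name `CharsExt` is illustrative.
trait CharsExt {
    /// Returns the only remaining char, or `None` if there are zero or
    /// more than one.
    fn single(&self) -> Option<char>;
}

impl CharsExt for Chars<'_> {
    fn single(&self) -> Option<char> {
        // Work on a clone so the caller's iterator is not advanced.
        let mut rest = self.clone();
        let first = rest.next()?;
        match rest.next() {
            None => Some(first),
            Some(_) => None,
        }
    }
}

/// Free-function alternative that does not depend on method resolution.
fn single_char(s: &str) -> Option<char> {
    s.chars().single()
}
```

With the trait in scope, `"1".chars().single()` gives `Some('1')`, and both `"".chars().single()` and `"12".chars().single()` give `None`.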

@Amanieu
Member

Amanieu commented Apr 22, 2025

We discussed this in the @rust-lang/libs-api meeting and concluded that we would prefer having this functionality directly on the Iterator trait itself, which would make this available for other iterator types.

Several options are possible:

In the discussion we were in favor of accepting exactly_one as is, but we are also open to other alternatives if they seem better.

Amanieu closed this as not planned on Apr 29, 2025