Description
Proposal
Problem statement
In Typst, we occasionally use the pattern below to determine whether a string contains exactly one character. The check is conceptually simple, but the code for it is overcomplicated and confusing for beginners, and it deserves a solution in the standard library.
My first time reading this example I actually went to ChatGPT for an explanation and subsequently thought, "oh, I bet I can simplify this", but wasted my time as I found that the existing code is effectively optimal if using Rust combinator idioms.
Motivating examples or use cases
Simplified—but morally equivalent—code from the Typst parser:
fn math_class(text: &str) -> Option<MathClass> {
    let mut chars = text.chars();
    chars
        .next()
        .filter(|_| chars.next().is_none())
        .and_then(unicode_math_class)
}
(Note: We can't write .filter(|_| chars.is_empty()) because Chars doesn't implement ExactSizeIterator.)
This function is used to determine a string's Unicode math class, which the parser uses to make a token act as an opening or closing delimiter in math text. For example, the math text $[0, infinity)$ creates a delimiter element for typesetting (which will grow to surround its contents) by matching the square bracket and parenthesis using their Unicode class designations.
The Unicode math class is only defined for single codepoints, so the code verifies that the string has at least one character by calling chars.next(), then ensures the string has at most one character by checking that a second call to chars.next() (now the second element in the iterator) is None.
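The two checks can be tried in isolation with a small stand-in for the classifier. This is only a sketch: Delim, delim_class, and single_delim are hypothetical names invented here, and delim_class merely imitates unicode_math_class for a handful of brackets.

```rust
/// Hypothetical stand-in for `MathClass`, covering only two cases.
#[derive(Debug, PartialEq)]
enum Delim {
    Opening,
    Closing,
}

/// Hypothetical stand-in for `unicode_math_class`.
fn delim_class(c: char) -> Option<Delim> {
    match c {
        '(' | '[' | '{' => Some(Delim::Opening),
        ')' | ']' | '}' => Some(Delim::Closing),
        _ => None,
    }
}

/// Same shape as `math_class` above, using the stand-in classifier.
fn single_delim(text: &str) -> Option<Delim> {
    let mut chars = text.chars();
    chars
        .next() // `None` if the string is empty
        .filter(|_| chars.next().is_none()) // discard if a second char follows
        .and_then(delim_class)
}
```

With these definitions, single_delim("[") yields Some(Delim::Opening), while both "" and "[]" yield None.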
Part of the complexity here comes from the need to create the mutable chars variable to call .next() twice, forcing the user to type two Rust statements for something conceptually simple. However, rewriting this as a single method chain gives the following options, and I think we can agree that these are more confusing than before:
// One of `fold` or `reduce` is necessary to reason about two chars together.

// With `reduce`:
fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .take(2) // avoid iterating over more than two items in fold/reduce
        .map(Some)
        .reduce(|_, _| None)
        .flatten()
        .and_then(unicode_math_class)
}

// With `fold`:
fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .take(2)
        .fold(None, |prev, c| prev.xor(Some(c)))
        .and_then(unicode_math_class)
}
This is not the only example, and the need for this pattern pops up throughout the compiler when we operate on strings that may have typesetting properties unique to single codepoints.
Also see the existing uses of itertools::exactly_one below.
Solution sketch
My preferred solution would be to add a method, Chars::single(&self) -> Option<char>, which would allow the following code:
fn math_class(text: &str) -> Option<MathClass> {
    text
        .chars()
        .single()
        .and_then(unicode_math_class)
}
I've written a draft implementation of this and would like to publish it as a PR if the proposal is accepted.
My draft implementation
// library/core/src/str/iter.rs::8
use crate::char::MAX_LEN_UTF8;

// library/core/src/str/iter.rs::138
impl<'a> Chars<'a> {
    /// ...
    pub fn as_str(&self) -> &'a str { ... }

    /// Returns a single [`char`] if exactly one is present in the iterator.
    ///
    /// Returns `None` both when the iterator is empty and when the iterator has
    /// more than one element.
    ///
    /// # Examples
    /// ```
    /// #![feature(chars_single)]
    ///
    /// assert_eq!("".chars().single(), None);
    /// assert_eq!("1".chars().single(), Some('1'));
    /// assert_eq!("12".chars().single(), None);
    /// ```
    #[unstable(feature = "chars_single", issue = "none")] // TODO: Add issue
    #[must_use]
    #[inline]
    pub fn single(&self) -> Option<char> {
        // More bytes than the longest UTF-8 encoding means more than one char.
        if self.iter.len() > MAX_LEN_UTF8 {
            // I have not measured it, but this early return should be fine to
            // keep and can be removed later if actually slow.
            return None;
        }
        let mut dup = self.clone();
        let first = dup.next();
        if dup.iter.is_empty() { first } else { None }
    }
}
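For anyone who wants to try the behaviour outside of core, the same logic can be approximated on stable Rust with an extension trait. This is a sketch under assumptions, not the proposed implementation: the trait name CharsSingleExt is made up here, and it goes through the public Chars::as_str instead of the private iter field.

```rust
use std::str::Chars;

/// Hypothetical extension trait approximating the proposed `Chars::single`.
trait CharsSingleExt {
    fn single(&self) -> Option<char>;
}

impl CharsSingleExt for Chars<'_> {
    fn single(&self) -> Option<char> {
        // A `char` encodes to at most 4 bytes in UTF-8, so more remaining
        // bytes than that means more than one remaining char.
        if self.as_str().len() > 4 {
            return None;
        }
        // Work on a clone so the caller's iterator is left untouched.
        let mut dup = self.clone();
        let first = dup.next();
        if dup.as_str().is_empty() { first } else { None }
    }
}
```

Because the method takes &self and clones internally, calling it does not advance the iterator it is called on.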
Here is my rationale for some of the minutiae of this method, before discussing the main alternatives:
- Why a method on Chars as opposed to a method on str, e.g. str::single_char(&self) -> Option<char>?
  Chars is already where you go to get individual characters generally. Putting it on str would make the two compete for attention. Adding a method to Chars centralizes this need, making it more discoverable. Chars is inherently plural, so the name single makes it unambiguous that the method checks whether there is just one char.
- Why not return Result<char, Option<(char, char)>> (or an isomorphic enum)?
  - We don't need to call .next() twice to check if there is exactly one character. The internal iterator of Chars implements ExactSizeIterator, so we can check .is_empty() after the first character. See my draft implementation.
  - The use of the existing pattern in Typst is to distinguish whether a string contains exactly one character for typesetting or for checking a property defined on single codepoints. In typesetting, strings with multiple characters generally need to be handled as grapheme clusters of potentially more than two characters, so returning two characters in the Err variant isn't helpful. And for properties defined on single characters, there's no meaningful value if the string has more than one character, so None is expected.
  - Making use of a Result isn't relevant for the combinator methods in the motivating example. To actually unpack the values, you would need to match on the result, which would be creating a full statement, leaving the world of combinator methods. This would be similarly ergonomic to let mut chars = ...; match (chars.next(), chars.next()) { ... }.
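For comparison, the statement-based form mentioned in the last point might look like the sketch below. The function name describe and its string results are invented purely for illustration.

```rust
/// Distinguishes all three cases by matching on the first two characters.
fn describe(text: &str) -> &'static str {
    let mut chars = text.chars();
    // Tuple fields are evaluated left to right, so this reads the first
    // and then the second character.
    match (chars.next(), chars.next()) {
        (Some(_), None) => "exactly one",
        (None, _) => "empty",
        (Some(_), Some(_)) => "more than one",
    }
}
```

This keeps every case distinguishable, but it needs a separate let binding and a full match, which is the ergonomic cost that an Option return type avoids when the caller only cares about the single-character case.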
Alternatives
The major alternative would be a method on Iterator. Indeed, Iterator::single was already proposed and turned down because of uncertainty about the exact API guarantees and about whether it needs to be present on such a core trait.
Instead, the method was added to itertools as exactly_one(self) -> Result<Self::Item, ExactlyOneError>.
Two notable comments on the PR:
- I'm really not a fan of making "0 elements" and "more than one element" indistinguishable. I feel like the whole point of single is that it's a coding error for there to be multiple -- if it's going to be handled, one would just use .next() instead.
- My personal gut reaction to this method is that it's not worth it to add to the standard library, so I don't really have many opinions about the precise pieces here. If signatures like Result<T, Option<[T; 2]>> are being considered though that sounds like this belongs not in libstd.
I think these API critiques in the issue are fair: In the general iterator case, you will need to call .next() two times to check for there being exactly one item, and doing so may or may not consume resources or cause side-effects. Leaving this choice to user code or external libraries may be for the best when we cannot make guarantees against side-effects. However, we can make these guarantees in the specific case of Chars.
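A small experiment can make the difference concrete. The sketch below uses hypothetical helper names (exactly_one_generic, items_consumed, peek_single) and counts how many items a generic exactly-one check pulls out of a side-effecting source, then contrasts that with checking a cloned Chars.

```rust
use std::cell::Cell;
use std::str::Chars;

/// Exactly-one check written against the plain `Iterator` interface.
fn exactly_one_generic<I: Iterator>(mut iter: I) -> Option<I::Item> {
    let first = iter.next()?;
    match iter.next() {
        None => Some(first),
        Some(_) => None,
    }
}

/// Returns how many items the generic check consumed from a source that
/// records every item it yields (the recording is the "side effect").
fn items_consumed(available: u32) -> u32 {
    let yielded = Cell::new(0);
    let source = (0..available).inspect(|_| yielded.set(yielded.get() + 1));
    let _ = exactly_one_generic(source);
    yielded.get()
}

/// For `Chars`, the check can run on a clone, leaving the original as is.
fn peek_single(chars: &Chars<'_>) -> Option<char> {
    exactly_one_generic(chars.clone())
}
```

When the source has more than one item, the generic check has already consumed a second one by the time it answers, whereas cloning Chars only copies a pair of pointers into the string and has no observable effect on the original iterator.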
Existing uses of .chars().exactly_one()
While the itertools method already exists, I still think it's worth moving into the standard library.
To show its utility, I did a cursory search of existing uses of .chars().exactly_one() on GitHub and, filtering for unique examples, got the following 25 files.
File links
Advent of Code (16)
- https://github.com/ArhanChaudhary/advent-of-code/blob/main/2023/day-18/src/bin/part1.rs
- https://github.com/citizenmatt/Advent-of-Code/blob/master/advent-of-code-2020/src/day2.rs
- https://github.com/danielhuang/aoc-2022/blob/master/src/bin/2.rs
- https://github.com/iTitus/aoc2023/blob/main/src/day18.rs
- https://github.com/leo60228/aoc2021/blob/main/src/bin/d14s1.rs
- https://github.com/pavadeli/aoc-2024/blob/master/day24/src/circuit.rs
- https://github.com/tocklime/aoc-rs/blob/master/aoc/src/solutions/y2019/day03.rs
- https://github.com/plaflamme/aoc-2021-rs/blob/master/aoc_2021/src/day14.rs
- https://github.com/ngc0202/AdventOfCode/blob/master/src/y2022/day2.rs
- https://github.com/bm-w/advent20-rs/blob/main/src/day19.rs
- https://github.com/rust-tw/advent-of-code/blob/main/2022/05/parser_practice/src/main.rs
- https://github.com/unlimitedsola/advent-of-code/blob/main/src/y2023/day18.rs
- https://github.com/memark/advent-of-code/blob/main/2022/day-21/src/main.rs
- https://github.com/JungPhilipp/AdventOfCode/blob/master/src/problems_2022/day21/mod.rs
- https://github.com/c-weis/rusty-advent-2024/blob/main/src/bin/day24.rs
- https://github.com/bm-w/advent22-rs/blob/main/src/day11.rs
Compilers/Interpreters (5)
- https://github.com/RustPython/RustPython/blob/main/vm/src/stdlib/msvcrt.rs
- https://github.com/RustPython/RustPython/blob/main/stdlib/src/array.rs
- https://github.com/lvyitian/RustPython_x86_32win_build/blob/master/vm/src/stdlib/unicodedata.rs
- https://github.com/Carlton-Perkins/COMP4060-Compiler/blob/master/src/common/types.rs
- https://github.com/circles-png/li/blob/main/src/main.rs
Misc (4)
- https://github.com/1Computer1/kanjidle/blob/main/server/src/data.rs
- https://github.com/rollrat/youtube-dl-rs/blob/main/vm/src/cformat.rs
- https://github.com/k9withabone/autocast/blob/main/src/config/de/key.rs
- https://github.com/HactarCE/Hyperspeedcube/blob/main/crates/hyperpuzzle_lua/src/lua/types/wrappers/vector_index.rs
Of these 25 files:
- 11 immediately .unwrap() the result (mostly Advent of Code)
- 12 map a generic error or otherwise return without using the Err variant at all.
- 1 uses the Err variant, but has an allow(dead_code) annotation on the resulting error type.
Disclaimers: this is only relevant for Chars, this excludes private and non-GitHub repositories, and I only did a cursory, incomplete search on this one string.
However, I think these examples provide a strong argument for this method's inclusion in the standard library, and in particular for the variant with an Option return type being available on Chars.
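To illustrate why an Option return type seems sufficient for the two dominant categories, here is a rough sketch of what such call sites could look like. All three function names are hypothetical, and single_char is just the existing two-statement pattern wrapped in a helper, standing in for the proposed method.

```rust
/// Stand-in for the proposed method, built from the existing pattern.
fn single_char(s: &str) -> Option<char> {
    let mut chars = s.chars();
    chars.next().filter(|_| chars.next().is_none())
}

/// "Immediately unwrap" style: the input is assumed to be well formed.
fn parse_direction(token: &str) -> char {
    single_char(token).unwrap()
}

/// "Map to a generic error" style: the error never inspects the extra chars.
fn parse_key(token: &str) -> Result<char, String> {
    single_char(token)
        .ok_or_else(|| format!("expected a single character, got {token:?}"))
}
```

In neither style does the caller need to know whether the string was empty or too long, which matches what the surveyed files do with the Err variant.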
Links and related work
- The full motivating example in Typst: https://github.com/typst/typst/blob/3e6691a93bd8c2947bd22e3c7344c2fab7d1241f/crates/typst-syntax/src/parser.rs#L455-L472
- The original Iterator::single forum discussion: https://internals.rust-lang.org/t/what-do-you-think-about-iterator-single/8608/3
- The subsequent stdlib pull request: Add Iterator::single rust#55355
- Another forum post on this: https://users.rust-lang.org/t/ensuring-an-iterator-yields-only-one-item/120595
- The initial pull request adding Itertools::exactly_one: Add exactly_one function rust-itertools/itertools#310
Note that the accepted String::into_chars ACP has some discussion around changing where Chars is defined, so the two changes will likely produce git merge conflicts when both land.
Unchanged "What happens now?" and "Possible responses" sections
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.