- Proposal: SE-0351
- Authors: Richard Wei, Michael Ilseman, Nate Cook, Alejandro Alonso
- Review Manager: Ben Cohen
- Implementation: apple/swift-experimental-string-processing
- Available in nightly toolchain snapshots with import _StringProcessing
- Status: Implemented (Swift 5.7)
- Review: (pitch) (first review) (revision) (second review) (acceptance)
Declarative string processing aims to offer powerful pattern matching capabilities with expressivity, clarity, type safety, and ease of use. To achieve this, we propose to introduce a result-builder-based DSL, regex builder, for creating and composing regular expressions (regexes).
Regex builder is part of the Swift Standard Library but resides in a standalone module named RegexBuilder. By importing RegexBuilder, you get all necessary API for building a regex.
import RegexBuilder
let emailPattern = Regex {
let word = OneOrMore(.word)
Capture {
ZeroOrMore {
word
"."
}
word
}
"@"
Capture {
word
OneOrMore {
"."
word
}
}
} // => Regex<(Substring, Substring, Substring)>
let email = "My email is my.name@mail.swift.org."
if let match = try emailPattern.firstMatch(in: email) {
let (wholeMatch, name, domain) = match.output
// wholeMatch: "my.name@mail.swift.org"
// name: "my.name"
// domain: "mail.swift.org"
}
This proposal introduces all core API for creating and composing regexes that echo the textual regex syntax and provide strongly typed regex captures, but it does not formally specify the matching semantics or define character classes.
Regex is a fundamental and powerful tool for textual pattern matching. It is a domain-specific language often expressed as text. For example, given the following bank statement:
CREDIT 04062020 PayPal transfer $4.99
CREDIT 04032020 Payroll $69.73
DEBIT 04022020 ACH transfer $38.25
DEBIT 03242020 IRS tax payment $52249.98
One can write the following textual regex to match each line:
(CREDIT|DEBIT)\s+(\d{2}\d{2}\d{4})\s+([\w\s]+\w)\s+(\$\d+\.\d{2})
While a regex like this is very compact and expressive, it is very difficult to read, write, and use:
- Syntactic special characters, e.g. \, (, [, {, are too dense to be readable.
- It contains a hierarchy of subpatterns fit into a single line of text.
- No code completion when typing syntactic components.
- Capturing groups produce raw data (i.e. a range or a substring) and can only be converted to other data structures after matching.
- While comments (?#...) can be added inline, they only complicate readability.
We introduce regex builder, a result-builder-based API for creating and composing regexes. This API resides in a new module named RegexBuilder that is to be shipped as part of the Swift toolchain.
With regex builder, the regex for matching a bank statement can be written as the following:
import RegexBuilder
enum TransactionKind: String {
case credit = "CREDIT"
case debit = "DEBIT"
}
struct Date {
var month, day, year: Int
init?(mmddyyyy: String) { ... }
}
struct Amount {
var valueTimes100: Int
init?(twoDecimalPlaces text: Substring) { ... }
}
let statementPattern = Regex {
// Parse the transaction kind.
TryCapture {
ChoiceOf {
"CREDIT"
"DEBIT"
}
} transform: {
TransactionKind(rawValue: String($0))
}
OneOrMore(.whitespace)
// Parse the date, e.g. "01012021".
TryCapture {
Repeat(.digit, count: 2)
Repeat(.digit, count: 2)
Repeat(.digit, count: 4)
} transform: { Date(mmddyyyy: String($0)) } // `init?(mmddyyyy:)` takes a String, so convert the Substring.
OneOrMore(.whitespace)
// Parse the transaction description, e.g. "ACH transfer".
Capture {
OneOrMore(CharacterClass(.word, .whitespace))
CharacterClass.word
} transform: { String($0) }
OneOrMore(.whitespace)
"$"
// Parse the amount, e.g. `$100.00`.
TryCapture {
OneOrMore(.digit)
"."
Repeat(.digit, count: 2)
} transform: { Amount(twoDecimalPlaces: $0) }
} // => Regex<(Substring, TransactionKind, Date, String, Amount)>
let statement = """
CREDIT 04062020 PayPal transfer $4.99
CREDIT 04032020 Payroll $69.73
DEBIT 04022020 ACH transfer $38.25
DEBIT 03242020 IRS tax payment $52249.98
"""
for match in statement.matches(of: statementPattern) {
let (line, kind, date, description, amount) = match.output
...
}
Regex builder addresses all of textual regexes' shortcomings presented in the Motivation section:
- Capture groups and quantifiers are expressed as API calls that are easy to read.
- Scoping and indentations clearly distinguish subpatterns in the hierarchy.
- Code completion is available when the developer types an API call.
- Capturing groups can be transformed into structured data at the regex declaration site.
- Normal code comments can be written within a regex declaration to further improve readability.
One of the goals of the regex builder DSL is to allow developers to easily compose regexes from common currency types and literals, or even define custom patterns to use for matching. We introduce RegexComponent in the implicitly-imported Swift module, a protocol that unifies all types that can represent a component of a regex. Since regexes are composable, the Regex type itself conforms to RegexComponent.
public protocol RegexComponent<RegexOutput> {
associatedtype RegexOutput
var regex: Regex<RegexOutput> { get }
}
extension Regex: RegexComponent {
public typealias RegexOutput = Output
public var regex: Regex<Output> { self }
}
Note:
- RegexComponent and Regex's conformance to RegexComponent are available without importing RegexBuilder. All other types and conformances introduced in this proposal are in the RegexBuilder module.
- The associated type RegexOutput intentionally has a Regex prefix. Output would cause confusion in standard library conforming types such as String, i.e. String.Output.
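As a rough illustration of the protocol above, a custom type can conform to RegexComponent by exposing a regex property, after which it can be used wherever a component is accepted. The TwoDigits type, the clockTime pattern, and the sample input are hypothetical names chosen for this sketch and are not part of the proposal.

```swift
import RegexBuilder

// Hypothetical component (not part of the proposal): exactly two decimal digits.
struct TwoDigits: RegexComponent {
  var regex: Regex<Substring> {
    Regex {
      Repeat(.digit, count: 2)
    }
  }
}

// A custom component composes like any built-in component.
let clockTime = Regex {
  Capture { TwoDigits() }
  ":"
  Capture { TwoDigits() }
} // => Regex<(Substring, Substring, Substring)>

// Sample input is made up for illustration.
let clockMatch = "Stand-up at 09:41 today".firstMatch(of: clockTime)
```

Because TwoDigits has no captures of its own, each Capture around it contributes one Substring to the output tuple.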
By conforming standard library types to RegexComponent, we allow them to be used inside the regex builder DSL as a match target. These conformances are available in the RegexBuilder module.
// A string represents a regex that matches the string.
extension String: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A substring represents a regex that matches the substring.
extension Substring: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A character represents a regex that matches the character.
extension Character: RegexComponent {
public var regex: Regex<Substring> { get }
}
// A unicode scalar represents a regex that matches the scalar.
extension UnicodeScalar: RegexComponent {
public var regex: Regex<Substring> { get }
}
// To be introduced in a future pitch.
extension CharacterClass: RegexComponent {
public var regex: Regex<Substring> { get }
}
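The conformances above might be exercised as in the following minimal sketch, which mixes String, Character, and Substring values in one builder block. The variable names and the sample input are assumptions made for illustration.

```swift
import RegexBuilder

// Hypothetical values used only for this sketch.
let comma: Character = ","
let planet: Substring = "world"

// Each value below is itself a component that matches its own text.
let greeting = Regex {
  "Hello"   // String
  comma     // Character
  " "       // String
  planet    // Substring
} // => Regex<Substring>

let greetingMatch = "Hello, world!".prefixMatch(of: greeting)
```

None of these components capture anything, so the resulting output type is just Substring, the whole match.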
All of the regex builder DSL APIs in the rest of this proposal accept generic components that conform to RegexComponent.
A regex can be viewed as a concatenation of smaller regexes. In the regex builder DSL, RegexComponentBuilder is the basic facility to allow developers to compose regexes by concatenation.
@resultBuilder
public enum RegexComponentBuilder { ... }
A closure marked with @RegexComponentBuilder will be transformed to produce a Regex by concatenating all of its components, where the result type's Output type will be a Substring followed by concatenated captures (tuple when plural).
Regex is a generic type with generic parameter Output.
struct Regex<Output> { ... }
When a regex does not contain any capturing groups, its Output type is Substring, which represents the whole matched portion of the input.
let noCaptures = #/a/# // => Regex<Substring>
When a regex contains capturing groups, i.e. (...), the Output type is extended as a tuple to also contain capture types. Capture types are tuple elements after the first element.
// .0 is the whole match; .1, .2, and .3 are the capture types of (b+), (d+), and (f)?.
let yesCaptures = #/a(?:(b+)c(d+))+e(f)?/# // => Regex<(Substring, Substring, Substring, Substring?)>
We introduce a new initializer Regex.init(_:) which accepts a @RegexComponentBuilder closure. This initializer is the entry point for creating a regex using the regex builder DSL.
extension Regex {
public init<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) where R.RegexOutput == Output
}
Example:
Regex {
regex0 // Regex<Substring>
regex1 // Regex<(Substring, Int)>
regex2 // Regex<(Substring, Float)>
regex3 // Regex<(Substring, Substring)>
} // Regex<(Substring, Int, Float, Substring)>
The above regex will be transformed to:
Regex {
let e0 = RegexComponentBuilder.buildExpression(regex0) // Regex<Substring>
let e1 = RegexComponentBuilder.buildExpression(regex1) // Regex<(Substring, Int)>
let e2 = RegexComponentBuilder.buildExpression(regex2) // Regex<(Substring, Float)>
let e3 = RegexComponentBuilder.buildExpression(regex3) // Regex<(Substring, Substring)>
let r0 = RegexComponentBuilder.buildPartialBlock(first: e0)
let r1 = RegexComponentBuilder.buildPartialBlock(accumulated: r0, next: e1)
let r2 = RegexComponentBuilder.buildPartialBlock(accumulated: r1, next: e2)
let r3 = RegexComponentBuilder.buildPartialBlock(accumulated: r2, next: e3)
return r3
} // Regex<(Substring, Int, Float, Substring)>
The following example creates a regex by concatenating subpatterns.
let regex = Regex {
"regex builder "
"is "
"so easy"
}
let match = try regex.prefixMatch(in: "regex builder is so easy!")
match?.0 // => "regex builder is so easy"
API definition
Basic methods in RegexComponentBuilder, e.g. buildBlock(), provide support for creating the most fundamental blocks. The buildExpression method wraps a user-provided component in a RegexComponentBuilder.Component structure before passing the component to other builder methods. This saves the source location of the component so that runtime errors can be reported with an accurate location.
@resultBuilder
public enum RegexComponentBuilder {
/// Returns an empty regex.
public static func buildBlock() -> Regex<Substring>
/// A builder component that stores a regex component and its source location
/// for debugging purposes.
public struct Component<Value: RegexComponent> {
public var value: Value
public var file: String
public var function: String
public var line: Int
public var column: Int
}
/// Returns a component by wrapping the component regex in `Component` and
/// recording its source location.
public static func buildExpression<R: RegexComponent>(
_ regex: R,
file: String = #file,
function: String = #function,
line: Int = #line,
column: Int = #column
) -> Component<R>
}
RegexComponentBuilder uses buildPartialBlock to concatenate all components' capture types into a single result tuple. buildPartialBlock(first:) supports creating a regex from a single component, and buildPartialBlock(accumulated:next:) supports creating a regex from multiple results.
Until Swift supports variadic generics, buildPartialBlock(accumulated:next:) must be overloaded to support concatenating regexes of supported capture quantities (arities). It is overloaded up to arity^2 times to account for all possible pairs of regexes that make up 10 captures.
In the initial version of the DSL, we plan to support regexes with up to 10 captures, as 10 captures are sufficient for most use cases. These overloads can be superseded by variadic versions of buildPartialBlock(first:) and buildPartialBlock(accumulated:next:) in a future release.
extension RegexComponentBuilder {
@_disfavoredOverload
public static func buildPartialBlock<R: RegexComponent>(
first r: Component<R>
) -> Regex<R.RegexOutput>
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single method:
//
// public static func buildPartialBlock<
// AccumulatedWholeMatch, NextWholeMatch,
// AccumulatedCapture..., NextCapture...,
// Accumulated: RegexComponent, Next: RegexComponent
// >(
// accumulated: Accumulated, next: Component<Next>
// ) -> Regex<(Substring, AccumulatedCapture..., NextCapture...)>
// where Accumulated.RegexOutput == (AccumulatedWholeMatch, AccumulatedCapture...),
// Next.RegexOutput == (NextWholeMatch, NextCapture...)
public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0)
public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0, C1)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0, C1)
public static func buildPartialBlock<W0, W1, C0, C1, C2, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0, C1, C2)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0, C1, C2)
// ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}
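To make the effect of these overloads concrete, here is a small sketch in which two capturing regexes are concatenated and the compiler is asked to check the resulting output type through an explicit annotation. The names key, count, and setting and the sample input are hypothetical.

```swift
import RegexBuilder

let key = Regex {
  Capture(OneOrMore(.word))
} // => Regex<(Substring, Substring)>

let count = Regex {
  TryCapture(OneOrMore(.digit)) { Int($0) }
} // => Regex<(Substring, Int)>

// The explicit annotation makes the compiler verify the concatenated type:
// the whole match first, then the captures of `key` and `count` in order.
let setting: Regex<(Substring, Substring, Int)> = Regex {
  key
  "="
  count
}

// Sample input is made up for illustration.
let settingMatch = "retries=3".wholeMatch(of: setting)
```

The whole-match elements of key and count are dropped during concatenation; only their capture types are carried into the result tuple.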
To support if #available(...) statements, buildLimitedAvailability(_:) is defined with overloads to support up to 10 captures. The overload for non-capturing regexes, due to the lack of generic constraints, must be annotated with @_disfavoredOverload in order not to shadow other overloads. We expect that a variadic-generic version of this method will eventually supersede all of these overloads.
extension RegexComponentBuilder {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single method:
//
// public static func buildLimitedAvailability<
// Component, WholeMatch, Capture...
// >(
// _ component: Component
// ) where Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public static func buildLimitedAvailability<R: RegexComponent>(
_ component: Component<R>
) -> Regex<Substring>
public static func buildLimitedAvailability<W, C0, R: RegexComponent>(
_ component: Component<R>
) -> Regex<(Substring, C0?)> where R.RegexOutput == (W, C0)
public static func buildLimitedAvailability<W, C0, C1, R: RegexComponent>(
_ component: Component<R>
) -> Regex<(Substring, C0?, C1?)> where R.RegexOutput == (W, C0, C1)
// ... `O(arity)` overloads of `buildLimitedAvailability(_:)`
}
buildOptional and buildEither are intentionally not supported due to ergonomic issues and fundamental semantic differences between regex conditionals and result builder conditionals. Please refer to the alternatives considered section for detailed rationale.
Capture is a common regex feature that saves a portion of the input upon match. In regex builder, Capture and TryCapture are regex components that produce a new regex by inserting the captured pattern's whole match (.0) to the .1 position of RegexOutput. When a transform closure is provided, the whole match (.0) of the captured content will be transformed using the closure.
public struct Capture<Output>: RegexComponent { ... }
public struct TryCapture<Output>: RegexComponent { ... }
To do a simple capture, you provide Capture with a regex component or a regex component builder closure.
// Equivalent: '(CREDIT|DEBIT)'
Capture {
ChoiceOf {
"CREDIT"
"DEBIT"
}
} // `.RegexOutput == (Substring, Substring)`
A capture will be represented in the type signature as a slice of the input, i.e. Substring. To transform the captured substring into another value during matching, specify a transform: closure.
// This example is similar to the one above, however in this example we
// transform the result of the capture into:
// "Transaction Kind: CREDIT" or "Transaction Kind: DEBIT"
Capture {
ChoiceOf {
"CREDIT"
"DEBIT"
}
} transform: {
"Transaction Kind: \($0)"
} // `.RegexOutput == (Substring, String)`
The transform closure can throw. When a transform closure throws during matching, the matching will abort and the error will be propagated directly to the top-level matching API that's being called, e.g. Regex.wholeMatch(in:) and Regex.prefixMatch(in:). Aborting is useful for cases where you know that matching can never succeed or when you detect that an important invariant has been violated and the matching procedure needs to be aborted.
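The aborting behavior described above could look like the following sketch, in which a transform throws when a captured number does not fit in a UInt8. The ByteOverflow error, the byte pattern, and the parseByte helper are hypothetical and exist only for this illustration; the sketch assumes the error surfaces from wholeMatch(in:) as the paragraph above states.

```swift
import RegexBuilder

// Hypothetical error type used only for this illustration.
struct ByteOverflow: Error {}

let byte = Regex {
  Capture(OneOrMore(.digit)) { text -> UInt8 in
    // Throwing aborts matching entirely instead of backtracking.
    guard let value = UInt8(text) else { throw ByteOverflow() }
    return value
  }
} // => Regex<(Substring, UInt8)>

// Hypothetical helper: wraps the throwing match call in a Result.
func parseByte(_ input: String) -> Result<UInt8?, Error> {
  Result { try byte.wholeMatch(in: input)?.output.1 }
}
```

A caller can then distinguish "no match" (a successful result holding nil) from "matching was aborted" (a failure carrying the thrown error).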
An alternative version of capture is called TryCapture, which works in cases where you want to transform the capture, but the transformation may return nil. When a nil is returned, the regex engine backtracks and tries an alternative. For example, TryCapture makes it easy to directly transform a capture by calling a failable initializer during matching.
enum TransactionKind: String {
case credit = "CREDIT"
case debit = "DEBIT"
}
TryCapture {
ChoiceOf {
"CREDIT"
"DEBIT"
}
} transform: {
// This initializer may return nil which is why we used TryCapture.
TransactionKind(rawValue: String($0))
}
API definition
public struct Capture<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct TryCapture<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
Below are the Capture and TryCapture initializer variants for capture arity 0. Higher capture arities are omitted for simplicity.
extension Capture {
public init<R: RegexComponent, W>(
_ component: R
) where Output == (Substring, W), R.RegexOutput == W
public init<R: RegexComponent, W>(
_ component: R, as reference: Reference<W>
) where Output == (Substring, W), R.RegexOutput == W
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
transform: @Sendable @escaping (W) throws -> NewCapture
) where Output == (Substring, NewCapture), R.RegexOutput == W
public init<R: RegexComponent, W>(
@RegexComponentBuilder _ component: () -> R
) where Output == (Substring, W), R.RegexOutput == W
// ... `O(arity)` overloads
}
extension TryCapture {
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
transform: @Sendable @escaping (W) throws -> NewCapture?
) where Output == (Substring, NewCapture), R.RegexOutput == W
public init<R: RegexComponent, W, NewCapture>(
@RegexComponentBuilder _ component: () -> R,
transform: @Sendable @escaping (W) throws -> NewCapture?
) where Output == (Substring, NewCapture), R.RegexOutput == W
// ... `O(arity)` overloads
}
In addition to transforming individual captures within a regex, you can also map the output of an entire regex to a different output type. You can use the mapOutput(_:) methods to reorder captures, flatten nested optionals, or create instances of a custom type.
This example shows how you can transform the output of a regex with three capture groups into an instance of a custom SemanticVersion type, matching strings such as "1.0.0" or "1.0":
struct SemanticVersion: Hashable {
var major, minor, patch: Int
}
let semverRegex = Regex {
TryCapture(OneOrMore(.digit)) { Int($0) }
"."
TryCapture(OneOrMore(.digit)) { Int($0) }
Optionally {
"."
TryCapture(OneOrMore(.digit)) { Int($0) }
}
}.mapOutput { _, c1, c2, c3 in
SemanticVersion(major: c1, minor: c2, patch: c3 ?? 0)
}
let semver1 = "1.11.4".firstMatch(of: semverRegex)?.output
// semver1 == SemanticVersion(major: 1, minor: 11, patch: 4)
let semver2 = "0.6".firstMatch(of: semverRegex)?.output
// semver2 == SemanticVersion(major: 0, minor: 6, patch: 0)
API definition
Note: This extension is defined in the standard library, not the RegexBuilder module.
extension Regex {
/// Returns a regex that transforms its matches using the given closure.
///
/// When you call `mapOutput(_:)` on a regex, you change the type of
/// output available on each match result. The `body` closure is called
/// when each match is found to transform the result of the match.
///
/// - Parameter body: A closure for transforming the output of this
/// regex.
/// - Returns: A regex that has `NewOutput` as its output type.
func mapOutput<NewOutput>(_ body: @escaping (Output) -> NewOutput) -> Regex<NewOutput>
}
Reference is a feature that can be used to achieve named captures and named backreferences from textual regexes. Simply state what type the reference will hold, and you can use it after matching a string to get back a specific capture. Note that the type you pass to Reference must be the result type of the capture's transform. A capture with no transform always has a reference type of Substring.
let kind = Reference(Substring.self)
let regex = Capture(as: kind) {
ChoiceOf {
"CREDIT"
"DEBIT"
}
}
let input = "CREDIT"
if let result = try regex.firstMatch(in: input) {
print(result[kind]) // Optional("CREDIT")
}
Capturing stores the most recently captured content, and references can be used as a name to look up the result of matching. The reference itself can also be used within a regex (commonly called a "backreference") to match the most recently captured content during matching.
let a = Reference(Substring.self)
let b = Reference(Substring.self)
let c = Reference(Substring.self)
let regex = Regex {
Capture("abc", as: a)
Capture("def", as: b)
ZeroOrMore {
Capture("hij", as: c)
}
a
Capture(b)
}
if let result = try regex.firstMatch(in: "abcdefabcdef") {
print(result[a]) // => Optional("abc")
print(result[b]) // => Optional("def")
print(result[c]) // => nil
}
A regex is considered invalid when it contains a use of a reference without it ever being used as the as: argument to an initializer of Capture or TryCapture in the regex. When this occurs in the regex builder DSL, a runtime error will be reported.
Similarly, the argument to a Regex.Match.subscript(_:) must have been used as the as: argument to an initializer of Capture or TryCapture in the regex that produced the match.
API definition
/// A reference to a regex capture.
public struct Reference<Capture>: RegexComponent {
public init(_ captureType: Capture.Type = Capture.self)
public var regex: Regex<Capture> { get }
}
extension Capture {
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
as reference: Reference<NewCapture>,
transform: @escaping (Substring) throws -> NewCapture
) where Output == (Substring, NewCapture), R.RegexOutput == W
public init<R: RegexComponent, W>(
as reference: Reference<W>,
@RegexComponentBuilder _ component: () -> R
) where Output == (Substring, W), R.RegexOutput == W
// ... `O(arity)` overloads
}
extension TryCapture {
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
as reference: Reference<NewCapture>,
transform: @escaping (Substring) throws -> NewCapture?
) where Output == (Substring, NewCapture), R.RegexOutput == W
public init<R: RegexComponent, W, NewCapture>(
as reference: Reference<NewCapture>,
@RegexComponentBuilder _ component: () -> R,
transform: @escaping (Substring) throws -> NewCapture?
) where Output == (Substring, NewCapture), R.RegexOutput == W
// ... `O(arity)` overloads
}
extension Regex.Match {
/// Returns the capture referenced by the given reference.
///
/// - Precondition: The reference must have been captured in the regex that produced this match.
public subscript<Capture>(_ reference: Reference<Capture>) -> Capture? { get }
}
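As a sketch of how a reference combines with a transform, the following example binds a Reference of type Int to a TryCapture and looks the value up from the match by reference rather than by tuple position. The names quantity and lineItem and the sample input are hypothetical.

```swift
import RegexBuilder

// The reference type matches the transform's result type, Int.
let quantity = Reference(Int.self)

let lineItem = Regex {
  OneOrMore(.word)
  " x"
  TryCapture(as: quantity) {
    OneOrMore(.digit)
  } transform: {
    Int($0)
  }
}

// Sample input is made up for illustration.
let lineItemMatch = "widget x12".firstMatch(of: lineItem)
if let match = lineItemMatch {
  // Look up the capture by reference instead of by position.
  print(match[quantity])
}
```

Looking captures up by reference keeps call sites stable when captures are later added to or removed from the pattern.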
An alternation is used to match one of multiple patterns. When one pattern in an alternation does not match successfully, the regex engine tries the next pattern until there's a successful match. An alternation wraps its underlying patterns' capture types in an Optional and concatenates them together, first to last.
let choice = ChoiceOf {
regex0 // Regex<Substring>
regex1 // Regex<(Substring, Int)>
regex2 // Regex<(Substring, Float)>
regex3 // Regex<(Substring, Substring)>
} // => Regex<(Substring, Int?, Float?, Substring?)>
AlternationBuilder is a result builder type for creating alternations from components of a block.
@resultBuilder
public struct AlternationBuilder { ... }
To the developer, the top-level API is a type named ChoiceOf. This type has an initializer that accepts an @AlternationBuilder closure.
public struct ChoiceOf<Output>: RegexComponent {
...
public init<R: RegexComponent>(
@AlternationBuilder builder: () -> R
) where R.RegexOutput == Output
}
For example, the following code creates an alternation of two subpatterns.
let regex = Regex {
ChoiceOf {
"CREDIT"
"DEBIT"
}
}
let match = try regex.prefixMatch(in: "DEBIT 04032020 Payroll $69.73")
match?.0 // => "DEBIT"
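When the alternatives contain captures, each capture type is wrapped in Optional as described earlier. A minimal sketch, with hypothetical names and inputs, in which only the captures of the branch that matched are non-nil:

```swift
import RegexBuilder

// The annotation asks the compiler to confirm the Optional-wrapped capture types.
let field: ChoiceOf<(Substring, Int?, Substring?)> = ChoiceOf {
  Regex {
    "id="
    TryCapture(OneOrMore(.digit)) { Int($0) }
  }
  Regex {
    "name="
    Capture(OneOrMore(.word))
  }
}

// Sample inputs are made up for illustration.
let idMatch = "id=7".wholeMatch(of: field)
let nameMatch = "name=swift".wholeMatch(of: field)
```

In idMatch the Int? capture is set and the Substring? capture is nil; in nameMatch it is the other way around.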
API definition
AlternationBuilder is mostly similar to RegexComponentBuilder, with the following distinctions:
- Empty blocks are not supported.
- Capture types are wrapped in a layer of Optional before being concatenated in the resulting Output type.
- buildEither(first:) and buildEither(second:) are overloaded for each supported capture arity because they need to wrap capture types in Optional.
public struct ChoiceOf<Output>: RegexComponent {
public var regex: Regex<Output> { get }
public init<R: RegexComponent>(
@AlternationBuilder builder: () -> R
) where R.RegexOutput == Output
}
@resultBuilder
public enum AlternationBuilder {
public typealias Component<Value> = RegexComponentBuilder.Component<Value>
/// Returns a component by wrapping the component regex in `Component` and
/// recording its source location.
public static func buildExpression<R: RegexComponent>(
_ regex: R,
file: String = #file,
function: String = #function,
line: Int = #line,
column: Int = #column
) -> Component<R>
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single method:
//
// public static func buildPartialBlock<
// R, WholeMatch, Capture...
// >(
// first component: Component<R>
// ) -> Regex<(Substring, Capture?...)>
// where R.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public static func buildPartialBlock<R: RegexComponent>(
first r: Component<R>
) -> Regex<Substring>
public static func buildPartialBlock<W, C0, R: RegexComponent>(
first r: Component<R>
) -> Regex<(Substring, C0?)> where R.RegexOutput == (W, C0)
public static func buildPartialBlock<W, C0, C1, R: RegexComponent>(
first r: Component<R>
) -> Regex<(Substring, C0?, C1?)> where R.RegexOutput == (W, C0, C1)
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single method:
//
// public static func buildPartialBlock<
// AccumulatedWholeMatch, NextWholeMatch,
// AccumulatedCapture..., NextCapture...,
// Accumulated: RegexComponent, Next: RegexComponent
// >(
// accumulated: Accumulated, next: Component<Next>
// ) -> Regex<(Substring, AccumulatedCapture..., NextCapture?...)>
// where Accumulated.RegexOutput == (AccumulatedWholeMatch, AccumulatedCapture...),
// Next.RegexOutput == (NextWholeMatch, NextCapture...)
public static func buildPartialBlock<W0, W1, C0, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0?)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0)
public static func buildPartialBlock<W0, W1, C0, C1, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0?, C1?)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0, C1)
public static func buildPartialBlock<W0, W1, C0, C1, C2, R0: RegexComponent, R1: RegexComponent>(
accumulated: R0, next: Component<R1>
) -> Regex<(Substring, C0?, C1?, C2?)> where R0.RegexOutput == W0, R1.RegexOutput == (W1, C0, C1, C2)
// ... `O(arity^2)` overloads of `buildPartialBlock(accumulated:next:)`
}
extension AlternationBuilder {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single method:
//
// public static func buildLimitedAvailability<
// Component, WholeMatch, Capture...
// >(
// _ component: Component
// ) -> Regex<(Substring, Capture?...)>
// where Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public static func buildLimitedAvailability<R: RegexComponent>(
_ component: Component<R>
) -> Regex<Substring>
public static func buildLimitedAvailability<W, C0, R: RegexComponent>(
_ component: Component<R>
) -> Regex<(Substring, C0?)> where R.RegexOutput == (W, C0)
public static func buildLimitedAvailability<W, C0, C1, R: RegexComponent>(
_ component: Component<R>
) -> Regex<(Substring, C0?, C1?)> where R.RegexOutput == (W, C0, C1)
// ... `O(arity)` overloads of `buildLimitedAvailability(_:)`
public static func buildLimitedAvailability<W, C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, R: RegexComponent>(
_ component: Component<R>
) -> Regex<(Substring, C0?, C1?, C2?, C3?, C4?, C5?, C6?, C7?, C8?, C9?)> where R.RegexOutput == (W, C0, C1, C2, C3, C4, C5, C6, C7, C8, C9)
}
One of the most useful features of regex is repetition, also known as quantification, as it allows you to match a subpattern a specific number or range of times. Regex builder provides 5 repetition components: One, OneOrMore, ZeroOrMore, Optionally, and Repeat.
public struct One<Output>: RegexComponent { ... }
public struct OneOrMore<Output>: RegexComponent { ... }
public struct ZeroOrMore<Output>: RegexComponent { ... }
public struct Optionally<Output>: RegexComponent { ... }
public struct Repeat<Output>: RegexComponent { ... }
Repetition in regex builder | Textual regex equivalent |
---|---|
One(...) | ... |
OneOrMore(...) | ...+ |
ZeroOrMore(...) | ...* |
Optionally(...) | ...? |
Repeat(..., count: n) | ...{n} |
Repeat(..., n...) | ...{n,} |
Repeat(..., n...m) | ...{n,m} |
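The Repeat rows of the table might be exercised as in the sketch below. The constant names and the textual equivalents in the comments are chosen for this illustration.

```swift
import RegexBuilder

// Hypothetical names; each comment gives the rough textual equivalent.
let exactlyFour = Repeat(.digit, count: 4)  // \d{4}
let atLeastTwo = Repeat(.digit, 2...)       // \d{2,}
let threeToFive = Repeat(.digit, 3...5)     // \d{3,5}

// wholeMatch(of:) succeeds only if the entire input satisfies the bounds.
let isYear = "2024".wholeMatch(of: exactlyFour) != nil
let isTooShort = "7".wholeMatch(of: atLeastTwo) == nil
let isTooLong = "123456".wholeMatch(of: threeToFive) == nil
```

The range-based forms accept any range expression over Int, so half-open and one-sided ranges read the same way they do elsewhere in Swift.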
One, OneOrMore, and count-based Repeat are quantifiers that produce a new regex with the original capture types. Their Output type is Substring followed by the component's capture types. ZeroOrMore, Optionally, and range-based Repeat are quantifiers that produce a new regex with optional capture types. Their Output type is Substring followed by the component's capture types wrapped in Optional.
Quantifier | Component Output | Result Output |
---|---|---|
One, OneOrMore, Repeat(..., count: ...) | (WholeMatch, Capture...) | (Substring, Capture...) |
One, OneOrMore, Repeat(..., count: ...) | WholeMatch (non-tuple) | Substring |
ZeroOrMore, Optionally, Repeat(..., n...m) | (WholeMatch, Capture...) | (Substring, Capture?...) |
ZeroOrMore, Optionally, Repeat(..., n...m) | WholeMatch (non-tuple) | Substring |
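The difference between the two groups of quantifiers can be seen in a short sketch where the same capturing component is wrapped once in OneOrMore and once in Optionally. The names suffix, requiredSuffix, and optionalSuffix and the inputs are hypothetical; the explicit annotations let the compiler confirm the output types.

```swift
import RegexBuilder

let suffix = Regex {
  "-"
  Capture(OneOrMore(.word))
} // => Regex<(Substring, Substring)>

// OneOrMore keeps the capture type as is.
let requiredSuffix: Regex<(Substring, Substring)> = Regex {
  OneOrMore(.digit)
  OneOrMore { suffix }
}

// Optionally wraps the capture type in Optional.
let optionalSuffix: Regex<(Substring, Substring?)> = Regex {
  OneOrMore(.digit)
  Optionally { suffix }
}

// Sample inputs are made up for illustration.
let bareVersion = "12".wholeMatch(of: optionalSuffix)
let taggedVersion = "12-beta".wholeMatch(of: optionalSuffix)
```

With optionalSuffix, an input without the suffix still matches and reports the capture as nil, whereas requiredSuffix rejects that input outright.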
API definition
public struct One<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct OneOrMore<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct ZeroOrMore<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct Optionally<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
public struct Repeat<Output>: RegexComponent {
public var regex: Regex<Output> { get }
}
Due to the lack of variadic generics, initializers must be overloaded for every supported capture arity.
extension One {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ component: Component,
// _ behavior: RegexRepetitionBehavior = .eager
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ behavior: RegexRepetitionBehavior = .eager,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == (Substring, C0), Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0), Component.RegexOutput == (W, C0)
// ... `O(arity)` overloads
}
extension OneOrMore {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ component: Component,
// _ behavior: RegexRepetitionBehavior? = nil
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ behavior: RegexRepetitionBehavior? = nil,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == (Substring, C0), Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0), Component.RegexOutput == (W, C0)
// ... `O(arity)` overloads
}
extension ZeroOrMore {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ component: Component,
// _ behavior: RegexRepetitionBehavior? = nil
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ behavior: RegexRepetitionBehavior? = nil,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == (Substring, C0?), Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0?), Component.RegexOutput == (W, C0)
// ... `O(arity)` overloads
}
extension Optionally {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ component: Component,
// _ behavior: RegexRepetitionBehavior? = nil
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ behavior: RegexRepetitionBehavior? = nil,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
_ component: Component,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == (Substring, C0?), Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent>(
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0?), Component.RegexOutput == (W, C0)
// ... `O(arity)` overloads
}
extension Repeat {
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// _ component: Component,
// count: Int,
// _ behavior: RegexRepetitionBehavior? = nil
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture...
// >(
// count: Int,
// _ behavior: RegexRepetitionBehavior? = nil,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture..., RE: RangeExpression
// >(
// _ component: Component,
// _ expression: RE,
// _ behavior: RegexRepetitionBehavior? = nil
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
//
// public init<
// Component: RegexComponent, WholeMatch, Capture..., RE: RangeExpression
// >(
// _ expression: RE,
// _ behavior: RegexRepetitionBehavior? = nil,
// @RegexComponentBuilder _ component: () -> Component
// )
// where Output == (Substring, Capture?...),
// Component.RegexOutput == (WholeMatch, Capture...)
// Nullary
@_disfavoredOverload
public init<Component: RegexComponent>(
_ component: Component,
count: Int,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent>(
count: Int,
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
@_disfavoredOverload
public init<Component: RegexComponent, RE: RangeExpression>(
_ component: Component,
_ expression: RE,
_ behavior: RegexRepetitionBehavior? = nil
) where Output == Substring, RE.Bound == Int
@_disfavoredOverload
public init<Component: RegexComponent, RE: RangeExpression>(
_ expression: RE,
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring, RE.Bound == Int
// Unary
public init<W, C0, Component: RegexComponent>(
_ component: Component,
count: Int,
_ behavior: RegexRepetitionBehavior? = nil
)
where Output == (Substring, C0),
Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent>(
count: Int,
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
)
where Output == (Substring, C0),
Component.RegexOutput == (W, C0)
public init<W, C0, Component: RegexComponent, RE: RangeExpression>(
_ component: Component,
_ expression: RE,
_ behavior: RegexRepetitionBehavior? = nil
)
where Output == (Substring, C0?),
Component.RegexOutput == (W, C0),
RE.Bound == Int
public init<W, C0, Component: RegexComponent, RE: RangeExpression>(
_ expression: RE,
_ behavior: RegexRepetitionBehavior? = nil,
@RegexComponentBuilder _ component: () -> Component
)
where Output == (Substring, C0?),
Component.RegexOutput == (W, C0),
RE.Bound == Int
// ... `O(arity)` overloads
}
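To make the capture-type rules of these overloads concrete, here is a minimal sketch. The variable names are illustrative and not part of the proposal, and the explicit type annotations simply restate the Output types listed above: OneOrMore leaves a capture's type unchanged, while Optionally (like ZeroOrMore) wraps it in Optional because the component may match nothing.

```swift
import RegexBuilder

// OneOrMore keeps the capture type as-is: Output == (Substring, C0).
let oneOrMore: Regex<(Substring, Substring)> = Regex {
    OneOrMore {
        Capture(OneOrMore(.digit))
        ";"
    }
}

// Optionally may match nothing, so the capture becomes optional:
// Output == (Substring, C0?).
let optionally: Regex<(Substring, Substring?)> = Regex {
    "v"
    Optionally {
        Capture(OneOrMore(.digit))
    }
}

if let match = "v12".wholeMatch(of: optionally) {
    // The capture is present here, so this prints the digits.
    print(match.output.1 ?? "no version number")
}
if let match = "v".wholeMatch(of: optionally) {
    // The regex still matches, but the optional capture is nil.
    print(match.output.1 ?? "no version number")
}
```

The annotations are not required; they are written out only so that the compiler checks the inferred types against the ones described in this section.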
Repetition behavior defines how eagerly a repetition component should match the input. Behavior can be unspecified, in which case it will default to .eager unless an option is provided to change the default (see Unicode for String Processing).
/// Specifies how much to attempt to match when using a quantifier.
public struct RegexRepetitionBehavior {
/// Match as much of the input string as possible, backtracking when
/// necessary.
public static var eager: RegexRepetitionBehavior { get }
/// Match as little of the input string as possible, expanding the matched
/// region as necessary to complete a match.
public static var reluctant: RegexRepetitionBehavior { get }
/// Match as much of the input string as possible, performing no backtracking.
public static var possessive: RegexRepetitionBehavior { get }
}
Repetition behavior in regex builder | Textual regex equivalent |
---|---|
.eager | no suffix |
.reluctant | suffix ? |
.possessive | suffix + |
To demonstrate how each repetition behavior works, let's look at the following example. Suppose we want a regex that captures the name of an HTML tag, e.g. code in <code>. We might start with something like the following:
let tag = Reference(Substring.self)
let htmlRegex = Regex {
"<"
Capture(as: tag) {
// Remember, the default behavior is .eager here!
OneOrMore(.any)
}
">"
}
let input = #"<code>print("hello world!")</code>"#
if let result = try htmlRegex.firstMatch(in: input) {
print(result[tag])
}
The code above prints code>print("hello world!")</code, which is unexpected. This is because OneOrMore(.any) has eager behavior by default, so it matched as many characters as possible.
If we change OneOrMore(.any) to OneOrMore(.any, .possessive), matching fails. In this case the regex found our starting "<", but the repetition component OneOrMore(.any, .possessive) ran all the way to the end of the string (because we're asking for any character). After reaching the end, there were no characters left to match the closing ">". This is the intended behavior for .possessive, because it never backtracks to give characters back for the closing ">".
The desired behavior in this case is .reluctant, where the repetition matches as little of the input string as possible. If we use OneOrMore(.any, .reluctant), the code prints the expected output, code.
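The three behaviors can be compared side by side with a small sketch. The helper tagName(in:_:) is a hypothetical function introduced only for this illustration; it rebuilds the regex above with the requested behavior and returns the capture, or nil when nothing matches.

```swift
import RegexBuilder

// Hypothetical helper (not part of the proposal): builds the tag regex
// with the given repetition behavior and returns the captured text.
func tagName(in input: String, _ behavior: RegexRepetitionBehavior) -> Substring? {
    let regex = Regex {
        "<"
        Capture {
            OneOrMore(.any, behavior)
        }
        ">"
    }
    return input.firstMatch(of: regex)?.output.1
}

let input = #"<code>print("hello world!")</code>"#

// Eager: runs to the end, then backtracks to the last ">".
print(tagName(in: input, .eager) ?? "no match")
// Reluctant: stops at the first ">" it can reach.
print(tagName(in: input, .reluctant) ?? "no match")
// Possessive: consumes everything and never gives characters back.
print(tagName(in: input, .possessive) ?? "no match")
```

Under the semantics described above, only the reluctant variant yields the tag name; the possessive variant is expected to report no match at all.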
Anchors are a way to constrain a regex, or part of a regex, to matching particular locations within an input string. Regex builder provides anchors that correspond to regex syntax anchors. Regex builder also provides two types that represent look-ahead assertions — essentially a non-consuming sub-regex that has to match (or not match) before the regex can proceed.
/// A regex component that matches a specific condition at a particular position
/// in an input string.
///
/// You can use anchors to guarantee that a match only occurs at certain points
/// in an input string, such as at the beginning of the string or at the end of
/// a line.
public struct Anchor: RegexComponent {
/// An anchor that matches at the start of a line, including the start of
/// the input string.
///
/// This anchor is equivalent to `^` in regex syntax when the `m` option
/// has been enabled or `anchorsMatchLineEndings(true)` has been called.
public static var startOfLine: Anchor { get }
/// An anchor that matches at the end of a line, including at the end of
/// the input string.
///
/// This anchor is equivalent to `$` in regex syntax when the `m` option
/// has been enabled or `anchorsMatchLineEndings(true)` has been called.
public static var endOfLine: Anchor { get }
/// An anchor that matches at a word boundary.
///
/// Word boundaries are identified using the Unicode default word boundary
/// algorithm by default. To specify a different word boundary algorithm,
/// see the `RegexComponent.wordBoundaryKind(_:)` method.
///
/// This anchor is equivalent to `\b` in regex syntax.
public static var wordBoundary: Anchor { get }
/// An anchor that matches at the start of the input string.
///
/// This anchor is equivalent to `\A` in regex syntax.
public static var startOfSubject: Anchor { get }
/// An anchor that matches at the end of the input string.
///
/// This anchor is equivalent to `\z` in regex syntax.
public static var endOfSubject: Anchor { get }
/// An anchor that matches at the end of the input string or at the end of
/// the line immediately before the end of the string.
///
/// This anchor is equivalent to `\Z` in regex syntax.
public static var endOfSubjectBeforeNewline: Anchor { get }
/// An anchor that matches at a grapheme cluster boundary.
///
/// This anchor is equivalent to `\y` in regex syntax.
public static var textSegmentBoundary: Anchor { get }
/// An anchor that matches at the first position of a match in the input
/// string.
///
/// This anchor is equivalent to `\G` in regex syntax.
public static var firstMatchingPositionInSubject: Anchor { get }
/// The inverse of this anchor, which matches at every position that this
/// anchor does not.
///
/// For the `wordBoundary` and `textSegmentBoundary` anchors, the inverted
/// version corresponds to `\B` and `\Y`, respectively.
public var inverted: Anchor { get }
}
/// A regex component that allows a match to continue only if its contents
/// match at the given location.
///
/// A lookahead is a zero-length assertion that its included regex matches at
/// a particular position. Lookaheads do not advance the overall matching
/// position in the input string — once a lookahead succeeds, matching continues
/// in the regex from the same position.
public struct Lookahead: RegexComponent {
/// Creates a lookahead from the given regex component.
public init(_ component: some RegexComponent)
/// Creates a lookahead from the regex generated by the given builder closure.
public init(@RegexComponentBuilder _ component: () -> some RegexComponent)
}
/// A regex component that allows a match to continue only if its contents
/// do not match at the given location.
///
/// A negative lookahead is a zero-length assertion that its included regex
/// does not match at a particular position. Lookaheads do not advance the
/// overall matching position in the input string — once a lookahead succeeds,
/// matching continues in the regex from the same position.
public struct NegativeLookahead: RegexComponent {
/// Creates a negative lookahead from the given regex component.
public init(_ component: some RegexComponent)
/// Creates a negative lookahead from the regex generated by the given builder
/// closure.
public init(@RegexComponentBuilder _ component: () -> some RegexComponent)
}
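As a rough illustration of how these components read in practice, the following sketch combines an anchor and both kinds of lookahead. The regex names and input strings are made up for this example.

```swift
import RegexBuilder

// Matches "cat" only when it stands alone as a word.
let wholeWordCat = Regex {
    Anchor.wordBoundary
    "cat"
    Anchor.wordBoundary
}

// Captures a key only when a colon follows. The lookahead checks for ":"
// without consuming it, so the colon is not part of the match.
let key = Regex {
    Capture(OneOrMore(.word))
    Lookahead(":")
}

// Matches "foo" unless "bar" comes immediately after it.
let fooNotBar = Regex {
    "foo"
    NegativeLookahead { "bar" }
}

print("concat".firstMatch(of: wholeWordCat) != nil)      // false
print("concat cat".firstMatch(of: wholeWordCat) != nil)  // true
print("name: Swift".firstMatch(of: key)?.output.1 ?? "no match")  // name
print("foobar".firstMatch(of: fooNotBar) != nil)         // false
print("foobaz".firstMatch(of: fooNotBar) != nil)         // true
```

Note that a lookahead placed after an eager quantifier still allows the quantifier to backtrack, so a pattern such as "digits not followed by px" needs extra care (for example an additional boundary anchor) to avoid matching a shorter run of digits.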
In textual regex, one can refer to a subpattern to avoid duplicating the subpattern, for example:
(you|I) say (goodbye|hello); (?1) say (?2)
The above regex is equivalent to
(you|I) say (goodbye|hello); (you|I) say (goodbye|hello)
With regex builder, there is no special API required to reuse existing subpatterns, as a subpattern can be defined modularly using a let binding inside or outside a regex builder closure.
Regex {
let subject = ChoiceOf {
"I"
"you"
}
let object = ChoiceOf {
"goodbye"
"hello"
}
subject
"say"
object
";"
subject
"say"
object
}
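A runnable variant of the same idea is sketched below, this time with the let bindings declared outside the builder closure and with the spaces of the textual regex written out as literals. The names subject, object, and sentence are illustrative.

```swift
import RegexBuilder

// Subpatterns defined once, outside the builder closure.
let subject = ChoiceOf {
    "you"
    "I"
}
let object = ChoiceOf {
    "goodbye"
    "hello"
}

// Each subpattern is reused by name instead of via (?1) and (?2).
let sentence = Regex {
    subject
    " say "
    object
    "; "
    subject
    " say "
    object
}

print("you say goodbye; I say hello".wholeMatch(of: sentence) != nil)  // true
print("you say goodbye; we say hello".wholeMatch(of: sentence) != nil) // false
```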
Because the regex engine backtracks by default when matching a string, it can waste work exploring alternatives that we already know should not be tried.
In textual regexes, atomic groups ((?>...)) solve this problem by telling the regex engine to discard the backtracking positions recorded inside a group once the group has matched; in other words, they define a scope for backtracking. In regex builder, the Local type serves this purpose.
public struct Local<Output>: RegexComponent { ... }
For example, the following regex matches the string abcc but not abc.
Regex {
"a"
Local {
ChoiceOf {
"bc"
"b"
}
}
"c"
}
If our input is abcc, we'll successfully find a match; however, if we try to match against abc, we won't. The reason is that the ChoiceOf matches the "bc" case first, and because of the local group the engine immediately discards the backtracking location and continues with the rest of the regex. Since "bc" has already been consumed, there is no string left to match the final "c", and the local group will not go back and try the other option, "b".
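The difference is easiest to see next to the same pattern without Local. The sketch below is illustrative; the names atomic and backtracking are not part of the proposal.

```swift
import RegexBuilder

// Atomic: once Local has matched "bc", the "b" alternative is discarded.
let atomic = Regex {
    "a"
    Local {
        ChoiceOf {
            "bc"
            "b"
        }
    }
    "c"
}

// Same pattern without Local: the engine may return to the choice
// and try "b" after "bc" leads to a dead end.
let backtracking = Regex {
    "a"
    ChoiceOf {
        "bc"
        "b"
    }
    "c"
}

print("abcc".wholeMatch(of: atomic) != nil)        // true
print("abc".wholeMatch(of: atomic) != nil)         // false
print("abc".wholeMatch(of: backtracking) != nil)   // true
```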
API definition
public struct Local<Output>: RegexComponent {
public var regex: Regex<Output>
// The following builder methods implement what would be possible with
// variadic generics (using imaginary syntax) as a single set of methods:
//
// public init<WholeMatch, Capture..., Component: RegexComponent>(
// @RegexComponentBuilder _ component: () -> Component
// ) where Output == (Substring, Capture...), Component.RegexOutput == (WholeMatch, Capture...)
@_disfavoredOverload
public init<Component: RegexComponent>(
@RegexComponentBuilder _ component: () -> Component
) where Output == Substring
public init<W, C0, Component: RegexComponent>(
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0), Component.RegexOutput == (W, C0)
public init<W, C0, C1, Component: RegexComponent>(
@RegexComponentBuilder _ component: () -> Component
) where Output == (Substring, C0, C1), Component.RegexOutput == (W, C0, C1)
// ... `O(arity)` overloads
}
Let's put everything together now and parse this example bank statement.
CREDIT 04062020 PayPal transfer $4.99
CREDIT 04032020 Payroll $69.73
DEBIT 04022020 ACH transfer $38.25
DEBIT 03242020 IRS tax payment $52249.98
Here we have two kinds of transaction, CREDIT and DEBIT, followed by a date in mmddyyyy format, a description, and the amount paid.
enum TransactionKind: String {
case credit = "CREDIT"
case debit = "DEBIT"
}
struct Date {
var month: Int
var day: Int
var year: Int
init?(mmddyyyy: String) {
...
}
}
let statementRegex = Regex {
// First, let's capture the transaction kind by wrapping our `ChoiceOf` in a
// `TryCapture` because our initializer can return nil on failure.
TryCapture {
ChoiceOf {
"CREDIT"
"DEBIT"
}
} transform: {
TransactionKind(rawValue: String($0))
}
OneOrMore(.whitespace)
// Next, let's represent our date as 3 separate repeat quantifiers. The first
// two will require 2 digit characters, and the last will require 4. Then
// we'll take the entire substring and try to parse a date out.
TryCapture {
Repeat(.digit, count: 2)
Repeat(.digit, count: 2)
Repeat(.digit, count: 4)
} transform: {
Date(mmddyyyy: String($0))
}
OneOrMore(.whitespace)
// Next, grab the description which can be any combination of word characters,
// digits, etc.
Capture {
OneOrMore(.any, .reluctant)
}
OneOrMore(.whitespace)
"$"
// Finally, we'll grab one or more digits which will represent the whole
// dollars, match the decimal point, and finally get 2 digits which will be
// our cents.
TryCapture {
OneOrMore(.digit)
"."
Repeat(.digit, count: 2)
} transform: {
Double($0)
}
}
for match in statement.matches(of: statementRegex) {
let (line, kind, date, description, amount) = match.output
...
}
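For readers who want to run the example end to end, here is a condensed sketch under stated assumptions. Kind and SimpleDate are hypothetical stand-ins introduced here (the proposal leaves the body of Date(mmddyyyy:) unspecified), the date is matched as a single run of eight digits for brevity, and the statement text is the sample from above.

```swift
import RegexBuilder

enum Kind: String { case credit = "CREDIT", debit = "DEBIT" }

// Hypothetical stand-in for the elided Date(mmddyyyy:) initializer.
struct SimpleDate {
    var month: Int
    var day: Int
    var year: Int

    init?(mmddyyyy text: Substring) {
        guard text.count == 8,
              let month = Int(text.prefix(2)),
              let day = Int(text.dropFirst(2).prefix(2)),
              let year = Int(text.suffix(4)),
              (1...12).contains(month), (1...31).contains(day)
        else { return nil }
        self.month = month
        self.day = day
        self.year = year
    }
}

let sampleRegex = Regex {
    TryCapture {
        ChoiceOf {
            "CREDIT"
            "DEBIT"
        }
    } transform: {
        Kind(rawValue: String($0))
    }
    OneOrMore(.whitespace)
    TryCapture {
        Repeat(.digit, count: 8)
    } transform: {
        SimpleDate(mmddyyyy: $0)
    }
    OneOrMore(.whitespace)
    Capture {
        OneOrMore(.any, .reluctant)
    }
    OneOrMore(.whitespace)
    "$"
    TryCapture {
        OneOrMore(.digit)
        "."
        Repeat(.digit, count: 2)
    } transform: {
        Double($0)
    }
}

let statement = """
CREDIT    04062020    PayPal transfer    $4.99
CREDIT    04032020    Payroll            $69.73
DEBIT     04022020    ACH transfer       $38.25
DEBIT     03242020    IRS tax payment    $52249.98
"""

for match in statement.matches(of: sampleRegex) {
    let (_, kind, date, description, amount) = match.output
    print(kind, date.month, date.day, date.year, description, amount)
}
```

Because every transform returns a strongly typed value, the loop body works with Kind, SimpleDate, Substring, and Double directly rather than re-parsing substrings.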
Regex builder will be shipped in a new module named RegexBuilder, and thus will not affect the source compatibility of existing code.
The proposed feature does not change the ABI of existing features.
The proposed feature relies heavily upon overloads of buildBlock and buildPartialBlock(accumulated:next:) to work for different capture arities. In the fullness of time, we are hoping for variadic generics to supersede existing overloads. Such a change should not involve ABI-breaking modifications as it is merely a change of overload resolution.
Sometimes it may be useful to convert a regex created using regex builder to textual regex. This may be achieved in the future by extending RegexComponent with a method.
extension RegexComponent {
public func makeTextualRegex() -> String?
}
It is worth noting that the internal representation of a Regex is not textual regex, but an efficient pattern matching bytecode compiled from an abstract syntax tree. Moreover, not every Regex can be converted to textual regex. Regex builder supports arbitrary types that conform to the RegexComponent protocol, including CustomMatchingRegexComponent (pitched in String Processing Algorithms) which can be implemented with arbitrary code. If a Regex contains a CustomMatchingRegexComponent, it cannot be converted to textual regex.
Sometimes, a textual regex may also use (?R) or (?0) to recursively evaluate the entire regex. For example, the following textual regex matches "I say you say I say you say hello".
(you|I) say (goodbye|hello|(?R))
For this, Regex offers a special initializer that allows its pattern to recursively reference itself. This is somewhat akin to a fixed-point combinator.
extension Regex {
public init<R: RegexComponent>(
@RegexComponentBuilder _ content: (Regex<Substring>) -> R
) where R.RegexOutput == Output
}
With this initializer, the above regex can be expressed as the following using regex builder.
Regex { wholeSentence in
ChoiceOf {
"I"
"you"
}
"say"
ChoiceOf {
"goodbye"
"hello"
wholeSentence
}
}
There are some concerns with this design which we need to consider:
- Due to the lack of labeling, the argument to the builder closure can be arbitrarily named and cause confusion.
- When there is an initializer that accepts a result builder closure, overloading that initializer with the same argument labels could lead to bad error messages upon interior type errors.
In the DSL syntax as described in the first version of this proposal, there was a problem with the use of leading-dot syntax for character classes and other "atoms" and the builder syntax:
Regex {
.digit
OneOrMore(.whitespace)
}
worked as expected, but:
Regex {
OneOrMore(.whitespace)
.digit
}
did not, because .digit parses as a property on OneOrMore rather than a regex component. This could have been resolved by making people use either semicolons:
Regex {
OneOrMore(.whitespace);
.digit
}
or parentheses:
Regex {
OneOrMore(.whitespace)
(.digit)
}
Instead we decided to introduce the quantifier One to resolve the ambiguity:
Regex {
OneOrMore(.whitespace)
One(.digit)
}
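A minimal runnable version of this resolution is sketched below; the name spacedDigit and the sample inputs are illustrative.

```swift
import RegexBuilder

let spacedDigit = Regex {
    OneOrMore(.whitespace)
    // Written as a bare `.digit`, this line would be parsed as a member
    // access on the OneOrMore above; wrapping it in One avoids that.
    One(.digit)
}

print("   7".wholeMatch(of: spacedDigit) != nil)  // true
print("7".wholeMatch(of: spacedDigit) != nil)     // false
```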
This increases the API surface, which is mildly undesirable, but feels much more stylistically consistent with the rest of the DSL and with Swift as a whole. We also considered a "two protocol" approach that would force the use of One in these cases by making it impossible to use the dot-prefixed "atoms" within builder blocks, but this seems like too much heavy machinery to resolve the problem.
While ChoiceOf and quantifier types provide a general way of creating alternations and quantifications, we recognize that some syntactic sugar can be useful for creating one-liners like in textual regexes, e.g. infix operator |, postfix operator *, etc.
// The following functions implement what would be possible with variadic
// generics (using imaginary syntax) as a single function:
//
// public func | <
// R0: RegexComponent, R1: RegexComponent,
// WholeMatch0, WholeMatch1,
// Capture0..., Capture1...
// >(
// _ r0: R0,
// _ r1: R1
// ) -> Regex<(Substring, Capture0?..., Capture1?...)>
// where R0.RegexOutput == (WholeMatch0, Capture0...),
// R1.RegexOutput == (WholeMatch1, Capture1...)
@_disfavoredOverload
public func | <R0, R1>(lhs: R0, rhs: R1) -> Regex<Substring> where R0: RegexComponent, R1: RegexComponent
public func | <R0, R1, W1, C0>(lhs: R0, rhs: R1) -> Regex<(Substring, C0?)> where R0: RegexComponent, R1: RegexComponent, R1.RegexOutput == (W1, C0)
public func | <R0, R1, W1, C0, C1>(lhs: R0, rhs: R1) -> Regex<(Substring, C0?, C1?)> where R0: RegexComponent, R1: RegexComponent, R1.RegexOutput == (W1, C0, C1)
// ... `O(arity^2)` overloads.
However, like RegexComponentBuilder.buildPartialBlock(accumulated:next:), operators such as |, +, *, .? require a large number of overloads to work with regexes of every capture arity, compounded by the fact that operator type checking is prone to performance issues in Swift. Here is a list of such operators and the number of overloads each would require:
Operator | Meaning | Required number of overloads |
---|---|---|
Infix \| | Choice of two | O(arity^2) |
Postfix * | Zero or more eagerly | O(arity) |
Postfix *? | Zero or more reluctantly | O(arity) |
Postfix *+ | Zero or more possessively | O(arity) |
Postfix + | One or more eagerly | O(arity) |
Postfix +? | One or more reluctantly | O(arity) |
Postfix ++ | One or more possessively | O(arity) |
Postfix .? | Optionally eagerly | O(arity) |
Postfix .?? | Optionally reluctantly | O(arity) |
Postfix .?+ | Optionally possessively | O(arity) |
When variadic generics are supported in the future, we may be able to define one function per operator and reduce type checking burdens.
An earlier iteration of regex builder declared capture and tryCapture as methods on RegexComponent, meaning that you can append .capture(...) to any subpattern within a regex to capture it. For example:
Regex {
OneOrMore {
r0.capture()
r1
}.capture()
} // => Regex<(Substring, Substring, Substring)>
However, there are two shortcomings of this design:
- When a subpattern to be captured contains multiple components, the developer has to explicitly group them using a Regex { ... } block.
let emailPattern = Regex {
let word = OneOrMore(.word)
Regex { // <= Had to explicitly group multiple components
ZeroOrMore {
word
"."
}
word
}.capture()
"@"
Regex {
word
OneOrMore {
"."
word
}
}.capture()
} // => Regex<(Substring, Substring, Substring)>
- When there are nested captures, it is harder to number the captures visually because the order in which capture() appears is flipped in the postfix (method) notation.
let emailSuffixPattern = Regex {
"@"
Regex {
word
OneOrMore {
"."
word.capture() // top-level domain (.0)
}
}.capture() // full domain (.1)
} // => Regex<(Substring, Substring, Substring)>
//
//            full domain ^~~~~~~~~
//                  top-level domain ^~~~~~~~~
In comparison, prefix notation (Capture and TryCapture as types) makes it easier to keep track of captures visually, as you can number captures in the order they appear from top to bottom. This is consistent with textual regexes, where capturing groups are numbered by the left parenthesis of the group from left to right.
let emailSuffixPattern = Regex {
Capture { // full domain (.0)
word
OneOrMore {
"."
Capture(word) // top-level domain (.1)
}
}
} // => Regex<(Substring, Substring, Substring)>
//
//            full domain ^~~~~~~~~
//                  top-level domain ^~~~~~~~~
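The top-to-bottom numbering can be checked with a small runnable sketch. It is a simplified variant of the pattern above with no quantifier around the inner capture, and the names word and emailSuffix are illustrative; the type annotation restates the flat capture type.

```swift
import RegexBuilder

let word = OneOrMore(.word)

let emailSuffix: Regex<(Substring, Substring, Substring)> = Regex {
    "@"
    Capture {             // capture 1: full domain (opens first)
        word
        "."
        Capture(word)     // capture 2: top-level domain (opens second)
    }
}

if let match = "@swift.org".wholeMatch(of: emailSuffix) {
    let (_, fullDomain, topLevelDomain) = match.output
    print(fullDomain)       // swift.org
    print(topLevelDomain)   // org
}
```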
Since Repeat is the most general version of quantifiers, one could argue for all quantifiers to be unified under the type Repeat, for example:
Repeat(oneOrMore: r)
Repeat(zeroOrMore: r)
Repeat(optionally: r)
However, given that one-or-more (+), zero-or-more (*) and optional (?) are the most common quantifiers in textual regexes, we believe that these quantifiers deserve their own type and should be written as a single word instead of two. This can also reduce visual clutter when the quantification is used in multiple places of a regex.
One could argue that a type such as OneOrMore<Output> could instead be defined as a top-level function that returns Regex. While it is entirely possible to do so, it would lose the name scoping benefits of a type and pollute the top-level namespace with O(arity^2) overloads of quantifiers, capture, tryCapture, etc. This could be detrimental to the usefulness of code completion.
Another reason to use types instead of free functions is consistency with existing result-builder-based DSLs such as SwiftUI.
To support if statements, an earlier iteration of this proposal defined buildEither(first:), buildEither(second:) and buildOptional(_:) as follows:
extension RegexComponentBuilder {
public static func buildEither<
Component, WholeMatch, Capture...
>(
first component: Component
) -> Regex<(Substring, Capture...)>
where Component.RegexOutput == (WholeMatch, Capture...)
public static func buildEither<
Component, WholeMatch, Capture...
>(
second component: Component
) -> Regex<(Substring, Capture...)>
where Component.RegexOutput == (WholeMatch, Capture...)
public static func buildOptional<
Component, WholeMatch, Capture...
>(
_ component: Component?
) where Component.RegexOutput == (WholeMatch, Capture...)
}
However, multiple-branch control flow statements (e.g. if-else and switch) would be required to produce either the same regex type, which is limiting, or an "either-like" type, which can be difficult to work with when nested. Unlike ChoiceOf, producing a tuple of optionals is not an option, because the branch taken would be decided when the builder closure is executed, and it would cause capture numbering to be inconsistent with conventional regex.
Moreover, result builder conditionals do not work the same way as regex conditionals. In regex conditionals, the conditions are themselves regexes and are evaluated by the regex engine during matching, whereas result builder conditionals are evaluated as part of the builder closure. We hope that a future result builder feature will support "lifting" control flow conditions into the DSL domain, e.g. supporting Regex<Bool> as a condition.
With the proposed design, ChoiceOf with AlternationBuilder wraps every component's capture type with an Optional. This means that any ChoiceOf with optional-capturing components would lead to doubly-nested optional captures. This could make the result of matching harder to use.
ChoiceOf {
OneOrMore(Capture(.digit)) // RegexOutput == (Substring, Substring)
Optionally {
ZeroOrMore(Capture(.word)) // RegexOutput == (Substring, Substring?)
"a"
} // RegexOutput == (Substring, Substring??)
} // RegexOutput == (Substring, Substring?, Substring???)
One way to improve this could be overloading quantifier initializers (e.g. ZeroOrMore.init(_:)) and AlternationBuilder.buildPartialBlock to flatten any optionals upon composition. However, this would be non-trivial. Quantifier initializers would need to be overloaded O(2^arity) times to account for all possible positions of Optional that may appear in the Output tuple. Even worse, AlternationBuilder.buildPartialBlock would need to be overloaded O(arity!) times to account for all possible combinations of two Output tuples with all possible positions of Optional that may appear in one of the Output tuples.
We propose inferring capture types in such a way as to align with the traditional numbering of backreferences. This is because much of the motivation behind providing regex in Swift is their familiarity.
If we decided to deprioritize this motivation, there are opportunities to infer safer, more ergonomic, and arguably more intuitive types for captures. For example, to be consistent with traditional regex backreferences, quantifications of multiple or nested captures had to produce parallel arrays rather than an array of tuples.
OneOrMore {
Capture {
OneOrMore(.hexDigit)
}
".."
Capture {
OneOrMore(.hexDigit)
}
}
// Flat capture types:
// => `RegexOutput == (Substring, Substring, Substring)`
// Structured capture types:
// => `RegexOutput == (Substring, (Substring, Substring))`
Similarly, an alternation of multiple or nested captures could produce a structured alternation type (or an anonymous sum type) rather than flat optionals.
This is cool, but it adds extra complexity to regex builder and it isn't as clear because the generic type no longer aligns with the traditional regex backreference numbering. We think the consistency of the flat capture types trumps the added safety and ergonomics of the structured capture types.
The primary difference between Capture and TryCapture at the API level is that TryCapture's transform closure returns an Optional of the target type, whereas Capture's transform closure returns the target type. TryCapture would cause the regex engine to backtrack when the transform closure returns nil, whereas Capture does not backtrack.
It has been argued in the review thread that the distinction between Capture and TryCapture need not be reflected at the type name level, but could be differentiated by argument label, e.g. transform:/tryTransform: or map:/compactMap:. However, doing so may cause ambiguity in cases where the transform closure is not the second, but the first, trailing closure in the initializer.
extension Capture {
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
map: @escaping (Substring) throws -> NewCapture
) where Output == (Substring, NewCapture), R.RegexOutput == W
public init<R: RegexComponent, W, NewCapture>(
_ component: R,
compactMap: @escaping (Substring) throws -> NewCapture?
) where Output == (Substring, NewCapture), R.RegexOutput == W
}
In this case, since the argument label will not be specified for the first trailing closure, using Capture where the component is a non-builder-closure may cause type-checking ambiguity.
Regex {
Capture(OneOrMore(.digit)) {
Int($0)
} // Which output type, `(Substring, Int?)` (via `map:`) or `(Substring, Int)` (via `compactMap:`)?
}
Spelling out TryCapture also has the benefit of clarity, as it makes clear that a capture's transform closure can cause the regex engine to backtrack. Since backtracking can be expensive, one could choose to throw errors instead and use a normal Capture.
Regex {
Capture(OneOrMore(.digit)) {
guard let number = Int($0) else {
throw MyCustomParsingError.invalidNumber($0)
}
return number
}
}
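The two strategies can be contrasted in a short sketch. InvalidNumber is a hypothetical error type standing in for MyCustomParsingError, and the names lenient, strict, and tooLarge are illustrative. The sketch assumes, as described above, that a nil result from TryCapture is treated as a failed match, while an error thrown from Capture's transform is surfaced by the throwing Regex matching methods.

```swift
import RegexBuilder

// Hypothetical error type used only for this illustration.
struct InvalidNumber: Error {}

// TryCapture: a nil result is treated as a failed match attempt.
let lenient = Regex {
    TryCapture(OneOrMore(.digit)) { Int($0) }
}

// Capture with a throwing transform: failure is reported as an error.
let strict = Regex {
    Capture(OneOrMore(.digit)) { (text: Substring) throws -> Int in
        guard let number = Int(text) else { throw InvalidNumber() }
        return number
    }
}

let tooLarge = "99999999999999999999999"  // too many digits to fit in Int

// With TryCapture, the failed conversion simply yields no whole match.
print(tooLarge.wholeMatch(of: lenient) == nil)

// With Capture, the throwing API reports the error from the transform.
var strictError: Error? = nil
do {
    _ = try strict.wholeMatch(in: tooLarge)
} catch {
    strictError = error
}
print(strictError != nil)
```

Which variant is preferable depends on whether an unparsable capture should mean "keep looking for another way to match" or "stop and report a problem".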