Description
The repo needs files which include grammars for the attributes that require evaluation of their strings.
These attributes are:
- For `<add>`: `arr1`, `arr2`, `cond`, `vercond`, `arg` (?)
For each of these, the grammar would answer:
- What is a valid identifier?
- What is a valid number?
- What is a valid operation?
- What is a valid order of operations?
There are additional things to formalize outside of the grammar itself, such as having multiple grammars specialized for the various attributes. For example, logical operations are not valid in `arr1` and `arr2`, or are they? Since we haven't formalized this, there is no actual answer. Right now the only operations that appear in `arr1` and `arr2` expressions are arithmetic, and even those are extremely limited (only 1-2 uses iirc).
I will elaborate on this in the comments, since discussing it at the top of the ticket would likely be overwhelming. :)
Identifier
The rules for what a valid identifier is are fairly simple. It's whatever makes a valid `name` for any of the XML tags, plus a few reserved keywords like the ARG and TEMPLATE tokens. However, the latter does not apply to all of the attributes equally.
Identifier Regex
The current regex matches any sequence of alphanumeric (plus `:`) words separated by a single space.
- Like C and other languages, the identifier cannot start with 0-9. Unlike other languages, underscore is NOT valid for a starting character.
- Colon has to be an exception because several FO4 RTTI include colons, e.g. `BSConnectPoint::Parents`. Colon is also used in some name attributes already.
- Dashes/hyphens which were present in name attributes had to be removed, as `-` is an arithmetic operator.
- Question marks which were present in name attributes also had to be removed. There is no reason to make a name of a field a question.
Rigorous
\b[A-Za-z]+(?:[\:]{0,2}?\s{0,1}?\w)+
Edit: I have added `\b` to the start as it is more explicit about requiring alpha at the start. Without it, it would incorrectly match part of `0x10000`.
Disallows:
- Any character except `A-Z`, `a-z`, `0-9`, `_` and `:`.
- Identifiers not starting with an alphabetical character (`[A-Za-z]`)
- More than one space between words (2+ spaces is considered separate identifiers)
- More than two colons in a row
- Trailing or leading whitespace
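As a quick check of the behaviour listed above, here is a minimal sketch that runs the Rigorous pattern through Python's `re` module. The helper names `is_identifier` and `find_identifiers` are made up for this example and are not part of nifxml.py.

```python
import re

# The "Rigorous" identifier pattern from this ticket, unchanged.
IDENT = re.compile(r"\b[A-Za-z]+(?:[\:]{0,2}?\s{0,1}?\w)+")

def is_identifier(text):
    """True only if the whole string is a single identifier."""
    return IDENT.fullmatch(text) is not None

def find_identifiers(text):
    """Every identifier found anywhere in an expression string."""
    return IDENT.findall(text)

print(is_identifier("User Version 2"))           # multi-word name
print(is_identifier("BSConnectPoint::Parents"))  # RTTI-style name with colons
print(is_identifier("Vertex Desc "))             # trailing space is rejected
print(find_identifiers("Num  Blocks"))           # two spaces split the name in two
print(IDENT.search("0x10000"))                   # a hex literal is not an identifier
```

Note that, as written, the pattern needs at least two characters to match, so a single-letter name would not be accepted.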
Simplified
\b[A-Za-z](?:[\w\:]+\s*)+
Almost exactly half the steps in this case, but the output would have to be trimmed.
Allows:
- Trailing colons (BAD)
- Trailing space (BAD)-ish
- 3+ sequential colons (BAD)-ish
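To illustrate the trimming mentioned above, a small sketch using the Simplified pattern; `first_identifier` is a hypothetical helper for this example only.

```python
import re

# The "Simplified" identifier pattern from this ticket, unchanged.
SIMPLE_IDENT = re.compile(r"\b[A-Za-z](?:[\w\:]+\s*)+")

def first_identifier(expr):
    """Return the first identifier in expr with surrounding whitespace trimmed."""
    match = SIMPLE_IDENT.search(expr)
    return match.group(0).strip() if match else None

raw = SIMPLE_IDENT.search("Num Blocks == 3").group(0)
print(repr(raw))                                   # raw match keeps the trailing space
print(repr(first_identifier("Num Blocks == 3")))   # trimmed result
print(repr(first_identifier("Name:")))             # trailing colon is still accepted
```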
Number
Version Regex
Gamebryo versions are considered a number because they are merely a way of representing each component bit shifted into an analogous number, best seen in hex, e.g. `20.2.0.7` => `0x14020007`.
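The shifting can be sketched as below, assuming one byte per component, which is what the `20.2.0.7` => `0x14020007` example implies. The function names are invented for this sketch, and the short versions (2.3, 3.0, 3.03, 3.1) are deliberately rejected rather than guessed at.

```python
def version_to_int(version):
    """Pack a 4-component version string such as '20.2.0.7' into one integer."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 0xFF for p in parts):
        raise ValueError(f"expected four components in 0..255: {version!r}")
    a, b, c, d = parts
    # One byte per component, most significant first.
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_version(value):
    """Inverse of version_to_int for 4-component versions."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(hex(version_to_int("20.2.0.7")))  # 0x14020007
print(int_to_version(0x14020007))       # 20.2.0.7
```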
It gets complicated because supporting versions 2.3, 3.0, 3.03, and 3.1 would require a regex that can also match floats. However, none of these versions are currently used in `vercond`. They are only referred to in `ver1` and `ver2`, which are not expressions.
Simplest 4-Component or 2+ Component
[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,3}
[0-9]{1,2}\.[0-9]{1,2}[\.]?[0-9]{0,2}[\.]?[0-9]{0,3}
Allows:
- Trailing periods (BAD)
- 3-component versions which do not exist, e.g. `3.0.1` (BAD)
- Versions that start with a 0
- Matches well outside the range of currently valid for each field, e.g. `99.99.99.999`
- Matches substrings of much larger groups of numbers and periods, e.g. `99.99.99.999.99.99.99.999`
Somewhat Rigorous
[0-9]{1,2}\.[0-9]{1,2}(?:[\.][0-9]{0,2}[\.][0-9]{0,3})?
Disallows:
- Trailing periods
- 3-component versions, e.g. `3.0.1`
- Versions that start with a 0
Allows:
- Matches well outside the range of currently valid for each field, e.g. `99.99.99.999`
- Matches substrings of much larger groups of numbers and periods, e.g. `99.99.99.999.99.99.99.999`
Rigorous
(?<![\w\.])[1-9]{1}[0-9]?\.[0-9]{1,2}(?:\.[0-9]{0,2}\.[0-9]{0,3})?(?![\w\.])
The number of steps (in Python at least) blows up with both a negative lookbehind and lookahead. In PCRE (php) it is only ~500 steps. Uncached, both take about 300ms, but even cached, Python takes ~5ms which is quite slow.
Disallows:
- Matching of an entire version if preceded or succeeded immediately by a word character or a dot.
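A short sketch of what the lookbehind and lookahead buy, using the Rigorous pattern as written above. The sample strings are made up for illustration.

```python
import re

# The "Rigorous" version pattern from this ticket, unchanged.
VERSION = re.compile(
    r"(?<![\w\.])[1-9]{1}[0-9]?\.[0-9]{1,2}(?:\.[0-9]{0,2}\.[0-9]{0,3})?(?![\w\.])"
)

samples = [
    "Version >= 20.2.0.7",        # ordinary 4-component version
    "3.1",                        # 2-component version
    "3.0.1",                      # 3-component version, rejected
    "99.99.99.999.99.99.99.999",  # run of numbers and dots, rejected as a whole
    "x20.2.0.7",                  # preceded by a word character, rejected
]
for sample in samples:
    print(f"{sample!r}: {VERSION.findall(sample)}")
```

A 2-component match such as `3.1` is indistinguishable from a float here, which is the complication noted earlier.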
Pedantic
I'm not patient enough to write this one, but it would include minimizing the value ranges of each component to what exists in reality, kind of like limiting IPs to 255.255.255.255 among other limitations.
Numeric Regexes
Decimal
\d+(?!\.)
Any series of 1 or more digits that are not directly succeeded by a dot (in order to not match floats).
Hexadecimal
0[xX][0-9a-fA-F]+
Binary
0b[01]+
Float
Simple
[-+]?[0-9]+\.[0-9]+
This version of the regex could erroneously match 4-component Version identifiers.
Rigorous
(?<![\w\.])[-+]?[0-9]*[\.][0-9]+(?:[eE][-+]?[0-9]+)?(?:[fF])?(?![\w\.])
This still doesn't match the rarer formats `0.`, `-0.`, `0.f`, `-0.f`, `1e10`, and `1e-5`, but no one really writes them that way anyway.
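One way to use the numeric patterns together is to test a whole literal against each one in turn. This is only a sketch: `classify_number` and the kind labels are invented here, and it relies on `fullmatch`, because as a search pattern `\d+(?!\.)` would still pick up the digits after the dot in something like `3.14`.

```python
import re

# The numeric patterns from this ticket, unchanged; the labels are hypothetical.
NUMERIC_PATTERNS = [
    ("hex", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("binary", re.compile(r"0b[01]+")),
    ("float", re.compile(
        r"(?<![\w\.])[-+]?[0-9]*[\.][0-9]+(?:[eE][-+]?[0-9]+)?(?:[fF])?(?![\w\.])")),
    ("decimal", re.compile(r"\d+(?!\.)")),
]

def classify_number(text):
    """Return the kind of numeric literal text is, or None if it is not one."""
    for kind, pattern in NUMERIC_PATTERNS:
        if pattern.fullmatch(text):
            return kind
    return None

for literal in ["0x7F7FFFFF", "0b101", "1.5", "-0.25f", "42", "1e10", "0.", "20.2.0.7"]:
    print(literal, "->", classify_number(literal))
```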
Other Identifiers
TEMPLATE and ARG
These are currently special keywords and should formally be considered tokens. Their use outside of the intended areas (which also aren't currently defined) should be an error. These two tokens are also a good example of why `vercond` at least needs its own separate grammar (aside from arithmetic making no sense there).
- TEMPLATE is not actually used in any attributes with expressions, only `type` and `template`, but it technically could be. Should it?
  - Example: `cond="TEMPLATE == NiProperty"`, i.e. template specialization.
- TEMPLATE and ARG should probably take on the token delimiter syntax proposed in [Spec] Token definitions #70. It would make things consistent and all strings could be split on tokens via one single regex.
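The "one single regex" idea could look roughly like this. The `#NAME#` shape is an assumption based on tokens that appear in this ticket (`#ARG#`, `#TEMPLATE#`, `#BSVERSIONS#`); the authoritative syntax is whatever #70 settles on, and `split_on_tokens` is a hypothetical helper.

```python
import re

# Assumed token shape: '#', then uppercase letters, digits or underscores, then '#'.
TOKEN = re.compile(r"(#[A-Z0-9_]+#)")

def split_on_tokens(expr):
    """Split an expression into token and non-token pieces, dropping empty pieces."""
    return [piece for piece in TOKEN.split(expr) if piece]

print(split_on_tokens("#ARG# + 1"))
print(split_on_tokens("#TEMPLATE# == NiProperty"))
print(split_on_tokens("Num Blocks"))
```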
Operators
For ease of reading I will make tables of what is in use and what operators do and do not need to be implemented that aren't already being used.
Arithmetic
Addition, Multiplication, and Division are all used. Subtraction used to be but was actually unnecessary for that field.
In Use | ✔️ | ❌ |
---|---|---|
`+` `*` `/` | `-` `%` | `++` `--` `=` |
Logical and Bitwise
In Use | ✔️ | ❌ |
---|---|---|
`&&` `\|\|` `!` | | |
`&` `\|` | `<<` `>>` | `^` `~` |
Member Access
With the introduction of `BSVertexDesc`, a compound that is attempting to mimic a bitfield that spans a 64-bit integer, a child of `BSVertexDesc` needed to be passed into another compound via the shared parent.
<add name="Vertex Desc" type="BSVertexDesc" />
<add name="Vertex Data" type="BSVertexData" arg="Vertex Desc\Vertex Attributes" />
Most programming languages use `.` for this, but with our identifiers having spaces, it just looks awkward. (Also, in this particular case, the real answer is to actually introduce a real bitfield type, which is relevant to #3 and the Flags type.)
However, a member access operator is still potentially useful and should be formalized. It could potentially be used nearly everywhere because `User Version 2` is actually incorrect. The header should look something like this:
<compound name="BSStreamHeader" versions="#BETHESDA#">
<add name="Version" type="ulittle32" />
<add name="Export Info" type="ExportInfo" />
</compound>
<compound name="Header">
<!-- -->
<add name="Num Blocks" type="ulittle32" ver1="3.1.0.1" />
<add name="BSStream" type="BSStreamHeader" cond="#BSVERSIONS#" />
<!-- -->
</compound>
This is exactly what happens in their engine (minus the conditions on the header version of course, since it's assumed).
Then anywhere there is `User Version 2`, it would be replaced with `BSStream\Version`.
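On the parser side, resolving a member access path can be as simple as walking nested values. This is a sketch only: the `header` dictionary and its values are invented, and `resolve_member` is not an existing nifxml.py function.

```python
def resolve_member(scope, path, separator="\\"):
    """Follow a path such as 'BSStream\\Version' through nested dictionaries."""
    value = scope
    for name in path.split(separator):
        value = value[name]  # raises KeyError if a member does not exist
    return value

# Hypothetical already-read Header values, for illustration only.
header = {
    "Num Blocks": 12,
    "BSStream": {"Version": 130, "Export Info": {}},
}

print(resolve_member(header, "BSStream\\Version"))
print(resolve_member(header, "Num Blocks"))
```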
Global Variables
This brings up the next issue, which is globally available variables. Right now, all `vercond` expressions assume access to the following identifiers in some way or another:
- `Version`
- `User Version`
- `User Version 2` (i.e. `BSStream\Version`)
Since they have special meaning, they should probably be tokenized (see #70), and any parser should see the token and replace it with the correct value. This is of course only for `vercond`, furthering the need for two grammars.
These variables are referenced in `cond`, but only in the Header compound, since they are actually local variables in that case. This brings up the next thing in need of specification: do we make the version values explicitly unavailable to `cond`? I could contrive uses for having them in `cond` (which I will illustrate in a comment below), but these mostly involve ternary expressions, something that we also haven't specified. The important thing is that `cond` is meant for dynamically changing a field's size and read/write status based on the values of its antecedent siblings.
Therefore the assumption will be that these global values are only available in the grammar specific to `vercond`, as special keywords.
Grammar Format/Notation
I have already nearly finished writing the grammars for both `cond` and `vercond` using a variant of PEG (Packrat) notation. Specifically, this variant can be parsed by Arpeggio for use by nifxml.py.
We do not have to decide on only one format however. A grammars folder has been added to the repo root and any formats we would like to support can be added to it for each grammar. If the files are not actually parsed directly by a library, etc. then at least the grammar is there for reference.
Grammar Requirements for Each Attribute
Operators
| | `cond` | `vercond` | `arr*` | `arg` |
---|---|---|---|---|
Logical | ✔️ | ✔️ | ❌ | ❌ |
Relational | ✔️ | ✔️ | ❌ | ❌ |
Arithmetic | ✔️ | ❌ | ✔️ | ✔️ |
Bitwise | ✔️ | ❌ | ✔️ | ✔️꙳ |
Unary Not (`!`) | ✔️ | ✔️ | ❌ | ❌ |
Unary Minus (`-`) | ✔️ | ❌ | ❌ | ❌ |
Grouping `( )` | ✔️ | ✔️ | ✔️ | ✔️ |
Member access | ✔️ | ❌ | ✔️ | ✔️ |
꙳: In the `BSVertexDesc` case, some masking and bit shifting on a plain int64 inside of `arg` would have been enough to pass the required information to the compound. So, bitwise could be nice to introduce for `arg`.
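For tools that want to check an expression against the operator table above before parsing, the table can be carried as plain data. This is a sketch: the category names and `is_operator_allowed` are made up here, and `arg` bitwise is included as proposed in the footnote rather than as something already in use.

```python
# Operator categories allowed per attribute, transcribed from the table above.
ALLOWED_OPERATORS = {
    "cond": {"logical", "relational", "arithmetic", "bitwise",
             "unary_not", "unary_minus", "grouping", "member_access"},
    "vercond": {"logical", "relational", "unary_not", "grouping"},
    "arr": {"arithmetic", "bitwise", "grouping", "member_access"},
    "arg": {"arithmetic", "bitwise", "grouping", "member_access"},
}

def is_operator_allowed(attribute, category):
    """Check one operator category against the table; arr1 and arr2 share 'arr'."""
    key = "arr" if attribute in ("arr1", "arr2") else attribute
    return category in ALLOWED_OPERATORS[key]

print(is_operator_allowed("vercond", "arithmetic"))  # False
print(is_operator_allowed("arr1", "bitwise"))        # True
print(is_operator_allowed("cond", "unary_minus"))    # True
```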
Operands
| | `cond` | `vercond` | `arr*` | `arg` |
---|---|---|---|---|
Local Ident | ✔️ | ❌ | ✔️ | ✔️ |
Global Ident꙳ | ❌ | ✔️ | ❌ | ❌ |
Version-as-int | ❌ | ✔️ | ❌ | ❌ |
Integer | ✔️ | ✔️ | ✔️ | ✔️ |
Float | ✔️ | ❌ | ❌ | ✔️ |
`#ARG#` ꙳ | ✔️ | ❌ | ✔️ | ✔️ |
`#TEMPLATE#` ꙳ | ❌ | ❌ | ❌ | ❌ |
꙳: Any column with an ❌ for this identifier/token means that the grammar would implement it as a reserved keyword, but do nothing with it (or error, etc.).
As can be seen by comparing the columns, each attribute needs its own specific grammar. Even excluding the possible support of bitwise ops in `arg`, array and args still differ in allowed operands.
Edit: It was overlooked that `arr1` already uses bitwise operators for UV Sets.
Example Grammars (WIP)
Still largely WIP, and I haven't implemented some operators yet, like modulus.
Cond (Gist)
Currently this is the superset of all the grammars, i.e. nothing that is allowed in the other grammars is disallowed here.
- Normally global keywords like `Version` and `User Version` appear in `cond` only in the `Header` compound. These keywords should be reserved and unavailable for use in `cond`, but that requires morphing them into tokens such as `#VER#` and `#USER#` for use in `vercond`.
- Decide if literal floats should be used at all in expressions. Will elaborate below.
There's a very odd case of floating point comparison required for `BSLightingShaderProperty` for FO4:
<add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ... />
To ensure an exact comparison, the RHS is written as essentially hex bytes, and it is assumed that the expression parser converts the float on the LHS to hex bytes as well.
- Should comparing a float with a non-float receive its own special non-terminal in the grammar? This way it is clear that something special is being done.
- Should byte-for-byte comparisons receive an entirely separate operator?
- Should the use of float literals actually be removed from the grammar as they have no use in arithmetic, etc. (for the purposes of nif.xml)?
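For reference, a byte-for-byte comparison like the `Rimlight Power == 0x7F7FFFFF` case can be sketched with the standard `struct` module. `0x7F7FFFFF` is the bit pattern of the largest finite 32-bit float; the helper names and the way the value is obtained are assumptions for this example.

```python
import struct

def float_bits(value):
    """Return the 32-bit IEEE 754 pattern of value as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def bits_to_float(bits):
    """Interpret an unsigned 32-bit integer as a 32-bit float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

FLT_MAX_BITS = 0x7F7FFFFF

# Stand-in for a value read from a file; here it is built from the same bits.
rimlight_power = bits_to_float(FLT_MAX_BITS)

print(float_bits(rimlight_power) == FLT_MAX_BITS)  # True
print(float_bits(1.0) == FLT_MAX_BITS)             # False
```

Comparing the integer patterns sidesteps any question of how a float literal would be written or rounded in the expression itself.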
Vercond (Gist)
As discussed, only has relational and logical operators. No member access operator, bitwise, or arithmetic operators.
Current Issues:
- Must not allow the entire string to only be a version keyword or a version number, must have both LHS/RHS
- Disallow use of `Version`, `User Version`, etc. by reserving them as keywords.
- Replace them with tokens such as `#VER#` and `#USER#` so that they are unambiguous.
- Alternate version identifier using ID values formalized in [Spec] Version tag changes #69 in the format `VXX_X_X_X`. This allows any language, interpreted or generated, to skip converting the `XX.X.X.X` format to an integer and use the enumeration value directly.
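To make the constraints above concrete, here is a hand-written recursive descent sketch. It is not the Arpeggio PEG grammar from the gist. It assumes the `#VER#` and `#USER#` tokens suggested above, 4-component version literals, and C-like precedence (`!` over `&&` over `||`), none of which is final.

```python
import operator
import re

# Hypothetical token set: two global tokens, version literals, integers,
# and only relational/logical operators plus grouping.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<global>\#VER\#|\#USER\#)
      | (?P<version>\d{1,2}\.\d{1,2}\.\d{1,2}\.\d{1,3})
      | (?P<int>0[xX][0-9a-fA-F]+|\d+)
      | (?P<op>&&|\|\||==|!=|<=|>=|<|>|!|\(|\))
    )""", re.VERBOSE)

RELATIONAL = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
              "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def version_to_int(version):
    a, b, c, d = (int(p) for p in version.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:  # arithmetic, bitwise, member access and names all end up here
            raise SyntaxError(f"unexpected input: {text[pos:]!r}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class VercondParser:
    def __init__(self, tokens, env):
        self.tokens, self.pos, self.env = tokens, 0, env

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        value = self.or_expr()
        if self.pos != len(self.tokens):
            raise SyntaxError("unexpected trailing input")
        return value

    def or_expr(self):
        value = self.and_expr()
        while self.peek() == ("op", "||"):
            self.take()
            rhs = self.and_expr()  # always parse the right side
            value = value or rhs
        return value

    def and_expr(self):
        value = self.unary()
        while self.peek() == ("op", "&&"):
            self.take()
            rhs = self.unary()
            value = value and rhs
        return value

    def unary(self):
        if self.peek() == ("op", "!"):
            self.take()
            return not self.unary()
        if self.peek() == ("op", "("):
            self.take()
            value = self.or_expr()
            if self.take() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        return self.comparison()

    def comparison(self):
        lhs = self.operand()
        kind, op = self.take()
        if kind != "op" or op not in RELATIONAL:
            raise SyntaxError("a vercond needs both an LHS and an RHS")
        return RELATIONAL[op](lhs, self.operand())

    def operand(self):
        kind, text = self.take()
        if kind == "global":
            return self.env[text]
        if kind == "version":
            return version_to_int(text)
        if kind == "int":
            return int(text, 16) if text[:2].lower() == "0x" else int(text)
        raise SyntaxError(f"expected an operand, got {text!r}")

def evaluate_vercond(text, env):
    return VercondParser(tokenize(text), env).parse()

env = {"#VER#": version_to_int("20.2.0.7"), "#USER#": 12}
print(evaluate_vercond("#VER# >= 20.2.0.7 && #USER# == 12", env))
print(evaluate_vercond("!(#VER# < 10.0.1.0) || #USER# > 100", env))
```

Because every leaf must be a comparison, a string that is only `#VER#` or only `20.2.0.7` is rejected, which covers the first issue in the list.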
Arr (Gist)
No relational or logical operators. No floating point numbers. Has member access operators, bitwise, and arithmetic.
Arg
Haven't approached this with any grammar yet. Still not totally sure if it will be necessary.
Note
This ticket is WIP and I will be updating it more this evening, and adding the examples of the PEG grammar once I finalize some things. I do have Arpeggio working fully with the grammars I've defined, and it will be replacing the expression parsing in nifxml.py.
I have yet to elaborate on:
- What is a valid order of operations? (I currently have *an* order, but order of operations has some subtleties to discuss)
- A summary of what is actually changing vs what is just being formalized, i.e. what repercussions this has on existing projects.
- The new `<version>` ID attribute (Issue [Spec] Version tag changes #69), and it becoming its own identifier format in addition to (or maybe replacing) the `XX.X.X.X` format used now. Essentially, the version has a unique identifier and can be referenced directly now, even for version ranges, which is simply all the `<version>` between two specified IDs. See checklists above.
- `Version` vs `#VERSION#` (or `#VER#`, etc.) in expressions. That is, we need to make Version, User Version, and BSStream Version reserved keywords or tokens, and also disallow them in `cond` because it can't have globals. But in the `Header` compound, they are local and need to be used as locals, and so can't be reserved. Thus all the `Version` etc. in `vercond` probably need to be tokenized and I will elaborate on that later. See checklists above.