-
Notifications
You must be signed in to change notification settings - Fork 523
Normalize request path to NFC and remove/resolve dot segments (#273). #573
Conversation
Ping. |
I'm curious to see how this affects perf for requests that don't need normalization (e.g. the plaintext benchmark). |
|
||
[ConditionalFact] | ||
[FrameworkSkipCondition(RuntimeFrameworks.Mono, SkipReason = "Test hangs after execution on Mono.")] | ||
public async Task RequestPathIsNormalized() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: add test with PathBase.
It shouldn't effect it much as i.e. it only does it to % encoded strings |
There shouldn't be much of an effect as long as |
@@ -775,6 +775,11 @@ protected bool TakeStartLine(SocketInput input) | |||
// URI was encoded, unescape and then parse as utf8 | |||
pathEnd = UrlPathDecoder.Unescape(pathBegin, pathEnd); | |||
requestUrlPath = pathBegin.GetUtf8String(pathEnd); | |||
|
|||
if (PathNormalizer.NeedsNormalization(requestUrlPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needDecode and NeedsNormalization are now out of sync because dots don't trigger needDecode. Add a functional test because I don't think this code is actually executing right now.
@benaadams Actually I was wrong to apply the normalization only to the path with percent-encoded characters in the URL. Since normalization comprises of normalization to NFC + dot segment removal, I have to check for the need for normalization in the plain text path too (because there might be dot segments in the request path). |
a2d48ce
to
de913c4
Compare
Comments addressed, perf test still pending. |
There's a small perf hit in the plain text benchmark: Before:
After:
Ratio between worst-after and best-before is 0.9786. |
The hit is significant when decoding + normalization are needed: Before:
After:
|
@benaadams NFC normalization will only take place in paths with percent-encoded characters. I'm making a small change that might make a difference - I'll only try to remove dot segments if I detect any. This will save cycles and allocations. |
{ | ||
if (path.IndexOf('/') > -1) | ||
{ | ||
var normalizedChars = new char[path.Length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System.Buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or jump in before the path string is created and work on the byte buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with it. Any pointers/examples?
Do you have any data for normalizing |
@halter73 Gathering some. |
path = new string(normalizedChars, normalizedIndex, normalizedChars.Length - normalizedIndex); | ||
} | ||
|
||
if (!path.IsNormalized(NormalizationForm.FormC)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason dot compression and unicode normalization have to be run at the same time. If you separate them then you can limit the unicode normalization to only run in the needDecode scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, will change.
Got some perf back by avoiding dot segment removal when not necessary:
@Tratcher suggested a better change which I'll implement now. |
New plain text numbers. A veeery small improvement with suggested changes:
|
return path; | ||
} | ||
|
||
private static bool ContainsDotSegments(string path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use MemoryPool2Iterator2.Seek
to find any '/' or '.' characters? I think we need to ask @troydai if this could work and if any utf8 characters could be normalized to a '/' or a '.'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check back on find to see if previous byte has high byte set? (e.g. >= 128)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benaadams I don't follow. How would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CesarBS actually you don't need to back check; can remove dots before, after or during ut8 encoding
@halter73 for Url Path Encoding %2E
is a valid .
, however, in utf8 bytes only .
is .
and only /
is /
Good table at start of https://en.wikipedia.org/wiki/UTF-8#Description
- Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
- Clear distinction between multi-byte and single-byte characters: Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position.
And later as part of advantages
UTF-8 uses the codes 0–127 only for the ASCII characters. This means that UTF-8 is an ASCII extension and can be processed by software that supports 7-bit characters and assigns no meaning to non-ASCII bytes.
We're hoping to move the path/dotsegment normilization logic to https://github.com/aspnet/FileSystem |
Changing
|
Path and dot segment isn't just about file systems though. So it needs to happen higher up. Also we need to have an unnormalized property on the request for people who want to do really weird things. |
@blowdart can we design the GetRawUrl feature later? We'd need to decide how it flows through the entire stack. |
Sure, it can wait till after RC2 |
Ok, @CesarBS file a separate bug for that |
Filed #594. |
@@ -325,5 +325,8 @@ void IHttpRequestLifetimeFeature.Abort() | |||
{ | |||
Abort(); | |||
} | |||
|
|||
// TODO: remove before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget!
|
ae6a3d0
to
747fdca
Compare
0e2b9ec
to
abc10a0
Compare
abc10a0
to
1209eca
Compare
#273
cc @Tratcher @halter73 @blowdart