This repository was archived by the owner on Dec 18, 2018. It is now read-only.

Normalize request path to NFC and remove/resolve dot segments (#273). #573

Merged
merged 1 commit into from
Jan 29, 2016

Conversation

cesarblum
Contributor

@cesarblum
Contributor Author

Ping.

@halter73
Member

I'm curious to see how this affects perf for requests that don't need normalization (e.g. the plaintext benchmark).


[ConditionalFact]
[FrameworkSkipCondition(RuntimeFrameworks.Mono, SkipReason = "Test hangs after execution on Mono.")]
public async Task RequestPathIsNormalized()
Contributor Author

Note to self: add test with PathBase.

@benaadams
Contributor

It shouldn't affect it much, as needDecode would be false in that case and it should just skip it?

i.e. it only does it for percent-encoded strings

@halter73
Member

There shouldn't be much of an effect as long as PathNormalizer.NeedsNormalization isn't too expensive. I would like to know for sure that's the case.

@@ -775,6 +775,11 @@ protected bool TakeStartLine(SocketInput input)
// URI was encoded, unescape and then parse as utf8
pathEnd = UrlPathDecoder.Unescape(pathBegin, pathEnd);
requestUrlPath = pathBegin.GetUtf8String(pathEnd);

if (PathNormalizer.NeedsNormalization(requestUrlPath))
Member

needDecode and NeedsNormalization are now out of sync because dots don't trigger needDecode. Add a functional test because I don't think this code is actually executing right now.

@cesarblum
Contributor Author

@benaadams Actually, I was wrong to apply the normalization only to paths with percent-encoded characters in the URL. Since normalization consists of NFC normalization plus dot-segment removal, I have to check whether the plain-text path needs normalization too (because there might be dot segments in the request path).
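
For readers unfamiliar with dot-segment removal, here is a minimal sketch of the idea, in the spirit of the `remove_dot_segments` procedure in RFC 3986 section 5.2.4. It is not the code from this PR: the class and method names are hypothetical, and it assumes the path is absolute (starts with `/`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not the PR's PathNormalizer.
public static class DotSegmentSketch
{
    // Assumes an absolute path such as "/a/b/../c".
    public static string RemoveDotSegments(string path)
    {
        string[] segments = path.Split('/');
        var output = new List<string>();

        for (int i = 0; i < segments.Length; i++)
        {
            string segment = segments[i];
            bool isLast = i == segments.Length - 1;

            if (segment == ".")
            {
                // "." refers to the current segment; drop it.
                // A trailing "." still leaves a trailing slash.
                if (isLast) output.Add("");
            }
            else if (segment == "..")
            {
                // ".." removes the previous segment, but never the
                // leading empty entry that represents the root "/".
                if (output.Count > 1) output.RemoveAt(output.Count - 1);
                if (isLast) output.Add("");
            }
            else
            {
                output.Add(segment);
            }
        }

        return string.Join("/", output);
    }
}
```

With this sketch, `RemoveDotSegments("/a/b/../c")` yields `/a/c`, and `RemoveDotSegments("/../a")` yields `/a` because `..` cannot climb above the root. Note that no percent-encoding is involved, which is why a plain-text path can still need this step.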

@cesarblum cesarblum force-pushed the cesarbs/normalize-request-path branch from a2d48ce to de913c4 Compare January 19, 2016 22:40
@cesarblum
Contributor Author

Comments addressed, perf test still pending.

@cesarblum
Contributor Author

There's a small perf hit in the plain text benchmark:

Before:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.67ms   12.42ms 187.86ms   94.60%
    Req/Sec    34.92k     3.16k   61.86k    77.44%
  11197736 requests in 10.10s, 1.38GB read
Requests/sec: 1108681.79
Transfer/sec:    139.57MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.96ms    8.65ms 127.63ms   91.75%
    Req/Sec    34.84k     3.09k   44.80k    72.59%
  11194786 requests in 10.10s, 1.38GB read
Requests/sec: 1108504.52
Transfer/sec:    139.54MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.66ms   22.11ms 378.68ms   89.15%
    Req/Sec    34.24k     3.55k   59.86k    73.48%
  10997567 requests in 10.10s, 1.35GB read
Requests/sec: 1088929.91
Transfer/sec:    137.08MB

After:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.54ms   22.43ms 334.46ms   93.61%
    Req/Sec    33.75k     3.49k   64.74k    78.30%
  10834047 requests in 10.10s, 1.33GB read
Requests/sec: 1072774.90
Transfer/sec:    135.05MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.06ms   15.59ms 285.80ms   92.87%
    Req/Sec    33.99k     2.79k   51.70k    74.58%
  10911305 requests in 10.10s, 1.34GB read
Requests/sec: 1080355.79
Transfer/sec:    136.00MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.93ms   11.70ms 254.75ms   92.00%
    Req/Sec    33.52k     3.10k   49.13k    74.06%
  10763402 requests in 10.10s, 1.32GB read
Requests/sec: 1065676.09
Transfer/sec:    134.15MB

The ratio of the worst "after" run to the worst "before" run is 0.9786; against the best "before" run it is 0.9612.
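
The ratio can be rechecked from the Requests/sec figures in the six runs above. The sketch below only copies those figures and divides them; the class and member names are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the numbers are the Requests/sec values from the runs above.
public static class RatioSketch
{
    public static readonly double[] Before = { 1108681.79, 1108504.52, 1088929.91 };
    public static readonly double[] After = { 1072774.90, 1080355.79, 1065676.09 };

    // Worst "after" run divided by worst "before" run, rounded to 4 places.
    public static double WorstAfterOverWorstBefore() =>
        Math.Round(After.Min() / Before.Min(), 4);

    // Worst "after" run divided by best "before" run, rounded to 4 places.
    public static double WorstAfterOverBestBefore() =>
        Math.Round(After.Min() / Before.Max(), 4);
}
```

`WorstAfterOverWorstBefore()` gives 0.9786 and `WorstAfterOverBestBefore()` gives 0.9612, so the plain-text hit is somewhere between roughly 2% and 4% depending on which runs are compared.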

@cesarblum
Contributor Author

The hit is significant when both decoding and normalization are needed (roughly 950k to 994k requests/sec before versus about 715k to 718k after):

Before:

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.79ms    7.68ms 200.81ms   96.34%
    Req/Sec    29.86k     3.22k   51.04k    73.78%
  9585223 requests in 10.10s, 1.18GB read
Requests/sec: 949067.03
Transfer/sec:    119.47MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.21ms    9.23ms 129.26ms   94.94%
    Req/Sec    30.78k     2.88k   46.97k    78.71%
  9882682 requests in 10.10s, 1.21GB read
Requests/sec: 978584.03
Transfer/sec:    123.19MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.91ms   12.42ms 235.40ms   94.12%
    Req/Sec    31.24k     2.58k   49.29k    77.63%
  10036237 requests in 10.10s, 1.23GB read
Requests/sec: 993710.05
Transfer/sec:    125.09MB

After:

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.75ms    3.77ms  94.38ms   94.26%
    Req/Sec    22.60k     1.53k   32.20k    79.50%
  7252043 requests in 10.10s, 0.89GB read
Requests/sec: 718004.23
Transfer/sec:     90.39MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.56ms    3.10ms  58.15ms   91.99%
    Req/Sec    22.61k     1.54k   31.35k    77.02%
  7255969 requests in 10.10s, 0.89GB read
  Socket errors: connect 0, read 0, write 81, timeout 0
Requests/sec: 718417.96
Transfer/sec:     90.44MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.51ms    2.89ms  54.24ms   88.69%
    Req/Sec    22.50k     1.49k   32.51k    78.47%
  7218541 requests in 10.10s, 0.89GB read
Requests/sec: 714747.15
Transfer/sec:     89.98MB
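
The benchmark path `/plaintext/%41%CC%8A` exercises both steps: the escapes decode to `A` followed by U+030A (combining ring above), and NFC composes that pair into the single code point U+00C5. The sketch below illustrates this with standard .NET calls; it uses `Uri.UnescapeDataString` as a stand-in for Kestrel's own `UrlPathDecoder`, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; Kestrel decodes with its own UrlPathDecoder.
public static class NfcSketch
{
    public static string DecodeAndNormalize(string rawPath)
    {
        // Decode percent-escapes, interpreting the bytes as UTF-8.
        string decoded = Uri.UnescapeDataString(rawPath);

        // Only pay for Normalize when the string is not already in NFC.
        return decoded.IsNormalized(NormalizationForm.FormC)
            ? decoded
            : decoded.Normalize(NormalizationForm.FormC);
    }
}
```

Here `DecodeAndNormalize("/plaintext/%41%CC%8A")` returns `/plaintext/\u00C5`, one character shorter than the decoded but unnormalized string.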

@cesarblum
Contributor Author

@benaadams NFC normalization will only take place in paths with percent-encoded characters.

I'm making a small change that might make a difference - I'll only try to remove dot segments if I detect any. This will save cycles and allocations.
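
A cheap detection pass of the kind described here might look like the following. This is only a sketch: the method name matches the `ContainsDotSegments` that appears later in the diff, but the body is an assumption, not the PR's implementation. It treats `.` or `..` as a dot segment only when it occupies a whole segment.

```csharp
using System;

// Hypothetical body; only the method name comes from the PR.
public static class DotSegmentDetectionSketch
{
    public static bool ContainsDotSegments(string path)
    {
        for (int i = 0; i < path.Length; i++)
        {
            if (path[i] != '/') continue;

            int j = i + 1;
            if (j < path.Length && path[j] == '.')
            {
                j++;
                // Allow a second dot for "..".
                if (j < path.Length && path[j] == '.') j++;

                // A dot segment must end at the next '/' or at the end of the path.
                if (j == path.Length || path[j] == '/') return true;
            }
        }

        return false;
    }
}
```

Paths such as `/a/.hidden` or `/a/...` are not flagged, so ordinary requests skip the removal step and its `char[]` allocation entirely.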

{
if (path.IndexOf('/') > -1)
{
var normalizedChars = new char[path.Length];
Contributor

System.Buffers?

Contributor

Or jump in before the path string is created and work on the byte buffer?

Contributor Author

I'm not familiar with it. Any pointers/examples?

@halter73
Member

Do you have any data for normalizing ../, or ./ in paths?

@cesarblum
Contributor Author

@halter73 Gathering some.

path = new string(normalizedChars, normalizedIndex, normalizedChars.Length - normalizedIndex);
}

if (!path.IsNormalized(NormalizationForm.FormC))
Member

I don't see any reason dot compression and unicode normalization have to be run at the same time. If you separate them then you can limit the unicode normalization to only run in the needDecode scenario.

Contributor Author

You're right, will change.

@cesarblum
Contributor Author

Got some perf back by avoiding dot segment removal when not necessary:

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.91ms    5.80ms 173.92ms   95.48%
    Req/Sec    24.26k     1.88k   37.44k    75.29%
  7790526 requests in 10.10s, 0.96GB read
Requests/sec: 771378.77
Transfer/sec:     97.11MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.52ms    4.20ms 102.13ms   94.37%
    Req/Sec    24.24k     1.69k   36.02k    77.64%
  7780465 requests in 10.10s, 0.96GB read
Requests/sec: 770358.86
Transfer/sec:     96.98MB

Running 10s test @ http://10.0.0.100:5001/plaintext/%41%CC%8A
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.63ms    4.17ms 113.08ms   93.35%
    Req/Sec    24.48k     1.76k   36.00k    74.83%
  7859579 requests in 10.10s, 0.97GB read
Requests/sec: 778191.01
Transfer/sec:     97.96MB

@Tratcher suggested a better change which I'll implement now.

@cesarblum
Contributor Author

New plain text numbers. A very small improvement with the suggested changes:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.83ms   16.09ms 271.85ms   93.81%
    Req/Sec    34.11k     2.59k   48.42k    75.19%
  10949164 requests in 10.10s, 1.35GB read
Requests/sec: 1084253.42
Transfer/sec:    136.49MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.68ms   10.46ms 209.49ms   93.48%
    Req/Sec    34.01k     2.61k   47.20k    76.08%
  10923702 requests in 10.10s, 1.34GB read
Requests/sec: 1081589.99
Transfer/sec:    136.16MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.48ms   17.26ms 297.70ms   93.62%
    Req/Sec    33.99k     3.81k   60.87k    85.81%
  10857619 requests in 10.10s, 1.33GB read
Requests/sec: 1075094.28
Transfer/sec:    135.34MB

return path;
}

private static bool ContainsDotSegments(string path)
Member

Can we use MemoryPool2Iterator2.Seek to find any '/' or '.' characters? I think we need to ask @troydai if this could work and if any utf8 characters could be normalized to a '/' or a '.'.

Contributor

Check back on find to see if the previous byte has its high bit set? (e.g. >= 128)

Contributor Author

@benaadams I don't follow. How would that work?

Contributor

@CesarBS Actually, you don't need to check back; dots can be removed before, during, or after UTF-8 encoding.
@halter73 In URL path encoding, %2E is also a valid '.'; in UTF-8 bytes, however, only the '.' byte is a '.' and only the '/' byte is a '/'.

Good table at start of https://en.wikipedia.org/wiki/UTF-8#Description

  • Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
  • Clear distinction between multi-byte and single-byte characters: Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position.

And later as part of advantages

UTF-8 uses the codes 0–127 only for the ASCII characters. This means that UTF-8 is an ASCII extension and can be processed by software that supports 7-bit characters and assigns no meaning to non-ASCII bytes.
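
The property quoted above is what makes a raw byte scan safe: every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte equal to `.` (0x2E) or `/` (0x2F) can only be that ASCII character. A minimal sketch follows, using a plain `byte[]` rather than Kestrel's memory pool types; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper; operates on a plain byte[] for illustration.
public static class Utf8ScanSketch
{
    // Returns the index of the first '.' or '/' byte at or after start, or -1.
    public static int IndexOfDotOrSlash(byte[] utf8, int start = 0)
    {
        for (int i = start; i < utf8.Length; i++)
        {
            byte b = utf8[i];

            // Lead and continuation bytes of multi-byte sequences are >= 0x80,
            // so they can never compare equal to these ASCII values.
            if (b == (byte)'.' || b == (byte)'/') return i;
        }

        return -1;
    }
}
```

For example, `IndexOfDotOrSlash(Encoding.UTF8.GetBytes("caf\u00E9/x"))` returns 5: the `é` occupies bytes 3 and 4 (0xC3, 0xA9), neither of which can be mistaken for a separator. This says nothing about percent-encoded forms such as `%2E`, which still have to be decoded first.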

@halter73
Member

We're hoping to move the path/dot-segment normalization logic to https://github.com/aspnet/FileSystem

@cesarblum
Contributor Author

Changing ContainsDotSegments to use a pointer instead of an index improved things:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.33ms   14.74ms 263.96ms   93.40%
    Req/Sec    35.23k     3.33k   61.66k    79.25%
  11212229 requests in 10.10s, 1.38GB read
  Socket errors: connect 0, read 0, write 542, timeout 0
Requests/sec: 1110137.07
Transfer/sec:    139.75MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.64ms   16.03ms 243.53ms   93.68%
    Req/Sec    35.10k     3.50k   63.18k    84.33%
  11226167 requests in 10.10s, 1.38GB read
Requests/sec: 1111526.44
Transfer/sec:    139.92MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.57ms   12.28ms 236.93ms   95.26%
    Req/Sec    34.36k     2.81k   52.99k    71.87%
  11025641 requests in 10.10s, 1.36GB read
Requests/sec: 1091646.87
Transfer/sec:    137.42MB

@blowdart
Member

Path and dot-segment normalization isn't just about file systems, though, so it needs to happen higher up.

Also we need to have an unnormalized property on the request for people who want to do really weird things.

@Tratcher
Member

@blowdart can we design the GetRawUrl feature later? We'd need to decide how it flows through the entire stack.

@blowdart
Member

Sure, it can wait till after RC2

@Tratcher
Member

Ok, @CesarBS file a separate bug for that

@cesarblum
Contributor Author

Filed #594.

@@ -325,5 +325,8 @@ void IHttpRequestLifetimeFeature.Abort()
{
Abort();
}

// TODO: remove before merging.
Member

Don't forget!

@halter73
Member

:shipit:

@cesarblum cesarblum force-pushed the cesarbs/normalize-request-path branch 2 times, most recently from ae6a3d0 to 747fdca Compare January 26, 2016 21:40
@cesarblum cesarblum force-pushed the cesarbs/normalize-request-path branch 2 times, most recently from 0e2b9ec to abc10a0 Compare January 28, 2016 22:24
@cesarblum cesarblum force-pushed the cesarbs/normalize-request-path branch from abc10a0 to 1209eca Compare January 28, 2016 23:30
@cesarblum cesarblum merged commit 1209eca into dev Jan 29, 2016
@cesarblum cesarblum deleted the cesarbs/normalize-request-path branch January 29, 2016 17:03

6 participants