Do not filter keys that use codepoints outside the BMP when listing #986

pepijnve · 2020-04-15T09:46:34Z

This PR is related to #980.

When an upper bound (lt or lte) is not provided, level-sublevel uses a default upperbound of <= 0xFFFF. This incorrectly filters out any keys using codepoints outside the BMP because the keys in leveldb are UTF-8 encoded and leveldb uses binary comparison when comparing keys.

A similar issue exists in Arsenal itself where correct key ordering is verified in an extra filter on the keys during listing.

This PR resolves the first issue by avoiding the usage of the default level-sublevel filter.
The second issue is resolved by using a custom string comparison function that compare codepoint values instead of UTF-16 chars. This is effectively the same as UTF-8 binary ordering.

I'm not sure if the default filter avoidance is in the correct place. Currently it's applied to all backend that use the list algorithms; perhaps this should be pushed down to file backend specific code?

…rver If a less-than(-equal) filter is not provided level-sublevel will add a default one which uses '\uffff' as upperbound. This incorrectly filters out all codepoints outside the BMP. By always passing an explicit filter this default behaviour is avoided. Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>

The default String comparison operators compare Strings using UTF-16 binary ordering. Using this ordering codepoints in the BMP region after the UTF-16 surrogate pair region are incorrectly considered to be greater than codepoints outside the BMP. Note that the default String <= is still used under the assumption that this type of keys will be rare and the <= will be faster than utf8Compare. This way the extra cost of utf8Compare only applies in the rare cases where these ordering issues do occur. Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>

miniscruff · 2020-04-15T17:42:25Z

lib/algos/list/tools.js

+ * @param {String} s2 the second string to compare
+ * @return {number} -1, 0, or 1 if s1 is less than, equal or greater than s2 respectively
+ */
+function utf8Compare(s1, s2) {


Hi, me again. This seems very similar to what localeCompare probably does, have you tried using that? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare

If I understand it correctly localeCompare does locale sensitive collation. That's not what you want here; it needs to be locale independent UTF-8 binary order.

I had another look at the description on MDN and the two pointers to the relevant Ecma specifications. There's a lot of room for 'implementation dependent behaviour' which makes localeCompare unsuitable for this usage. Amazon is fairly vague on this, but https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html does state

List results are always returned in UTF-8 binary order.

It seems like the optional arguments will have this type of comparison but I can not find the exact one as I don't fully understand the problem.

You're going to go down a rabbit hole here 😄

Looking at V8 you end up at https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/builtins/builtins-string.cc#L135
which, assuming ICU support has been compiled in, delegates to
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/objects/intl-objects.cc#L947
which in turn creates an ICU collator and then compares using that.

Making some assumption here, but you're most likely going get a Unicode collator that uses the system's default locale if you do not specify any options. You can find the details of what that is exactly at https://www.unicode.org/reports/tr10/#Main_Algorithm. In a nutshell this is sorting for humans taking language/culture specific details into account. That's not what you want here.
I had hoped to be able to find a locale code or an option value that says 'I want codepoint order', but as far as I can tell that does not exist.

UTF-8 binary order means encoding both strings as UTF-8 and comparing byte-wise. Thanks to the way UTF-8 is constructed this is identical to the ordering you get when you take the codepoint values of the string and compare those integer-wise.

You might be wondering why the default UTF-16 comparison isn't good enough. Best way I can explain that is with an example.
Take codepoints 48, 65104 and 129648. That's 0,﹐and 🩰 respectively.
Sorting in codepoint order you get them in exactly that order.

If you encode in UTF-16 you get

0x0030 0xFE50 0xD83E 0xDE70

respectively. UTF-16 binary sorting compares each 16-bit value one by one so that would return the string in the order [0, 🩰, ﹐].

Encode in UTF-8 and you get

0x30 0xEF 0xB9 0x90 0xF0 0x9F 0xA9 0xB0

respectively. UTF-8 binary sorting compares each 8-bit value one by one and then you get [0,﹐,🩰] which is what you want.

This is by design in UTF-8. Binary sort order for UTF-8 is identical to codepoint order.

Yea that's fair, I had assumed it was something similar to that affect but was hoping there was something built in as this seems like it would be easy and common enough. I am new to scality so I won't be able to merge this but its got my approval if that means anything.

lib/algos/list/tools.js

Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>

rahulreddy · 2020-05-09T17:02:19Z

@pepijnve I will ask the team to review this, one more review I will merge it in the branch. You can squash your commits after that 😃

ilkescality

LGTM

pepijnve added 2 commits April 15, 2020 10:28

pepijnve changed the title ~~Do not filter keys using codepoints outside the BMP when listing~~ Do not filter keys that use codepoints outside the BMP when listing Apr 15, 2020

miniscruff reviewed Apr 15, 2020

View reviewed changes

lib/algos/list/tools.js Show resolved Hide resolved

miniscruff reviewed Apr 15, 2020

View reviewed changes

lib/algos/list/tools.js Outdated Show resolved Hide resolved

miniscruff reviewed Apr 15, 2020

View reviewed changes

lib/algos/list/tools.js Outdated Show resolved Hide resolved

pepijnve mentioned this pull request Apr 16, 2020

Correct UTF-16 surrogates URI encoding #980

Closed

Fix eslint errors

5924b0a

Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>

ilkescality approved these changes May 12, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Do not filter keys that use codepoints outside the BMP when listing #986

Do not filter keys that use codepoints outside the BMP when listing #986

Uh oh!

pepijnve commented Apr 15, 2020

Uh oh!

miniscruff Apr 15, 2020

Uh oh!

pepijnve Apr 15, 2020

Uh oh!

pepijnve Apr 15, 2020

Uh oh!

miniscruff Apr 15, 2020

Uh oh!

pepijnve Apr 15, 2020 •

edited

Loading

Uh oh!

miniscruff Apr 15, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rahulreddy commented May 9, 2020

Uh oh!

ilkescality left a comment

Uh oh!

Uh oh!

Do not filter keys that use codepoints outside the BMP when listing #986

Are you sure you want to change the base?

Do not filter keys that use codepoints outside the BMP when listing #986

Uh oh!

Conversation

pepijnve commented Apr 15, 2020

Uh oh!

miniscruff Apr 15, 2020

Choose a reason for hiding this comment

Uh oh!

pepijnve Apr 15, 2020

Choose a reason for hiding this comment

Uh oh!

pepijnve Apr 15, 2020

Choose a reason for hiding this comment

Uh oh!

miniscruff Apr 15, 2020

Choose a reason for hiding this comment

Uh oh!

pepijnve Apr 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

miniscruff Apr 15, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rahulreddy commented May 9, 2020

Uh oh!

ilkescality left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pepijnve Apr 15, 2020 •

edited

Loading