Skip to content

Do not filter keys that use codepoints outside the BMP when listing #986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: development/8.1
Choose a base branch
from

Conversation

pepijnve
Copy link
Contributor

This PR is related to #980.

When an upper bound (lt or lte) is not provided, level-sublevel uses a default upperbound of <= 0xFFFF. This incorrectly filters out any keys using codepoints outside the BMP because the keys in leveldb are UTF-8 encoded and leveldb uses binary comparison when comparing keys.

A similar issue exists in Arsenal itself where correct key ordering is verified in an extra filter on the keys during listing.

This PR resolves the first issue by avoiding the usage of the default level-sublevel filter.
The second issue is resolved by using a custom string comparison function that compare codepoint values instead of UTF-16 chars. This is effectively the same as UTF-8 binary ordering.

I'm not sure if the default filter avoidance is in the correct place. Currently it's applied to all backend that use the list algorithms; perhaps this should be pushed down to file backend specific code?

…rver

If a less-than(-equal) filter is not provided level-sublevel will add a default one which uses '\uffff' as upperbound. This incorrectly filters out all codepoints outside the BMP. By always passing an explicit filter this default behaviour is avoided.

Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>
The default String comparison operators compare Strings using UTF-16 binary ordering.
Using this ordering codepoints in the BMP region after the UTF-16 surrogate pair region are incorrectly considered to be greater than codepoints outside the BMP.

Note that the default String <= is still used under the assumption that this type of keys will be rare and the <= will be faster than utf8Compare. This way the extra cost of utf8Compare only applies in the rare cases where these ordering issues do occur.

Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>
@pepijnve pepijnve changed the title Do not filter keys using codepoints outside the BMP when listing Do not filter keys that use codepoints outside the BMP when listing Apr 15, 2020
* @param {String} s2 the second string to compare
* @return {number} -1, 0, or 1 if s1 is less than, equal or greater than s2 respectively
*/
function utf8Compare(s1, s2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, me again. This seems very similar to what localeCompare probably does, have you tried using that? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly localeCompare does locale sensitive collation. That's not what you want here; it needs to be locale independent UTF-8 binary order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another look at the description on MDN and the two pointers to the relevant Ecma specifications. There's a lot of room for 'implementation dependent behaviour' which makes localeCompare unsuitable for this usage. Amazon is fairly vague on this, but https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html does state

List results are always returned in UTF-8 binary order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the optional arguments will have this type of comparison but I can not find the exact one as I don't fully understand the problem.

Copy link
Contributor Author

@pepijnve pepijnve Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're going to go down a rabbit hole here 😄

Looking at V8 you end up at https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/builtins/builtins-string.cc#L135
which, assuming ICU support has been compiled in, delegates to
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/objects/intl-objects.cc#L947
which in turn creates an ICU collator and then compares using that.

Making some assumption here, but you're most likely going get a Unicode collator that uses the system's default locale if you do not specify any options. You can find the details of what that is exactly at https://www.unicode.org/reports/tr10/#Main_Algorithm. In a nutshell this is sorting for humans taking language/culture specific details into account. That's not what you want here.
I had hoped to be able to find a locale code or an option value that says 'I want codepoint order', but as far as I can tell that does not exist.

UTF-8 binary order means encoding both strings as UTF-8 and comparing byte-wise. Thanks to the way UTF-8 is constructed this is identical to the ordering you get when you take the codepoint values of the string and compare those integer-wise.

You might be wondering why the default UTF-16 comparison isn't good enough. Best way I can explain that is with an example.
Take codepoints 48, 65104 and 129648. That's 0,﹐and 🩰 respectively.
Sorting in codepoint order you get them in exactly that order.

If you encode in UTF-16 you get

0x0030
0xFE50
0xD83E 0xDE70

respectively. UTF-16 binary sorting compares each 16-bit value one by one so that would return the string in the order [0, 🩰, ﹐].

Encode in UTF-8 and you get

0x30
0xEF 0xB9 0x90
0xF0 0x9F 0xA9 0xB0

respectively. UTF-8 binary sorting compares each 8-bit value one by one and then you get [0,﹐,🩰] which is what you want.

This is by design in UTF-8. Binary sort order for UTF-8 is identical to codepoint order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's fair, I had assumed it was something similar to that affect but was hoping there was something built in as this seems like it would be easy and common enough. I am new to scality so I won't be able to merge this but its got my approval if that means anything.

Signed-off-by: Pepijn Van Eeckhoudt <[email protected]>
@rahulreddy
Copy link
Collaborator

@pepijnve I will ask the team to review this, one more review I will merge it in the branch. You can squash your commits after that 😃

Copy link
Contributor

@ilkescality ilkescality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants