-
-
Notifications
You must be signed in to change notification settings - Fork 7
Buffer.byteLength #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This greatly affects performance of e.g. |
So we have that
This ends up calling
Let us focus on...
The So you should be able to replace...
with the accelerated version... auto v = flat.ToUC16Vector();
utf8_length = simdutf::utf8_length_from_utf16(v.start(), v.length()) In a future release of simdutf, we could just do... int String::Utf8Length(Isolate* v8_isolate) const {
i::Handle<i::String> str = Utils::OpenHandle(this);
str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
int length = str->length();
if (length == 0) return 0;
i::DisallowGarbageCollection no_gc;
i::String::FlatContent flat = str->GetFlatContent(no_gc);
DCHECK(flat.IsFlat());
int utf8_length = 0;
if (flat.IsOneByte()) {
auto v = flat.ToOneByteVector();
return simdutf::utf8_length_from_latin(v.start(), v.length());
}
auto v = flat.ToUC16Vector();
return simdutf::utf8_length_from_utf16(v.start(), v.length());
} It looks easy enough but this code I am modifying is in v8, not in Node. If you don't want to modify v8, and I trust that you do not, then you need to intercept all of the You have some fun code in Node, like this...
Of course, The question is therefore entirely a software engineering question. We know how to write the code, the simdutf library is already in node. So how does one do it? |
Has anyone explored integrating simdutf (or some equivalent library) into V8? Also, you can effectively emulate |
I think doing anything in v8 will be difficult. |
We had a very productive collaboration with the @nodejs/v8 team before. It is of course a bit different to code there but I would favor to first try that. As such, let's discuss the possibility of integrating |
@BridgeAR Sure! Who can champion that? |
A memcpy is cheap, but if you need to do extra memory allocations just to scan the data, it is a non-starter. |
I think the problem is that flattening the strings (notice the |
Another way to solve it could be to add a new V8 API that returns whether the string is already flatten, and the fast API call gives up if it gets a false. Then in JS land we can do use some heuristics (like indexing into the strings) to flatten it before invoking the fast call. If that heuristics fails, we'd still go back to the slow branch. |
Right. So the function above, already in v8,
Isn't it what |
Maybe stupid question... but why flatten it at all? Why not just walk the string rope and get the size of each part? |
Sorry, I should've been clearer, #52 (comment) was about using V8 fast API calls for the method, not about switching from |
This is more of an V8 question, and my guess is when someone calls |
And regarding replacing unibrow with simdutf, there is also a (pretty old) plan to switch unibrow to ICU altogether: https://bugs.chromium.org/p/v8/issues/detail?id=5500 For non-Chromium embedders like Node.js though the easier upstream request might be to just extend the String API a bit so that whatever used to calculate the size of the result or do the actual conversion is a parameter to a new API that we can customize. So for example, this is more likely to be accepted: enum class ContentType {
kLatin1,
kUTF16LE,
};
using ByteLengthCalculator = int (*)(ContentType, const void* start, size_t size);
int String::Utf8Length(Isolate* v8_isolate, ByteLengthCalculator calculator) const {
i::Handle<i::String> str = Utils::OpenHandle(this);
str = i::String::Flatten(reinterpret_cast<i::Isolate*>(v8_isolate), str);
int length = str->length();
if (length == 0) return 0;
i::DisallowGarbageCollection no_gc;
i::String::FlatContent flat = str->GetFlatContent(no_gc);
DCHECK(flat.IsFlat());
int utf8_length = 0;
if (flat.IsOneByte()) {
auto one_byte_vector = flat.ToOneByteVector();
return calculator(ContentType::kLatin1,
one_byte_vector.begin(), one_byte_vector.size());
}
auto two_byte_vector = flat.ToUC16Vector();
return calculator(ContentType::kUTF16LE,
two_byte_vector.begin(), two_byte_vector.size());
}
|
if (flat.IsOneByte()) {
auto one_byte_vector = flat.ToOneByteVector();
return calculator(ContentType::kLatin1,
one_byte_vector.begin(), one_byte_vector.size());
} Is there any point with latin strings? Isn't it always a case of |
The original implementation calculates the length of any character > 127 as 2, my guess is that's part of the eternal one byte string contract (which can choose to merge two characters into one uint8_t to save space). But we'd need a proper review to find out. I can submit a CL to test the water - this seems fairly straight-forward, so I guess an issue isn't strictly necessary, though it might help to get a better design if we have some kind of doc to explain what's needed? (e.g. if we want to also avoid flattening). |
I think that one byte is always latin 1 (iso-8859-1). In that case, all character values greater or equal to 128 map to two bytes in UTF-8. That's true also of windows-1252, iso-8859-2 and iso-8859-9 (maybe there are others). I think that all of the iso- encodings are extensions of ASCII, so <=128 is just ASCII (one byte in UTF-8). There are several one-byte European formats like iso-8859-5, iso-8859-7, iso-8859-10, iso-8859-11 that translate into more than 2-byte UTF-8 characters some of the case. Obviously, non-European one-byte formats do that too. |
Your proposal is more useful than just computing bytes... Once you have access to the underlying bytes, you can copy them, transcode them, filter them... |
Actually there is another way to implement a fast path for sequential one byte strings - just use the V8 fast API calls for these (which can only be hit if the string is already a one byte string that does not require flattening). My local benchmark shows a ~30% performance improvement for something like |
The fundamental issue is not in byteLength, it's in how V8 handle js strings in memory. Strings are represented as a tree (so concatenation is fast), but counting the length or linearizing it for further operations is costly. Consider that we count the length of a string twice for HTTP responses (one for the headers, one for allocating the malloc before the actual libuv write). If we didn't flatten the actual string, we would have to walk the tree 3 times. Improving this was why we develop the flatstr module many years ago. The need for it was superseded once counting the byte length of a string started flattening it. |
So no fast call possible when the string is non-ASCII? |
Right, so the flattening on access is deliberate. |
It's the concatenation that's causing the issue, other strings that V8 currently have can be turned into a sequential one one way or another. A cons string needs to be flatten in some way before it can be exposed to the fast call, since it's unlikely that V8 would want to expose the internal structure to that. |
@lemire By the way does simdutf implements anything that does what the hypothetical |
@joyeecheung We will be adding support for latin 1, as it has been requested by various other users of the library: not only length computations but also transcoding. It is not present today. A SIMD-based |
Does simdutf + V8 fast calls do anything for
Buffer.byteLength
which currently can be a bottleneck for e.g. http servers which are writing string to the response object.The text was updated successfully, but these errors were encountered: