Add StringUtils.truncateToByteLength #1392

Open: wants to merge 2 commits into base: master

kiddos commented May 27, 2025

We sometimes need to store Unicode text in a fixed amount of space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and cut it at the byte limit: that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) handles this by truncating the string to at most maxBytes encoded bytes while preserving valid character boundaries.
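For illustration, here is a minimal sketch of how such a method could work. It is not the code from this pull request: the class name `ByteLengthTruncationSketch` is hypothetical, and the approach (letting a `CharsetEncoder` fill a buffer of `maxBytes` and stop at the last character that fits) is one assumed way to get the described behavior. The null handling and the exception for a negative limit are also assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the implementation proposed in this PR.
final class ByteLengthTruncationSketch {

    /**
     * Returns the longest prefix of str whose encoding in charset fits in
     * maxBytes, without splitting an encoded character or a surrogate pair.
     */
    static String truncateToByteLength(String str, int maxBytes, Charset charset) {
        if (str == null) {
            return null; // assumption: null in, null out
        }
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        // REPLACE keeps the encoder going on lone surrogates or unmappable
        // characters; the replacement bytes count against the limit.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(str);
        // The encoder stops with OVERFLOW when the next character does not
        // fit; in.position() then marks the last fully encoded character.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }

    public static void main(String[] args) {
        // Rocket, sparkles, party popper: 4 + 3 + 4 bytes in UTF-8.
        String emoji = "\uD83D\uDE80\u2728\uD83C\uDF89";
        String cut = truncateToByteLength(emoji, 7, StandardCharsets.UTF_8);
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // prints 7
        System.out.println(truncateToByteLength("hello", 3, StandardCharsets.UTF_8)); // prints hel
    }
}
```

Two caveats with this sketch: charsets that emit a byte-order mark (such as `UTF-16`) spend part of the budget on the mark, and stateful encoders may need extra bytes for a final flush that this code does not reserve.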

ecki commented May 28, 2025

Agreed, this is very useful when dealing with UTF-8 databases. I wonder if it should have a UTF-8 variant that does not have to re-truncate: it can just look at the byte patterns at the boundary.

The current version does not handle UTF-16 code units properly (substring might cut a surrogate pair in half).
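The UTF-8 variant suggested here could look roughly like the sketch below. The class and method names are hypothetical. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so if the first byte past the cut is a continuation byte, the cut lies inside a multi-byte sequence and can be moved back to that sequence's lead byte. Because a surrogate pair encodes to a single 4-byte sequence, this also avoids cutting a pair in half.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a UTF-8-only variant; not part of this PR.
final class Utf8BoundarySketch {

    static String truncateUtf8(String str, int maxBytes) {
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (maxBytes >= bytes.length) {
            return str; // already fits
        }
        int end = Math.max(maxBytes, 0);
        // bytes[end] is the first byte that would be dropped. While it is a
        // continuation byte (10xxxxxx), the cut is mid-sequence: back up.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Rocket (4 bytes), sparkles (3 bytes), party popper (4 bytes).
        String emoji = "\uD83D\uDE80\u2728\uD83C\uDF89";
        System.out.println(truncateUtf8(emoji, 6).length()); // prints 2 (one surrogate pair)
        System.out.println(truncateUtf8(emoji, 7).length()); // prints 3
    }
}
```

This sketch still encodes the whole string up front; a tighter version could encode only as much input as can possibly fit in `maxBytes`.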

garydgregory (Member) left a comment

Hello all,

I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770

kiddos (Author) commented May 28, 2025

I added some test cases for emoji characters 🚀✨🎉.
In testing, I found that with the current implementation the escaped form ("\uD83D\uDE80\u2728\uD83C\uDF89") worked,
but the literal "🚀✨🎉" did not.

After adding <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in pom.xml, "🚀✨🎉" seems to work.
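The difference between the escaped and literal forms can be reproduced in plain Java. The sketch below is an illustration only: it assumes the compiler was reading the UTF-8 source file with a single-byte encoding, and it uses ISO-8859-1 as a stand-in because the actual default encoding on the build machine is not stated in this thread. The class name is hypothetical. Unicode escapes are resolved by the compiler independently of the file encoding, which is why they kept working.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a source-encoding mismatch.
final class SourceEncodingSketch {

    // Escapes do not depend on how the source file is decoded.
    static final String EMOJI = "\uD83D\uDE80\u2728\uD83C\uDF89";

    // Simulates reading UTF-8 bytes with the wrong (single-byte) charset,
    // assumed here to be ISO-8859-1 purely for demonstration.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(EMOJI.length());                                // prints 5 (UTF-16 code units)
        System.out.println(EMOJI.codePointCount(0, EMOJI.length()));       // prints 3
        System.out.println(EMOJI.getBytes(StandardCharsets.UTF_8).length); // prints 11
        // Each UTF-8 byte becomes its own char, so the test string changes.
        System.out.println(misdecode(EMOJI).length());                     // prints 11
        System.out.println(misdecode(EMOJI).equals(EMOJI));                // prints false
    }
}
```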

garydgregory (Member) commented

@kiddos
Please see my previous comment.

kiddos (Author) commented May 28, 2025

Oh, right.
Handling grapheme clusters is just tricky.
The code point solution you mentioned does seem to work.
I'll add more tests using grapheme clusters.

garydgregory (Member) commented

I'm not requesting support for grapheme clusters at runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770
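As a sketch of what "setting expectations" might look like, the checks below pin down how a code-point-based truncation behaves on grapheme clusters. The helper `truncateUtf8ByCodePoint` is a hypothetical stand-in written for this example (UTF-8 only, well-formed input assumed), not the method under review, and plain checks are used instead of JUnit to keep the example self-contained. The expectation being recorded is that clusters are not preserved: a truncation can keep a base letter and drop its combining mark, or leave a dangling zero-width joiner.

```java
// Hypothetical expectation-setting checks; not the tests added in this PR.
final class GraphemeExpectationSketch {

    // Code-point-aware truncation for UTF-8, assuming well-formed UTF-16 input.
    static String truncateUtf8ByCodePoint(String str, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < str.length()) {
            int cp = str.codePointAt(i);
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + len > maxBytes) {
                break; // next code point does not fit
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return str.substring(0, i);
    }

    static void expect(String actual, String expected) {
        if (!actual.equals(expected)) {
            throw new AssertionError("unexpected truncation result");
        }
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT: one cluster, 1 + 2 bytes.
        // Documented expectation: the accent is dropped, the cluster is split.
        expect(truncateUtf8ByCodePoint("e\u0301", 2), "e");

        // MAN + ZERO WIDTH JOINER + WOMAN: one cluster, 4 + 3 + 4 bytes.
        // Documented expectation: a dangling joiner may remain.
        String couple = "\uD83D\uDC68\u200D\uD83D\uDC69";
        expect(truncateUtf8ByCodePoint(couple, 8), "\uD83D\uDC68\u200D");

        // Regional indicators U + S (one flag cluster), 4 + 4 bytes.
        // Documented expectation: a single regional indicator survives.
        expect(truncateUtf8ByCodePoint("\uD83C\uDDFA\uD83C\uDDF8", 4), "\uD83C\uDDFA");

        System.out.println("ok");
    }
}
```

If grapheme clusters were supported later, these are the assertions that would change, which makes the current behavior explicit in the meantime.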

3 participants