Skip to content

Fixes #2182. Fix output of non-BMP Unicode characters in NetDriver. #2183

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 4, 2022

Conversation

cpacejo
Copy link
Contributor

@cpacejo cpacejo commented Nov 5, 2022

Fixes #2182. Breaks non-BMP Unicode characters into their surrogate pairs to work properly with System Console.

Widths for the test characters are wrong, but this is due to unrelated issue with NStack #86.

Pull Request checklist:

  • I've named my PR in the form of "Fixes #issue. Terse description."
  • My code follows the style guidelines of Terminal.Gui - if you use Visual Studio, hit CTRL-K-D to automatically reformat your files before committing.
  • My code follows the Terminal.Gui library design guidelines
  • I ran dotnet test before commit
  • I have made corresponding changes to the API documentation (using /// style comments)
  • My changes generate no new warnings
  • I have checked my code and corrected any poor grammar or misspellings
  • I conducted basic QA to assure all features are working

@BDisp
Copy link
Collaborator

BDisp commented Nov 5, 2022

@cpacejo I appreciate your contribution and for discovering of this issue. Before I make the changes in the NStack repo, the CharaterMap scenario in the UICatalog, which I ask you to test there your PR, was rendering very badly and with incorrect buffer size. At the time I realize that were a pattern where the bad rendering was happens. The pattern I found is in the range of the changes I have made, which was the least significant 16 bits. The gethexaformat or bitwise c & 0xFFFF is badly in the MakePrintable method. With your issue it seems that it fall into surrogate pairs which the resulting rune by itself is a wide char, but where reading as sequential separated runes result in a single width.
You don't need to do that changes in the UpdateScreen method, because the bug is here, which affect all the drivers and in the AddRune of each driver it must be added each surrogate pair into each separated column:

public static Rune MakePrintable (Rune c)
{
var controlChars = c & 0xFFFF;
if (controlChars <= 0x1F || controlChars >= 0X7F && controlChars <= 0x9F) {
// ASCII (C0) control characters.
// C1 control characters (https://www.aivosto.com/articles/control-characters.html#c1)
return new Rune (controlChars + 0x2400);
}
return c;
}

After I did the changes above and considering theses runes as single char the saga continues, see image bellow. Is the same problem that I was getting before, but now the right char is printed, thanks to you:

On NetDriver:
imagem

On WindowsDriver:
imagem

On CursesDriver:
imagem

If you want I can submitted a PR to your branch and maybe we could work together in this fix. Let me know, please.

@BDisp
Copy link
Collaborator

BDisp commented Nov 6, 2022

Somebody know how to write surrogate pairs with Curses.addwstr?

I already tried with the below but none print anything. Maybe it doesn't supported:

Curses.addwstr (new string (spair));
Curses.addwstr ($"'{spair [0]}''{spair [1]}'");
Curses.addwstr ($"\'{spair [0]}'\'{spair [1]}'");
Curses.addwstr ($"\\'{spair [0]}'\\'{spair [1]}'");

@cpacejo
Copy link
Contributor Author

cpacejo commented Nov 6, 2022

Thanks for looking into this.

I don't understand var controlChars = c & 0xFFFF; -- why should, for example, U+10009 (a Linear B character, "𐀉") be treated the same as U+0009 (TAB)?

in the AddRune of each driver it must be added each surrogate pair into each separated column

Non-BMP characters aren't necessarily wide -- if you split them into two separate columns, I think that would make those two columns print as only one (narrow) character width on screen, making the line too short. Just like in the screen shot you show.

I think it makes sense to store the full code point (even if it is non-BMP) in a single column, just like is already done. The code point just needs to be sent to the console driver in a way that the driver can understand. System.Text.StringBuilder and System.Console expect a char, which (in .NET) is a 16-bit type which represents a UTF-16 character, hence the Unicode code point must be encoded as UTF-16 (using surrogate pairs) to be sent to it.

However on Linux in C, wchar_t (as used by curses) is a 32-bit UTF-32 value, which means surrogate pairs cannot be used (they exist only for UTF-16). I tested just now in a C program, addwstr works correctly with UTF-32 characters. (It is necessary to setlocale(LC_ALL, "") first, which I don't understand.) e.g.:

    wchar_t str[] = { 0x41, 0x211d, 0x1d53d, 0x21, 0 };
    addwstr(str);

displays "Aℝ𝔽!"

I don't know how this translates to the Terminal.Gui wrapper for curses, which seems to use a string (UTF-16) argument for addwstr, which doesn't make sense to me.

(I never use the curses driver, as it just results in black and white squares on my screen, which I haven't had time to debug.)

@BDisp
Copy link
Collaborator

BDisp commented Nov 6, 2022

Yes you are right it's bad. I now see it's a hack for wide char. The problem is how to encode a surrogate pair in one column as you said, because in WindowsDriver and CursesDriver I can't handle on the output as you did in the NetDriver. Thanks.

@BDisp
Copy link
Collaborator

BDisp commented Nov 6, 2022

@cpacejo please confirm that the spacing in each char in the image must exist. If yes, then it must be used the superview color to avoid the black square and also be considered as wider, otherwise the line would be shorter and misaligned.

imagem

@cpacejo
Copy link
Contributor Author

cpacejo commented Nov 6, 2022

No, there should be no spaces between those characters. As I mentioned in NStack bug #86, I believe this is occurring because Rune.ColumnWidth is treating characters like U+1D539 as U+D539 for example.

@cpacejo
Copy link
Contributor Author

cpacejo commented Nov 6, 2022

The problem is how to encode a surrogate pair in one column as you said, because in WindowsDriver and CursesDriver I can't handle on the output as you did in the NetDriver.

Looking at WindowsDriver, it seems it would have to be rewritten somewhat to follow a similar model as NetDriver. Or modify WindowsConsole.CharUnion to be able to store a surrogate pair. I can take a stab at this, but I don't often have access to a Windows machine to test on so it may take a few days.

@BDisp
Copy link
Collaborator

BDisp commented Nov 6, 2022

So, the bug is in the ConsoleDriver on the MakePrintable method that should be like this:

public static Rune MakePrintable (Rune c)
{
	if (c <= 0x1F || (c >= 0X7F && c <= 0x9F)) {
		// ASCII (C0) control characters.
		// C1 control characters (https://www.aivosto.com/articles/control-characters.html#c1)
		return new Rune (c + 0x2400);
	}

	return c;
}

@tig
Copy link
Collaborator

tig commented Nov 7, 2022

FWIW (and I'm just 🍿 here), I noticed that even VS2020 has issues with glyphs in this range:

image

Y'all are doing important work on this issue. Carry on!

@BDisp
Copy link
Collaborator

BDisp commented Nov 14, 2022

FWIW (and I'm just 🍿 here), I noticed that even VS2020 has issues with glyphs in this range:

image

Y'all are doing important work on this issue. Carry on!

Well I already done many tests and it seems that the U+10000..10FFFF range is considered wider code points. I also got rid the gethexaformat method. I only got all aligned if I consider as wider the range above.
But this is going cause the TextFormatter.HotKeyTagMask which is equal to U+100000, to transform the hotkeys, fall to this range, falsifying the column width. I have to consider reviewing this.

@BDisp
Copy link
Collaborator

BDisp commented Nov 18, 2022

I already got WindowsDriver work with surrogate pairs but it only renders well on the Windows Host Console

windowsdriver-console-host

On Windows Terminal it isn't render well. I hope someone of the Windows Terminal team can really test if surrogate pairs is working well indeed.

windowsdriver-windows-terminal

Now I'm going trying fix the CursesDriver.

@BDisp
Copy link
Collaborator

BDisp commented Nov 20, 2022

Now I'm going trying fix the CursesDriver.

CursesDriver still don't show the code points in the U+10000...U+10FFFF range. On a C program I can print on a console by using ncursesw flag to the compiler but not yet with the Terminal.Gui.

With WindowsDriver on the Windows Terminal the misalignment is always present event I consider the width equal to one, by commenting the range. You can test by using the pr gui-cs/NStack#89.

imagem

imagem

@BDisp
Copy link
Collaborator

BDisp commented Nov 21, 2022

With CursesDriver only ℝ (U+211D), which isn't a surrogate pair, appear:

imagem

Copy link
Collaborator

@tig tig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But note debugging terminal apps in my VS2022 install is currently broken so I didn't test this myself. (I'm waiting on the Terminal team for a fix they want me to try).

Before I merge this can y'all confirm you really believe it's ready to merge? (@BDisp @cpacejo)

@BDisp
Copy link
Collaborator

BDisp commented Dec 3, 2022

LGTM. But note debugging terminal apps in my VS2022 install is currently broken so I didn't test this myself. (I'm waiting on the Terminal team for a fix they want me to try).

Before I merge this can y'all confirm you really believe it's ready to merge? (@BDisp @cpacejo)

Only fixes the NetDriver, but no harm will be done if you merge it.

@tig
Copy link
Collaborator

tig commented Dec 4, 2022

So it does NOT completely close #2182?

@BDisp
Copy link
Collaborator

BDisp commented Dec 4, 2022

So it does NOT completely close #2182?

Yes it closes because its only mention the System.Console

@tig tig merged commit bafb4bb into gui-cs:develop Dec 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-BMP characters display incorrectly with System Console driver
3 participants