Skip to content

gitk: Fixing file name encoding issues. #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

kkato233
Copy link
Contributor

fix: file name encoding issues.
fix: when resolving merge conflicts, japanese file names become garbled.

Cc: Johannes Sixt [email protected]

@kkato233
Copy link
Contributor Author

Steps to reproduce the issue.

echo "ABC" > ABC.txt
echo "あいう" > あいう.txt
git add *
git commit -m "ABC あいう"

gitk

pict 2025-02-04 231612

When using

gitk ABC.txt

the history is displayed,

pict 2025-02-04 232713

but when using

gitk あいう.txt

the history is not displayed.

pict 2025-02-04 232732

fix this.

pict 2025-02-04 234644

@kkato233
Copy link
Contributor Author

When resolving merge conflicts, Japanese file names become garbled.

Branch A: Change from 1 to 2
Branch B: Change from 1 to 3
When merging Branch A and Branch B and resolving conflicts,
the diff display for that file becomes incorrect.

git checkout -b master
echo "あいう" > あいう.txt
git add *
git commit -m "add あいう"

git checkout -b A
echo "123" >> あいう.txt
git add *
git commit -m "add 123"

git checkout -b B master
echo "456" >> あいう.txt
git add *
git commit -m "add 456"

git merge A
... fix conflict ...

git add *
git commit -m "fix comflict"

gitk

invalid filename

pict-2-ng

fix then.

pict-3-fix

@kkato233
Copy link
Contributor Author

kkato233 commented Feb 19, 2025

link git/git#1886

@j6t
Copy link
Owner

j6t commented Feb 19, 2025

When using

gitk ABC.txt

the history is displayed.
but when using

gitk あいう.txt

the history is not displayed.

I cannot reproduce this behavior. The file changes are displayed in both cases. My locale here is en_US.utf8. The file name is encoded in UTF-8 (I assume), and I am running the commands from an xterm. What is different in our setup?

@avih
Copy link
Contributor

avih commented Feb 19, 2025

Steps to reproduce the issue.

echo "ABC" > ABC.txt
echo "あいう" > あいう.txt
git add *
git commit -m "ABC あいう"

gitk

On windows, this doesn't have an issue for me.

All 3 places marked in red at the image show the same text here (I can't read it, but it looks the same text like at the browser in this issue).

When using

gitk ABC.txt

the history is displayed.

but when using

gitk あいう.txt

the history is not displayed.

This does happen for me too. Specifically, the diff at the bottom pane is missing only when using the Japanese file name.

@j6t
Copy link
Owner

j6t commented Feb 19, 2025

So, this is Windows-specific? What is the explanation for this behavior? Is it that the diff command issued by Gitk does not produce output, because it did not pass a file name to git that is in the expected encoding? And why does it help to override the Tcl system encoding? I am asking because the solution does not look sound and papers over the actual problem.

@avih
Copy link
Contributor

avih commented Feb 19, 2025

I don't know which platform OP uses.

I only did a quick test on windows, and not with gitk master.

I'll test again on windows and linux with several combinations of shell and gitk, and report back, hopefully soon.

@j6t
Copy link
Owner

j6t commented Feb 19, 2025

This review system is b... cr.... unusable. I'll paste the commented diff here:

@@ -8205,12 +8205,13 @@ proc parseblobdiffline {ids line} {
 
         if {$type eq "--cc"} {
             # start of a new file in a merge diff
-            set fname [string range $line 10 end]
+            set fname_raw [string range $line 10 end]
+            set fname [encoding convertfrom $fname_raw]
             if {[lsearch -exact $treediffs($ids) $fname] < 0} {
                 lappend treediffs($ids) $fname
                 add_flist [list $fname]
             }
-
+            set fname $fname_raw

This part populates treediffs, because in merge commits, we do not run gettreediffs (which does it for normal diffs). But to do that, we need the encoded filename. But then later we have to revert to the raw version, because the makediffhdr that is called later expects the raw version.

I would prefer to have the raw name in fname and a new variable with the encoded version that is then used in the if clause.

Please explain why this change is needed:

@@ -8310,6 +8311,7 @@ proc parseblobdiffline {ids line} {
             set diffinhdr 0
             return
         }
+        set line [encoding convertfrom $line]
         $ctext insert end "$line\n" filesep
 
     } else {

It is about renamed from, renamed to, etc. Are these affected, too? If so, this is not about merge diffs. Please either make a new commit with this change with its own justification.

@j6t
Copy link
Owner

j6t commented Feb 19, 2025

Please look at the history of this repository and the commit titles. Use the prefix gitk: for every commit. Also, write the problem statement in the commit message.

Generally, write the status quo (where the bug exists) in present tense, and then state what has to change in imperative mood.

@j6t
Copy link
Owner

j6t commented Feb 19, 2025

Part of this change may become unnecessary once #4 is merged. If you want, you can base your changes on that branch. Notice that the encoding commands now call for utf-8 explicitly.

@avih
Copy link
Contributor

avih commented Feb 19, 2025

Is it that the diff command issued by Gitk does not produce output, because it did not pass a file name to git that is in the expected encoding?

I don't know, but if you tell me which git commands to run at the command line (posix shell), I can tell you what the output is. Feel free to request as many commands as you think would help you get a better understanding.

So far I confirmed that gitk master has the same result as my report above (gitk without args show correctly as far as I can tell, gitk ABC.txt shows the commit message+diff of ABC.txt, but the diff is missing if using the jap filename as argument).

I also applied the commit "gitk: Unicode file name support" from #4, but that did not change the result - the diff with the jap filename argument is still missing.

(I'm not seeing other commits in that PR which could affect encoding issues).

EDIT: If I apply the two commits in this PR instead of the commit from #4, then it DOES fix the missing diff issue with the jap file name arg (I can't judge the correctness of this PR, but the result here seems as intended).

EDIT2: the first commit in this PR alone (only adding encoding system utf-8) is enough to fix the missing diff issue for me. I didn't test merge conflicts which the 2nd commit is supposed to address.

So far all the above are on Windows using the unicode variant of busybox-w32 shell+env (and "git" itself at the PATH is from the git-for-windows project).

tcl/tk are 8.6.16 which I cross-compiled on Linux for windows.

Specifically, I also didn't yet test with the git-for-windows bash environment (I don't expect differences, but I'll report once I test that and linux too).

@avih
Copy link
Contributor

avih commented Feb 20, 2025

Specifically, I also didn't yet test with the git-for-windows bash environment (I don't expect differences, but I'll report once I test that and linux too).

Yeah, as expected, git-bash.exe from the git-for-windows project (with tcl/tk which comes with it) behaves identically to as described above, including the two "EDIT" comments.

That's git 2.46.2, same version which was used with the tests above using the busybox-w32 shell.

The missing diff issue also happens with gitk which is shipped with that git-for-windows (which I presume is slightly behind gitk master currently).

But as noted, same missing diff with gitk master, and same fixed using the 1st commit in this PR.

On linux (Ubuntu, git 2.34.1, all the non-empty LC_* values are en_US.UTF-8) I can't reproduce the missing diff issue at all. Not with gitk which ships with the system git, and not with gitk master.

I didn't test the merge conflict case with any system, and I don't have further data points to share, unless requested specifically.

@j6t
Copy link
Owner

j6t commented Feb 20, 2025

@avih Thank you for doing the tests.

I patched Gitk to tell the commands it runs. Here are the interesting ones:

/usr/local/bin/git rev-list -1 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- あいう.txt
/usr/local/bin/git diff-index --cached --ignore-submodules=dirty HEAD -- あいう.txt

(Replace the commit ID by HEAD.)

The issue is that the Japanese text enters Gitk as a command line argument. I don't know whether Tcl picks the Unicode version of the arguments on Windows and keeps them, or if it picks the "ANSI" version and translates them to Unicode. You are in big trouble if it is the latter, but the "ANSI" codepage is unable to hold the Unicode characters typed into the terminal. I am trying to understand if that is the problem.

But I still do not understand why encoding system utf-8 helps: by the time it is set, the command line arguments will already have been processed and this setting cannot help to read them correctly. Also, commit 5eb02dd explicitly mentions that setting the system encoding is discouraged by Tcl. I could find this Tcl wiki page, while the manual doesn't say anything in this respect.

As long as there is no clear reasoning why we must dismiss such a warning, a change that overrides the system encoding is not acceptable.

@kkato233
Copy link
Contributor Author

kkato233 commented Feb 20, 2025

@j6t

But I still do not understand why encoding system utf-8

Based on the information from https://www.tcl.tk/man/tcl/TclCmd/encoding.htm

encoding system ?encoding?
Set the system encoding to encoding. 
If encoding is omitted then the command returns the current system encoding. 
The system encoding is used whenever Tcl passes strings to system calls.

In the Gitk code around line 12000, add

encoding system utf-8

the following before

set tclencoding [tcl_encoding $gitencoding]:

This ensures that command line parameters passed when calling Git commands are treated as UTF-8 instead of SJIS (or ANSI) .

@j6t
Copy link
Owner

j6t commented Feb 20, 2025

In the Gitk code around line 12000, add

encoding system utf-8

the following before

set tclencoding [tcl_encoding $gitencoding]:

This ensures that command line parameters passed when calling Git commands are treated as UTF-8 instead of SJIS (or ANSI) .

How so? proc tcl_encoding just converts an encoding name to one that Tcl understands. Variable tclencoding is only used to convert output of git commands to Tcl strings (Unicode); typically it is already UTF-8. Neither of these care about the encoding to use for system calls.

Furthermore, Gitk itself does not make system calls where the names of files in the repository are used. All interaction with the repository goes via git commands that are invoked via the proc open. The encoding is only relevant when the constructed commands are transformed into a fork+exec on Unix or a CreateProcess on Windows.

Ah, indeed, things may go wrong at this stage. The commands may still look correct on the Tcl side (like those I posted above). How do these strings arrive in Git? Are they corrupted without encoding system utf-8?

The commit 5eb02dd seems to have the same effect as

encoding system utf-8

Now I am confused. Does it fix the problem that we are trying to solve here?

@avih
Copy link
Contributor

avih commented Feb 20, 2025

I patched Gitk to tell the commands it runs. Here are the interesting ones:

/usr/local/bin/git rev-list -1 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- あいう.txt
/usr/local/bin/git diff-index --cached --ignore-submodules=dirty HEAD -- あいう.txt

(Replace the commit ID by HEAD.)

So, I started with this sequence, which the the first reproduce sequence above, but more complete to ensure initial state (that's also how I tested before, but adding now for completeness. I don't know whether the fact that only 1 commit exists affects this issue):

rm -rf tmp && mkdir tmp && cd tmp
git init
echo "ABC" > ABC.txt
echo "あいう" > あいう.txt
git add *
git commit -m "ABC あいう"
gitk

And it does show the two modified files (both at the files pane, and at the diff pane), with correct names as far as I can tell. I.e. both files were added correctly to the repo.

For reference, I then ran git show, which does show correctly the two modified/added files and their diff.

Then I used your commands, but with HEAD instead of an explicit hash (exactly as shown below, at the shell prompt '$', let me know if you think explicit hash is needed instead):

$ git rev-list -1 HEAD -- あいう.txt
f3431b1756435a4bddb3627064c9ab34fbd54dd9

$ git diff-index --cached --ignore-submodules=dirty HEAD -- あいう.txt

$ echo $?
0

So it "succeeded", but assuming the last command is supposed to show the diff of that file, it's empty.

Just as a reference, I tried also the other file name:

$ git diff-index --cached --ignore-submodules=dirty HEAD -- ABC.txt

$ echo $?
0

So this also shows an empty diff?!

I presumed this command is supposed to show the diff (which gitk would then display at the diff pane), right?

If not, can you tell me which command gitk executes in order to get the dif part when running gitk <japanese filename> ?

(or better yet, share your patch to add debug messages of the invoked git commands).

The same complete sequence and identical result (up to the hash printed from git rev-list ..., and without the rm -rf in cmd.exe - which doesn't support this form) happens in all tested windows environments:

  • busybox-w32 unicode shell (ash, posix, own-compiled tcl/tk).
  • git-for-windows git-bash.exe environment (bash, posix, shipped tcl/tk).
  • git-for-windows git-cmd.exe environment (cmd.exe, windows, shipped tcl/tk).

So this probably rules out differences in environments which use git and gitk, as well as different tcl/tk builds, at least as far as the missing-diff issue is concerned.

@avih
Copy link
Contributor

avih commented Feb 20, 2025

Also for reference: git log あいう.txt does show the commit, while git log あいう.txtXXX gives an error, both as expected, so at least in these cases (and above where the name is used from CLI), the name appears to be taken correctly by git.

@j6t
Copy link
Owner

j6t commented Feb 20, 2025

@avih Appologies, the commands that actually produce the patch text are:

/usr/local/bin/git diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- ABC.txt
/usr/local/bin/git diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- あいう.txt

I dismissed them on my first reading because I usually have the option Limit diff to listed paths disabled, and then there was no file name.

The diff is simply

diff --git a/gitk b/gitk
index d18a11c..d5944fa 100755
--- a/gitk
+++ b/gitk
@@ -138,6 +138,7 @@ proc open {args} {
     if {[string range $arg0 0 0] eq "|"} {
         set command_line [string trim [string range $arg0 1 end]]
         lset args 0 "| [sanitize_command_line $command_line 0]"
+        puts "$args"
     }
     uplevel 1 real_open $args
 }

and then you invoke ./gitk (or wish ./gitk on Windows).

However, the interesting question is how the command arguments actually arrive in git.exe.

@avih
Copy link
Contributor

avih commented Feb 20, 2025

the commands that actually produce the patch text are:

/usr/local/bin/git diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- ABC.txt
/usr/local/bin/git diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root 660e68dd06a8f2a2139afd30682f0e18108f3b6b -- あいう.txt

Assuming this hash is HEAD (and HEAD is what I used at the actual command), both commands at the CLI print the correct diff.

Not sure if relevant, but it works for me also without --textconv (tested the japanese name).

The diff is simply ...
and then you invoke ./gitk (or wish ./gitk on Windows).

Thanks. I applied it and indeed I see the used commands printed at the terminal while running tclsh path/to/modified/gitk ....

For future reference, using wish instead of tclsh did not print the commands at the console. I'm guessing because wish on windows is a GUI application (not console) so its stdout goes someplace else. At least apparently not to the console from which it was invoked.

Anyway, the output with the japanese name argument is interesting. With current gitk master (4a6cc6a), this name appears in gibberish form, same in all places:

{| git.exe log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin {<<f3431b1756435a4bddb3627064c9ab34fbd54dd9
--
ã‚ã„ã†.txt}} r
{| git.exe rev-list -1 f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- ã‚ã„ã†.txt} r
{| git.exe diff-index --cached --ignore-submodules=dirty HEAD -- ã‚ã„ã†.txt} r
.git/gitk.cache r
{| git.exe rev-list --parents --all} r
{| git.exe diff-tree -r --no-commit-id --root f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- ã‚ã„ã†.txt} r
{| git.exe diff-files -- ã‚ã„ã†.txt} r
{| git.exe diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- ã‚ã„ã†.txt} r```

Note that there's no line-wrapping of the first command. These seem like actual two newlines inside the output of that command.

This is the bug use case, and the diff is indeed missing.

Anyway, next.

If I also add the first commit in this PR (that one line system encoding), then it shows correctly at the console:

{| git.exe log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin {<<f3431b1756435a4bddb3627064c9ab34fbd54dd9
--
あいう.txt}} r
{| git.exe rev-list -1 f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- あいう.txt} r
{| git.exe diff-index --cached --ignore-submodules=dirty HEAD -- あいう.txt} r
.git/gitk.cache r
{| git.exe rev-list --parents --all} r
{| git.exe diff-tree -r --no-commit-id --root f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- あいう.txt} r
{| git.exe diff-index --cached --ignore-submodules=dirty HEAD -- あいう.txt} r
{| git.exe diff-files -- あいう.txt} r
{| git.exe diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U3 --root f3431b1756435a4bddb3627064c9ab34fbd54dd9 -- あいう.txt} r
~/.config/git/gitk-tmp {WRONLY CREAT EXCL}

And now the diff does show at the gitk windows.

Another interesting data point:

If I use gitk master, where the filename at the output appears gibberish, but pipe it into cat, like so: tclsh .. gitk ... | cat then all the gibberish becomes correct like with the encoding patch applied, but obviously the diff is still missing at the winow.

FYI, cat in the unicode busybox-w32 environment (which is where I'm testing currently), when its output is a console, then it assumes that its input is UTF8, and formats it correctly so that it appears as Unicode at the console.

So it looks like that puts at gitk master prints explicitly a UTF8 stream (which does not display correctly if the console codepage is not UTF8, like in my case, it's CP 437), but with the system encoding patch applied it prints something which displays correctly at the windows console regardless of the default console codepage.

(obviously we don't strictly care about the console output, but it may indicate some difference with that patch applied, either with the value or representation of the arguments to puts, or in how puts uses them for the actual printout)

I'm not sure how to interpret that, but hopefully you would have some insights.

EDIT: another data point: with the encoding patch applied where the console output shows correctly without cat and the diff shows at the pane, if I pipe it into cat, then it becomes ???? at the console. I.e. in this case, at least when stdout is not a tty, puts does not print a UTF8 stream. No idea what it does print...

However, the interesting question is how the command arguments actually arrive in git.exe.

Yes. I can test more things if you want me, but I think we can rule out a bug in git where it fails to accept unicode names as arguments, and I think the evidence suggest some encoding mis-handling (or a tcl bug) between gitk and tcl/tk.

@avih
Copy link
Contributor

avih commented Feb 20, 2025

Also, not sure if applicable here, but regarding the newlines at the first command, windows does not consider newline as argument separator when parsing command line (you can't test it from cmd.exe, but you can with the CreateProcess API).

So maybe it's worth removing these newlines, if they are indeed included inside one or more of the argumets which gitk uses when invoking git.

@avih
Copy link
Contributor

avih commented Feb 20, 2025

Unrelated concretely to the code here, here's a roughly accurate overview of encoding subjects with windows:

  • Windows internally uses Unicode for everything, and it has Unicode APIs. This covers command line argument into the applications, arguments used in windows APIs, and those include invoking another process with Unicode arguments.
  • Internally, Windows Unicode encoding is UTF16-LE, and consequently the Windows wchar_t is 16 bits. Obviously this does not matter normally when using APIs, because the exact encoding doesn't matter. What matters is that it covers Unicode fully, and normally nothing gets lost due to encoding issues (non-BMP codepoints are encoded as surrogate pairs etc - full Unicode range). All the windows Unicode APIs use wchar_t arguments for strings.
  • There are two other types of string encodings on windows: "ANSI" and "MBCS". MBSC is some kind of multi-byte which I think is mainly used with legacy east-asian locales, and I know very little about it. ANSI is byte-based (normal C char strings). For whatever reasons (legacy, "normal" C APIs, whatever), most of the Unicode APIs have also equivalent ANSI API where encoding does matter, and it depends on some codepage (to convert it internally to/from Unicode). The exact codepage may depend on context, and it's not necessarily the exact same with all ANSI APIs. Notably the console input/output ANSI API codepage can differ between themselves, and between the general system codepage. Example of "standard" ANSI libc API which depends on codepage is fopen (takes char arguments), where the equivalent Unicode API which takes wchar_t arguments is _wfopen.

So from all the above, if someone sticks to the Unicode APIs, which I hope is what tcl/tk do internally on Windows, then there should not be any encoding issues as far as CLI arguments to tcl (and gitk) go, and similarly no issues by default when invoking git with unicode arguments. No encoding parameters should be observed, used, or modified. It just works.

I'm very weak with tcl/tk, so if someone could post a 1-2 lines test tcl program which "forwards" it's CLI arguments to git (without touching or specifying any specific encoding), then I can try it and confirm that it works out of the box with Unicode arguments, or that it doesn't work, and then we'd need to figure out what's missing.

But let's assume that it works out of the box without any encoding issues.

What remains to handle is the encoding of streams. Specifically in our context, at the very least stdout from git which is captured by gitk and processed further.

Streams are an issue as far as encoding go, because the internal windows UTF16-LE is highly uncommon to write to stdout, or read from stdin.

The obvious alternative encoding to use, if we wish to retain portability between applications without data loss (assuming the string is representable in Unicode), is UTF-8, which is very effective in all inter-process streams. These are stdin/out/err, but also content of text files on disk, etc, and so fputs(utf8_buf, stdout) works great for all of those, despite being an ANSI API. The byte-stream will reach its destination unmodified, and ready to be processed as UTF8 by something else.

First problem with UTF-8, however, is that typically this is not the system ANSI codepage, which Windows has for whatever reasons (at least legacy, DOS compat, but maybe more). So now the application faces an issue: it won't use UTF16-LE because no one expects that. It could use the system ANSI codepage encoding for this stream, knowing that data-loss is guaranteed in some cases, or it could decide to ignore the system ANSI codepage and use UTF-8 anyway.

Many modern portable applications do just that - use UTF8 stdin/out even if it doesn't match the system ANSI codepage. I think this is what git does too, at the very least when the respective stream is not a TTY.

But TTY is the other issue with encoding, if one is not careful.

Console Unicode APIs do exist, and using those would guarantee correct unicode display of unicode strings regardless of the console codepage. Same goes for input typed into the console. (up to missing glyphs which affect the display, but the unicode values still pass correctly via the APIs regardless).

So the right thing which a modern portable application should do on windows when printing to stdout is to use the console Unicode APIs if stdout is a TTY, or print UTF8 otherwise.

Same with stdin - use the Unicode API if it's a tty, else assume it's UTF8.

Whether the application uses wchar_t internally for everything and convert to/from UTF8 at the stream "edges", or the other way around (UTF8 internally but converted to/from wchar_t with Windows APIs`) is up to the application.

I believe "the other way around" is more common with portable applications, because then internally it can keep using C strings everywhere, which remains compatible on all platforms, but at the windows APIs edges it's converted to/from windows wchar_t.

Lesser applications can do other things. Ranging from using only ANSI Windows APIs in general (so unicode CLI arguments are not guaranteed to work, not even with fopen), and/or not do any conversion when writing to stdout, to maybe convert to the console codepage if stdout is a TTY, etc. But all of these guarantee data loss or incorrect display at least in some cases (if some value is not representable at the respective codepage).

The question is where do git, gitk, and tcl stand in this landscape.

Hopefully everything works out of the box between gitk and git with any Unicode arguments, and that the only non-obvious thing which gitk should do is know that the output from git's stdout is UTF8, and take actions accordingly.

EDIT: there's another option to remain portable, and that's to specify that the ANSI APIs are UTF8. This is a magical solution for almost everything. char **argv is UTF8, fopen works with any Unicode value as normal char C string, etc. The gotchas here that the console is unaffected by these (still need to use the console APIs and fputs is still not enough), and that it's not something the application can request at runtime. Currently the way to do that is either set a special windows options that "applications whatever UTF8", or attach a UTF8 manifest to the application, which instruct the system to use UTF8 ANSI APIs (only win10+). Also, if built with UCRT (not msvcrt), then the application can set the locale at runtime to UTF8, with similar effect, but argv is not magically modified, and it only affects the crt APIs, i.e. largely libc, like fopen, but unlike CreateProcess.

@kkato233
Copy link
Contributor Author

pull request move to #9

@kkato233 kkato233 closed this Feb 20, 2025
@kkato233
Copy link
Contributor Author

see git file name encoding UTF-8 documents.

https://git-scm.com/docs/git-commit/2.13.7#_discussion

@kkato233 kkato233 deleted the fix_filename_encoding branch February 20, 2025 17:42
@avih
Copy link
Contributor

avih commented Feb 23, 2025

EDIT: this part might very well actually work, because it would get printed to stdout in cp1252, which is likely the exact same byte-stream which it got from git - which was UTF8.

I think there's still an issue which my unicode test script didn't excercise, and it fails due to the quoted comment subject, and so far the only solution I have for it is via the system encoding, and setting the fd encoding is not enough.

This is an issue after the "fix" above which configures the fd as utf8.

The test script at #7 (comment) tested, among others, that it can read UTF-8 from git's stdout, and concluded that setting either the system encoding or the fd encoding to utf-8 makes it read the correct unicode value (and then it can open it as file etc).

What it didn't test, however, is writing unicode to git's stdin, and I think it fails when the stdin string has correct unicode value.

What that EDIT comment says is that with default (ANSI) system encoding, if we read git's stdout into a variable, and then feed this variable back to git's stdin ("full circle use case"), then it seems to work, but the variable has incorrect unicode value.

That's because any arbitrary stream of bytes (maybe except byte value 0) is a valid string in the single-byte encodings (in my case cp1252).

So if we don't set the system encoding, and read git's stdin, then it's interpreted incorrectly as cp1252, but it's still a valid cp1252 byte stream, so no information is lost when it becomes (wrong) unicode string in tcl.

Then, when we feed it back into git's stdin using open ... <<$value, then this value is translated into cp1252 byte stream, which is the exact inverse of the translation it did previously, and it reproduces bit-identical stream to the one we read previously (which git processes correctly) - even though the value was wrong as far as tcl's unicode is concerned.

Now, if we fix the first read part by configuring the fd to utf-8, so that tcl has a correct unicode value, and then we try to write it to git's stdin, then, like before, it translates it into a cp1252 stream for git's stdin.

But now, because the unicode string is correct, but the default stream encoding is still cp1252, then it no longer translates to the original byte-stream which we read, and now git doesn't understand the value.

What we need here is to tell open that the value which we use at the <<$value part should be converted to a utf8 stream.

The problem is, I don't think there's a way to do that. We can do it while reading because we have an fd which we can configure, but we can't configure anything at the output stream which open generates for the <<$value thing, and this seems to use the default system encoding.

I said previously that the fixed log now looks correct, and this is true

{| git.exe log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin {<<f3431b1756435a4bddb3627064c9ab34fbd54dd9
--
あいう.txt}} r

But what it doesn't show is the byte stream which git sees, and whether git was able to digest it correctly, and I think it doesn't, or at least not always.

I'm not clear exactly about the details, but if I add to git a new file spaces あいう.txt (one space after "spaces"), and then launch gitk "spaces あいう.txt" then the missing-diff issue returns.

I know, it looks like a spaces-in-filename issue and not an encoding issue, but if I do the same with a file foo bar then there's no issue and the diff does show.

Additionally, if I set the system encoding to utf-8, then the spaces あいう.txt missing-diff is fixed.

I'm baffled, but it might be a weird combo of incorrect arguments to open (the concat/eval/list thing), combined with an encoding issue, because spaces alone or encoding alone seem fine with the patch, but spaces and encoding breaks, but then system encoding fixes it.

There might be some relation to #4 (comment) but I don't know currently what's noise and what's related.

Still investigating, and any insight others have might help. I should probably get back to the test script and add this operation too, to test it in isolations and understand whether there are new factors in play.

@avih
Copy link
Contributor

avih commented Feb 23, 2025

But now, because the unicode string is correct, but the default stream encoding is still cp1252, then it no longer translates to the original byte-stream which we read, and now git doesn't understand the value.

So this is correct. With default (ANSI) encoding and correct unicode variable (because we configured fd to utf8), when using open or exec with <<$value stdin to git, we send it garbage, because it tries to translate it to cp1252 byte-stream, but none of the 3 Japanese codepoint can be represented, and so we get 3 literal ? (0x3f) instead.

So for あいう.txt we send ???.txt, and for spaces あいう.txt we send spaces ???.txt.

I'm baffled, but it might be a weird combo of incorrect arguments to open (the concat/eval/list thing), combined with an encoding issue, because spaces alone or encoding alone seem fine with the patch, but spaces and encoding breaks, but then system encoding fixes it.

This is also correct, but for a weird reason, and gitk doesn't do the right thing with the unicode name either way, as described above.

But somehow git finds a commit with one of the names, but not with the other, despite both names in its stdin being equally garbage.

From shell, with correct names, same arguments which gitk uses, and correct utf8 stdin for git:

$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "あいう.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit f3431b1756435a4bddb3627064c9ab34fbd54dd9

$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "spaces あいう.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit 4baac406841cb30dcdbb05acccc0556f5052c154

From shell, with identical garbage names which gitk passes as stdin to git when the unicode variable has correct name:

$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "???.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit f3431b1756435a4bddb3627064c9ab34fbd54dd9

$ # why did that "work"? with the same commit-hash result like with the correct name...

$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "spaces ???.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1

# but this didn't ?!

Anyway, gitk doesn't do the right thing, and I also confirmed it by adding an identical open and read command with the same stdin just before this is executed in gitk, but invoking od -c or xxd instead of git... and indeed we send three 0x3f bytes instead of the Japanese codepoints - with both names.

So this is problem which needs to be fixed, and at least one solution is global utf8 system encoding, which makes everything just work.

It might be possible to use the same approach like the patch fixes git's stdout, but also for stdin, because I believe the fd from open is bidirectional, so we can run the open command without the <<$thing part, and instead, after we configure fd to utf-8, write this value to fd.

I didn't test that, and I also suspect this may lead to deadlock if the stdin value is very big, unless we starts doing alternate non-blocking reads/write with that fd (e.g. if git processes each input line as it reads it, and has some output which it expects us to read, but we're currently writing, then git is blocked because we're not reading it, and we're not writing because git is blocked and doesn't read).

The deadlock scenario might be possible with << as well, but hopefully tcl knows to do that read/write dance on its own when using <<.

@avih
Copy link
Contributor

avih commented Feb 24, 2025

By the way, I know that at least one of the new convert from utf-8 does work. I also added an encoding wrapper similar to the open and exec wrappers, which also prints the before/after values, and this shows at the log:

::encoding( 8183) convertfrom utf-8 ããã.txt --> あいう.txt

and line 8183 is a bit different from gitk master due to the additional debug code I added.

In master it's at gettreediffline line 8125.

I didn't try to follow where this value come from, or how it works, but I do find it unexpected, because in my test script such conversion didn't work, at least not when the value was read from git's stdout.

For reference, here's my wrapper and dbg function, which I also use from the existing open and exec wrappers, to log the line from which it's called (the encoding wrapper doesn't pretend to cover all the possible use cases, but it does cover the cases in gitk, though it's probably missing uplevel):

proc dbg_lnum {{lvl 0}} {
    set line 1
    while {$line == 1} {
        incr lvl -1
        set line [dict get [info frame $lvl] line]
    }
    return $line
}

proc dbg {data {linelev 0} {title ""}} {
    if {$title == ""} {set title [dict get [info frame -1] proc]}
    set line [dbg_lnum [expr $linelev - 1]]
    puts stderr "$title\([format %5s $line]) $data"
}

rename encoding real_encoding
proc encoding {opt {enc ""} {data ""}} {
    if {$enc == ""} {
        set r [real_encoding $opt]
        # dbg [list $opt --> $r] -1
        return $r
    } else {
        set r [real_encoding $opt $enc $data]
        dbg [list $opt $enc $data --> $r] -1
        return $r
    }
}

@kkato233
Copy link
Contributor Author

kkato233 commented Feb 24, 2025

@avih @j6t

Can you take a look at the following commit code?

kkato233@1e72512

With this,

gitk あいう.txt
gitk ./あいう.txt
gitk .\あいう.txt
gitk ABC.txt
gitk ./ABC.txt
gitk .\ABC.txt

All of them now work as expected.

@avih
Copy link
Contributor

avih commented Feb 24, 2025

Can you take a look at the following commit code?

kkato233@1e72512

Interesting, where does this come from? It's from today, and the author is not you or me...

EDIT: oops, it is you! I thought this comment was by @j6t . sorry!

I did look at those exact places too, and even traced the escape thing back to the original reported bugs (not right now, yesterday), so I have ready opinions about these already:

-		lappend escaped [string map {\\ \\\\ "\ " "\\\ "} $path]
+		lappend escaped [string map {\\ "/" } $path]

This was the obvious alternative fix IMO too (are the quotes required for /?)

But the issue which this is supposed to fix, not sure if 100% correctly, is when invoked from a cmd.exe shell, where two things can happen:

  • Indeed backslashes are the default (and also what cmd.exe tab-completes for paths), where this patch is IMO much better then the original of escaping them for git better, and what I thought it should do too.
  • Apparently it tries to also fix the super weird cmd.exe quotes rules, supposedly under the assumption or knowledge (which I didn't check), that the CLI arguments which a tcl script sees are somehow not the correct argument value, but rather includes all or some remains of the cmd.exe weird quoting construction (imagine if your program saw quotes/escapes at the value which are result of how the user quoted it when they typed it at the shell). If that is indeed the case, then it's probably incomplete, because the second map pair is only one case of it. The general case needs to be handled iteratively and not with fixed replacements.

The examples you posted of the backslash-in-path don't excercise the second map pair, which I have to presume is there for a reason (not enough info at the bug report or commit message which added those. I don't have the link to the report, but it wasn't too hard to track it, starting with git blame).

One way to generate inputs which supposedly this tries to fix is to tab-complete in cmd.exe names with spaces and double-quotes (which are legal in file names) - in the begining/middle/end of path-components. Not sure which combos exactly, but to me it looks like this is what the 2nd map pair tries to address.

I know that this code is only used with paths, but the same issue might also happen with arbitrary user input (but I don't know whether gitk expects arbitrary CLI args which are not paths).

-                        --parents --boundary $args --stdin \
-                        "<<[join [concat $revs "--" \
-                                [escape_filter_paths $files]] "\\n"]"] r]
+                        --parents --boundary $args \
+                        [list $revs "--" [escape_filter_paths $files]]] r]

[...]

-                        --parents --boundary $args --stdin \
-                        "<<[join [concat $revs "--" \
-                                [escape_filter_paths \
-                                        $vfilelimit($view)]] "\\n"]"] r]
+                        --parents --boundary $args \
+                        [list $revs "--" [escape_filter_paths $vfilelimit($view)]] r]]

I didn't look into it very carefully, but superficially it looks to me a much better construction than the original of adding newlines escape sequences as a method to encode it in a way which the "eval" (or equivalent) at "open" would then "reverse". It also looks similar to how I constructed the arguments to open and exec in my experiments, which programatically handles the preparation instead of needing to care and think about how the construction would then be interpreted with "eval".

These are also the exact places which made me think how this handles spaces in names (which started when I noticed the args diff before/after the sanitation), and the reason I tested it, and that whole line of remaining issues with Unicode and stdin to git. (empirically it seems like spaces are handled correctly, or at least seem to work, but I was highly distracted by the stdin-to-git issue that I never got back to analyzing the handling of spaces).

The rest look fine. Somewhat familiar.

Correct me if I'm wrong, but I'm not seeing here a fix for stdin-to-git, right?

I.e. this part is still broken also with this patch, and possibly even more broken than before in some cases, due to this "full circle case, value to git's stdin is correct despite unicode value being incorrect", because now it's never correct (with unicode values), but previously it was correct in some or all cases (but of course issues elsewhere due to incorrect unicode).

@avih
Copy link
Contributor

avih commented Feb 24, 2025

Correct me if I'm wrong, but I'm not seeing here a fix for stdin-to-git, right?

Oops, I was wrong...

-                        --parents --boundary $args --stdin \
-                        "<<[join [concat $revs "--" \
-                                [escape_filter_paths $files]] "\\n"]"] r]
+                        --parents --boundary $args \
+                        [list $revs "--" [escape_filter_paths $files]]] r]

Actually, this looks wrong, and same for the other similar diff.

It might avoid the stdin-to-git issue altogether, which I noticed just now, but I did check that too yesterday, and it was added for a reason - because it ran into limitations of command-line length, so this patch would bring back that limitation.

IIRC It was added originally at git-for-windows, so I can only imagine this limitation was observed on windows. No idea how much it holds true today, or what the limit is on supported platforms, e.g. it's possible that it was added to address a Windows XP limitation, but now only win10+ is supported...

Like the other commit message or bug report, IIRC there are no test cases to reproduce the issue which this is supposed to solve...

@kkato233
Copy link
Contributor Author

@avih @j6t

check next commit kkato233@c43098b

This is a pattern where the temporary file is explicitly created in UTF-8.
Since command parameters are not used, additional escape processing is unnecessary.

@avih
Copy link
Contributor

avih commented Feb 24, 2025

This is a pattern where the temporary file is explicitly created in UTF-8.

I think it can work.

Just ensure that the file is not deleted until after we're read all the output we need from open or exec (this might already be the case, I didn't follow the code closely, but it is something to ensure).

Personally I'm not fond of creating temporary files routinely only to bypass the encoding issue, which became a problem only when the patch landed to workaround command line limitation issues.

First, we don't know whether the original command length limit still exists today. If it doesn't, then we can revert the stdin patch and this problem goes away (but we still need to address the stdout-from-git issue).

Also, because setting the system encoding solves both these problems much more elegantly.

As I noted earlier, modern portable programs on windows use utf8 input/output by default, because assuming streams are whatever default ANSI codepage is lossy-by-definition. It's not strictly a bug, but I think it was a wrong design choice in tcl. Luckily, it seems there's a solution with one line at the begining of the script.

This is even more true with gitk, because git is documented officially to use utf8 input/output, and that's the main process which gitk interacts with.

But another issue is that the tcl docs seems generally against changing the system encoding. From my few experiments, I couldn't identify an issue on Windows when the default encoding is utf8. This includes things which the tcl docs says are affected by this, but I don't pretend to know or having tested the full potential scope of it.

So my assessment is that changing it to utf8 by default on windows is 100% net win, and the docs "warning" shouldn't actually include window, or maybe even say the oposite - that it's recommended on windows, but I can't find any data sources to confirm or disprove that, and I'm not going to start reading tcl source code for that.

Even so, IMO we should change the default encoding (windows only), maybe remove all the explicit utf8 conversions (though they might also be useful on other platforms), and wait for relevant bug reports. I think git-for-windows has enough gitk users that if this causes problems then it will come up hopefully sooner rather than later, and then we can re-assess it.

But my hunch is that, up to leftover bugs, this would be a net win.

@avih
Copy link
Contributor

avih commented Feb 24, 2025

By the way, I know that at least one of the new convert from utf-8 does work.

Few things:

  1. That utf8 conversion at line 8125 is still required, or at least seems to convert correctly, even with the global encoding set to utf8.
  2. I identified one more place where the explicit conversion apparently does the right/needed thing even with the encoding set to utf8 (upstream gitk line 8270, see below).
  3. I didn't identify places which the global utf8 encoding breaks with an additional explicit conversion (might seem obvious, but good to know).
  4. Not all the explicit utf8 conversions lines show at the log.

Here's the part of the log which contains the conversions (extracted as a full block, so not all lines are encoding-related). The first line number is in my modified code, the second number is in gitk upstream (8183 in my code is 8125 in gitk master, etc).

This is with global encoding set to utf8, so supposedly places which convert from bad form to good form are indeed required even in this case, like the first line:

::encoding( 8183 (8125)) convertfrom utf-8 {spaces ããã.txt} --> {spaces あいう.txt}
::exec(12503 (12445)) git check-attr encoding -- {spaces あいう.txt}
::open( 8273 (8215)) {| git diff-tree -r -p --textconv --submodule -C --cc --no-commit-id -U30 --root 4baac406841cb30dcdbb05acccc0556f5052c154 -- {spaces あいう.txt}} r
::exec(  655 (597)) taskkill /pid 19416
::open( 5586 (5528)) {|git diff-files -- {spaces あいう.txt}} r
::encoding( 8328 (8270)) convertfrom utf-8 {spaces ããã.txt} --> {spaces あいう.txt}
::encoding( 8511 (8453)) convertfrom utf-8 {new file mode 100644} --> {new file mode 100644}
::encoding( 8511 (8453)) convertfrom utf-8 {index 0000000..d7732a5} --> {index 0000000..d7732a5}
::encoding( 8452 (8394)) convertfrom utf-8 {@@ -0,0 +1 @@} --> {@@ -0,0 +1 @@}
::encoding( 8516 (8458)) convertfrom utf-8 +xxxspaces --> +xxxspaces

Note the last one with xxxspaces, which looks weird. There's no file name or file content or commit message which includes this string as far as I can tell, so I can't guess where it comes from (there is one file name which includes "spaces", and one file name which is "xxx", but not the string "xxxspaces").

EDIT: opps, the file spaces あいう.txt does include "xxxspaces" as its content, but my search program probably failed to open this file because it doesn't handle unicode file names correctly... (note to self, fix it or find something better - ag, but then I tried grep (in unicode busybox-w32), which did find it...).

EDIT2: also accidentally open still had two printouts - before and after the args sanitation, so I removed the second one in the two open calls at this log.

Also note that some of the conversions are apparently of content where encoding doesn't matter, but maybe there can be other content at these lines which does require conversion.

Unrelated, it might be useful to add some debug option to gitk, along the lines of what I posted earlier, but only enabled with some CLI option, like -v or some such (and move the line number to the begining). This output log is very very useful, and it can also be attached to bug reports to get a better overview which may help analyze the issue better. Maybe such debug log output should also include some values like platform, tcl version, git version, etc.

Also unrelated, one thing which global utf8 does not fix is literal utf8 strings at the script source file, which most editors would display correctly when editing the source code. These appear to always be interpreted as the system ANSI codepage, even if setting the global encoding comes before these literal strings at the source code. So there's that too... (gitk probably unaffected, but tcl scripts in general probably are)

@avih
Copy link
Contributor

avih commented Feb 25, 2025

It might be possible to use the same approach like the patch fixes git's stdout, but also for stdin, because I believe the fd from open is bidirectional, so we can run the open command without the <<$thing part, and instead, after we configure fd to utf-8, write this value to fd.

I didn't test that, and I also suspect this may lead to deadlock ...if the stdin value is very big ...

So this is a problem on several levels:

Confirmed potential deadlock with this approach and big stdin:

set x [exec cat gitk]
set fd [open "| cat" r+]
puts $fd $x
# unreachable due to deadlock in puts, but writing a short value does work

Another issue is that cat does not exit until we close it's stdin, but with open we only have one fd - which we can't close before reading from it..., so even if we write a short value (which doesn't deadlock), and then try read, then it deadlocks again, because after we drained cat's stdout buffer, it now waits for more stdin, so it doesn't exit and doesn't close its stdout, and read never completes.

The deadlock scenario might be possible with << as well, but hopefully tcl knows to do that read/write dance on its own when using <<.

Confirmed. Both issues are avoided using the << immediate stdin:

set x [exec cat gitk]
set y [exec cat <<$x]
puts "(($y))"

or

set x [exec cat gitk]
set fd [open [list | cat <<$x] r]
set y [read $fd]
puts "(($y))"

which prints the gitk source code inside ((...)), but now we can't configure the << stream to utf-8...

@avih
Copy link
Contributor

avih commented Feb 25, 2025

But another issue is that the tcl docs seems generally against changing the system encoding. From my few experiments, I couldn't identify an issue on Windows when the default encoding is utf8.

So, first, this is what tcl 8.6 docs say about changing the system encoding, which is not much:

Set the system encoding to encoding. If encoding is omitted then the command returns the current system encoding. The system encoding is used whenever Tcl passes strings to system calls.

Then, there's more info at the wiki here, adding a "glob" example, the fact that it's a per-process thing, and that it's not recommended:

Gets or sets the system encoding to encoding. If encoding is omitted, the command returns the current system encoding. The system encoding is used whenever Tcl passes strings to system calls, e.g. open, or retrieves them from system calls, e.g. glob. It is also the default encoding for newly created channels.

The system encoding is a process-global resource: It is shared among all interpreters in the current process, and can be changed by any of them.

You should not normally have to set the system encoding. If Tcl is getting its initial guess wrong, this is a bug and will impact on most system calls.

Then, I talked to a very kind person at the tcl IRC channel, which seems very knowledgable (I'd guess he's a tcl developer), which said few interesting things:

  • The main reason for tcl defaulting to the windows ANSI codepage is because indeed this is the only explicit global encoding which windows has. They can't use something else because there's nothing else applications can agree on without some prior knowledge or negotiations. Luckily, with git we do have this prior knowledge - git uses utf8.
  • Changing the system encoding indeed doesn't really affect "system APIs" on windows because indeed tcl uses the wide APIs on windows. I did also ask specifically about glob, and the answer was that this also uses the W APIs, so unaffected on windows. This is what I also observed as well.
  • He's not sure whether things are effected negatively, and not sure what the exact scope of it is on Windows. He checked the source and concluded that something with temp files and something with host name might be affected by encoding configuration. I can probably test temp files, but I don't think gitk cares about the hostname.
  • He said that tcl 8.6 may or does have issues with non-BMP codepoints (U+10000 or more), like emojis, and that this is fixed with tcl 9. I can test emojis with gitk.
  • To avoid the deadlock and other issues with open and configurable fd, he suggested to use chan pipe, which apparently can have two fd's (which we can configure, do non-blocking read/write dance, and close one while still reading from the other). I did notice that earlier at one of the wiki pages, but didn't look into it so far.
  • He said that exec ... <<$thing and similarly open use temp files for the << stream. This makes sense to avoid the deadlock issue, and also means that the patch by @kkato233 to use a temp file is not necessarily worse that what tcl does internally (I thought it's just do non blocking read/writes, but I guess a temp file works too).
  • Unrelated to gitk, about literal utf8 strings at the source code, he said that tcl 9 switched to interpreting the script source as utf8. tcl 8 interprets them as the default system encoding - before it starts to run the code, so changing the system encoding has no effect on that.

For me, also inportant, that despite the apparent knowledge, and except IPC streams which he knew of (and which for us get fixed by switching globally to utf8), there was nothing obvious he could think of which would break due to this. But again, that's not a promise, just off the top of his head.

So I think this is largely encouraging about changing the default on windows, and we should do that, and see what happens, and re-assess it if we get bug reports.

@avih
Copy link
Contributor

avih commented Feb 25, 2025

::encoding( 8183) convertfrom utf-8 ããã.txt --> あいう.txt

I didn't try to follow where this value come from, or how it works, but I do find it unexpected ...

This (encoding convertfrom utf-8) works when the value is read from an fd with fconfigure $fd -encoding binary, which indeed I didn't test initially. So now we know. There are 3 such fconfigure at the code.

As noted initially, this conversion doesn't work if the value was read with the ANSI (default) encoding.

If the entirety of these 3 binary streams is always utf8, then it feels more correct to configure the fd as utf8 instead of binary, and then the values are unicode-correct at an earlier stage of the data flow, and the later conversion to utf8 can be dropped - possibly in more than one place which deals with data from these streams.

But this is a real "if". I think it's also possible that the paths parts are decoded as utf8, while other parts are decoded as something else ($diffencoding?).

tcl 8.6 may or does have issues with non-BMP codepoints (U+10000 or more), like emojis, and that this is fixed with tcl 9. I can test emojis with gitk.

I did a quick test with a non-BMP emoji, and it shows correctly at the logs/diff when gitk is invoked without arguments, but, with the file name as argument then, like before, if bumps into the "sending junk to git's stdin" issue, and the global utf8 encoding fixes that too.

@kkato233
Copy link
Contributor Author

@avih

$diffencoding

from

git config --get gui.encoding

This is the text file's content encoding.
It is for displaying the text file in the preview area.

@dkfellows
Copy link

As the "very knowledgeable person" alluded to above (and on behalf of @tcltk, of whom I'm an admin) it's pretty accurate. Tcl's been strict about using the ...W API on Windows for a long time with a very few exceptions, only one of which relates to how data is passed to pipelines in a heredoc-like way. We know about many issues relating to exec; the problem is that there's too many compromises involved in producing something that works (these also apply when passing binary data through pipelines).

Possible actions

I advise either:

  1. Changing immediately before calling exec of the program that has the different encoding understanding, and then changing back immediately after. (Switching encodings is cheap if you're not multi-threaded.)
  2. Otherwise, write the data you want to a file (over which you have total control of encodings) and feed that in. That's how these heredoc-like things work behind the scenes. You can use file tempfile to help (which might be using some of the same machinery that exec uses).
  3. A third option involves passing the information via a pipeline created with chan pipe though I'm not sure how well that will work on Windows; I've simply never checked!

I also advise putting all this machinery inside a procedure so that the rest of your code can pretend it doesn't exist. You might already be doing this, and if you are then I approve strongly. (I've not looked. 😉)

Other notes

If you have problems with emoji, the only supported fix will be upgrading to Tcl 9.0. While fixes were proposed for the 8.* series, we've decided to not release an 8.7 to allow us to focus our development efforts. (The changes were large enough that there was no reasonable prospect of backporting to 8.6.)

Don't forget to seek support

Specific problems should be raised either as tickets (or for tk) or reported to the mailing list. When doing so, please mention that this is in relation to gitk; it's a known software project that we want to support.

@avih
Copy link
Contributor

avih commented Feb 26, 2025

I advise either: ...

Thanks for the comment.

These suggestions are valuable.

I know it's not the only suggestion, but changing encoding right before/after is obvious with exec, but less so with open, because the stream is processed by the calling code, and there's not obvious place (to me) to restore the encoding "after" while keeping it encapsulated in a way or place which the caller would not "see".

But even if there is, the caller does work before the stream is closed, and this work would happen with the encoding still utf8 - which is what this solution was hoping to avoid.

It could of course be refactored to not do any work before the whole stream is read and the encoding restored, effectively becomes exec, and it needs a refactor...

But more than everything, all of these are potential solutions which do look potent, but IMO only after we know there is a problem with the seemingly perfect solution of changing the global encoding once to utf8 on windows.

Obviously we don't know whether there are issues or not, and unfortunately it's also not easy to find an official answer other than source code analysis of the tcl itself.

But so far, in all my testing (which I don't presume cover everything), I couldn't identify even a single issue...

EDIT: or to be more specific, I couldn't yet identify a thing in tcl which behaves differently depending on global encoding on windows, other than streams - which is exactly the thing which we want to change to utf8 by default.

@avih
Copy link
Contributor

avih commented Feb 26, 2025

The suggestion:

  1. Changing immediately before calling exec of the program that has the different encoding understanding, and then changing back immediately after. (Switching encodings is cheap if you're not multi-threaded.)

My reply:

changing encoding right before/after is obvious with exec, but less so with open, because the stream is processed by the calling code ...

Hmm, it might actually be as simple as that, because apparently an fd which open returns when the system encoding was utf8 remains like that even after the global encoding is restored. So it's apparently not necessary to wait with the restore until after the fd was drained.

I also tested that [chan configure $fd] after the system encoding was restored shows that it's indeed configured as utf-8.

I also tested reading ~ 100K output from such fd after the system encoding was restored, and it remained utf8 for the whole read, so it's not that the first BUFSIZE happened to be interpreted correctly before the encoding was restored. It actually stays like that for as long as we have use for that (but can still be modified with fcondigure if needed).

Here's a patch for that, on top of gitk master (without any other patch which was suggested previously), which currently modifies only open and exec on Windows, and apparently fixes everything... (are there other commands in gitk which use streams?):

diff --git a/gitk b/gitk
index bc9efa1..0aef523 100755
--- a/gitk
+++ b/gitk
@@ -40,6 +40,19 @@ proc is_Cygwin {} {
 	return $_iscygwin
 }
 
+# run $cmd list with a specific global encoding set for the duration of $cmd.
+# useful for using with exec or open etc to get utf8 streams with git when the
+# system encoding is not utf-8, like windows, but possibly elsewhere too.
+# an fd which "open" returns remains configured as $enc even after restoring
+# the global encoding to something else, but fconfigure can still modify it.
+proc run_enc {cmd enc} {
+    set orig [encoding system]
+    encoding system $enc
+    set r [uplevel 1 $cmd]
+    encoding system $orig
+    return $r
+}
+
 ######################################################################
 ##
 ## PATH lookup
@@ -126,7 +139,11 @@ proc exec {args} {
 		}
 	}
 	set args [sanitize_command_line $args $i]
-	uplevel 1 real_exec $args
+	if {[is_Windows]} {
+		run_enc [list uplevel 1 real_exec $args] utf-8
+	} else {
+		uplevel 1 real_exec $args
+	}
 }
 
 # Override `open` to avoid unsafe PATH lookup
@@ -139,7 +156,11 @@ proc open {args} {
 		set command_line [string trim [string range $arg0 1 end]]
 		lset args 0 "| [sanitize_command_line $command_line 0]"
 	}
-	uplevel 1 real_open $args
+	if {[is_Windows]} {
+		run_enc [list uplevel 1 real_open $args] utf-8
+	} else {
+		uplevel 1 real_open $args
+	}
 }
 
 # End of safe PATH lookup stuff

I also tested inside gitk itself to run the same stdin into od -c or xxd instead of git, which now shows that they indeed see the correct utf8 byte values, while previously (the patch which only fixes one fd at line ~ 12700) the stdin to git was broken and meaningless when unicode content was invoked.

One important thing though, is that original suggestion also included:

Switching encodings is cheap if you're not multi-threaded

And I don't know whether gitk is multi-threaded...

@dkfellows many thanks yet again. As far as I can tell this is ideal...

Would still be nice to know whether global utf8 encoding on windows changes anything other than streams, but at least we now hopefully have a good generic solution in case the "here be dragons" documentation for system encoding is not wrong... (but my assessment for windows is still that it's wrong).

@j6t
Copy link
Owner

j6t commented Feb 26, 2025

@dkfellows Many thanks for taking the time to propose solutions. @avih thank you again for the extensive testing.

If I understand the concept of the system encoding correctly, the warning not to change it is based on the assumption that the system encoding is the "language" in which Tcl communicates with the environment. This "language" is dictated by the locale on Unix and the ANSI codepage on Windows. If it is changed, the Tcl script talks in the wrong way to the world.

Is this reasonable or am I way off the truth?

In the case of Gitk, I have to wonder: Whom do we actually talk to and in what language? We talk to git exclusively (perhaps except for command line arguments), and on Windows git has long settled that its input and output is UTF-8 (with exceptions that are not yet relevant here). For this reason, I begin to think that setting the system encoding to UTF-8 once and for all should be the right thing to do, at least on Windows.

Or is this the wrong conclusion?

@avih
Copy link
Contributor

avih commented Feb 26, 2025

+proc run_enc {cmd enc} {
+    set orig [encoding system]
+    encoding system $enc
+    set r [uplevel 1 $cmd]
+    encoding system $orig
+    return $r
+}

A slightly different alternative to this implementation, is to grab the default encoding once on startup, and use that when restoring the original encoding instead of capturing this value at the begining of this function.

Because this implementation can result in permanently changing the default encoding if this functions is called from two threads concurrently (thread A changed to utf8 and runs exec. before it completes, thread B calls this function and incorretly sets $orig to utf8 - because the encoding is per process and not per thread, thread A then completes and restores the global encoding, and then thread B compltes and "restores" the global encoding to utf8, and that's how it will stay for the rest of the runtime of that script).

The suggested alternative is still not thread safe (and maybe that can be addressed too), but at most it would break during concurrent calls from two threads, but without a lasting side effect like the original implementation quoted above.

@kkato233
Copy link
Contributor Author

@avih @j6t

Please refer to commit
kkato233@da4be8d

Changing immediately before calling exec of the program that has the different encoding understanding,
and then changing back immediately after.

This resolves issues in
Case 1
and
Case 2.

@avih
Copy link
Contributor

avih commented Feb 27, 2025

Please refer to commit
kkato233@da4be8d

I didn't try it, but I'm guessing it works.

The hard part here IMO is/was identifying the exact problems and which approaches can address them, and then taking a good decision.

I think by now we've identified many issues, which are hopefully all the issues in the context of unicode on windows, as well as few potential approaches and specific solutions.

Currently IMO the decision seems between global utf8 encoding once for the whole script, or not.

If the global approach decision is taken, then the patch is one line.

If not, then there are several alternatives which were suggested here in terms of patches and data points, starting from handling explicitly only the cases which need a solution at one end, to the other end which makes exec and open utf8 by default without changing the global encoding or code in any other parts of gitk.

As much as I do have a preferred solution, which I've already stated, the decision is by @j6t so let's wait for that, and take it from there once a decision is made.

Either way, I don't think writing the actual code should be particularly challenging. It's just implementation details.

@avih
Copy link
Contributor

avih commented Feb 27, 2025

Just as another data point, tcl 9 apparently switched to utf-8 system encoding by default on windows, instead of the ANSI codepage, at least on win10+.

This doesn't solve any streams interoperability issues which the default ANSI encoding supposedly solves, and talking to an app which expects some ANSI codepage in its stdin would be broken (because tcl would output utf8, but the other app didn't suddenly start understanding utf8).

It effectively pretends that the global windows system encoding is utf-8, and runs with that (it uses the utf8 manifest, which I mentioned at the bottom of this post)

It still uses the Unicode windows APIs like before, and not the char plain C APIs, so all the system APIs which were unaffected by the tcl encoding in tcl 8.6 are equally unaffected by the encoding in tcl 9, but the main thing which does get affected is that now streams, which ask "what's the system encoding", get a reply "utf-8", and so tcl converts unicode to utf8 in streams.

Note that it's possible that tcl uses internally the char APIs in some places, and these would be affected by the utf8 manifest, because these char APIs indeed become utf8 in tcl 9, rather than ANSI in tcl 8.

I've been looking at the tcl source code, but so far I couldn't identify what they are (the char APIs which tcl uses on Windows), but as noted above by a tcl member, there should be extremely few places where these are used, if at all. The main thing which is affected by this "switch" is that fake reply that the windows global ANSI encoding is utf8.

Personally I think it's another argument to change the global tcl encoding ourselves once and for all, just like tcl 9 decided to fake the windows global encoding.

utf8 manifest in tcl9: https://github.com/tcltk/tcl/blob/7532f625dd115d13de41c9e696cdf6f1124b270c/win/tclsh.exe.manifest.in#L40

The init code which decides what the system emcoding is, using GetACP() (active code page), which normally returns e.g. cp1252, but with the utf8 manifest it returns CP_UTF8: https://github.com/tcltk/tcl/blob/7532f625dd115d13de41c9e696cdf6f1124b270c/win/tclWinInit.c#L401

The init code has not changed between tcl8 and 9. only the utf8 manifest which changes the reply from GetACP, and which makes the char APIs utf8 (and which tcl doesn't use or barely uses).

Note that the utf8 manifest can be attached to tcl 8.6 pre-built binaries, and doesn't require re-compile, with identical effect (the char APIs become utf8, and GetACP returns CP_UTF8, and tcl streams would be utf8 by default out of the box on Windows). And same with any other application. It's a cheap method to make existing legacy binaries suddenly talk utf8.

It's just that the default ./configure && make doesn't do that with tcl8, and the tcl8 manifest file doesn't include the utf8 line, and so both my own-built binary, and the tclsh and wish in git-for-windows still default to the single-byte ANSI codepage... and a "this system encoding is cp1252" reply.

The manifest works on win10 1903+, has no effect in win7/8, and makes the app not run at all on winxp (assuming it did run on xp before the manifest was attached).

This is also the main "trick" which busybox-w32 does in its unicode build (in addition to other things, like custom TTY interaction for unicode, enabling utf8-aware line editing, and few more, but the main thing is the manifest, because it does use the char APIs for the most part internally).

@j6t
Copy link
Owner

j6t commented Feb 27, 2025

@kkato233

Please refer to commit kkato233@da4be8d

Thank you. But the patch mixes two issues. The undisputed issue is that lines containing file names are not read as UTF-8 from diff -cc. Please make a separate patch with these changes. We will need this patch in any case regardless of how we solve the other issue. If you need help with writing a good commit message, let me know.

The other part of your patch is a good solution if we do not switch to UTF-8 globally.

@j6t
Copy link
Owner

j6t commented Feb 27, 2025

@avih

Just as another data point, tcl 9 apparently switched to utf-8 system encoding by default on windows, instead of the ANSI codepage, at least on win10+.

Yeah, that's an interesting twist. I begin to like setting the encoding globally.

@avih
Copy link
Contributor

avih commented Feb 28, 2025

This is a refinement of the patch here, still in case the decision is not to change the global encoding once and for all, and still on top of gitk master, with the following differences:

  • Bugfix with uplevel level polarity (got confused by the reverse polarity of info frame), though this didn't have a negative effect in gitk that I could tell.
  • This patch doesn't touch existing code at all. The exec/open utf-8 wrappers are independent of the existing wrappers (args sanitation), and make it easier to add more wrapped commands which gitk might need in the future.
  • Still not thread-safe, because as far as I can tell gitk is not multi-threaded. Also, as far as I can tell a default build of tcl/tk is without threads support to begin with.

For clearer code construction, the additionl wrapper actually adds two levels of wrappers and not one, because the wrapper is generated programatically, and it's clearer when the generated code is small, and simply calls an existing "normal" procedure which does the actual work (same run_enc from the previous patch).

So overall, open/exec now have 3 levels of wrappers. the previous patch above has 2, gitk master has 1 wrapper level, and a week ago it had 0 wrappers.

Performance impact is within the noise level as far as I can tell, and over a single gitk run with few dozens open/exec, I couldn't measure a difference between gitk master and with this patch (all runs with/without the patch took ~ 2.4 s +/- ~ 100 ms).

diff --git a/gitk b/gitk
index bc9efa1..97116db 100755
--- a/gitk
+++ b/gitk
@@ -40,6 +40,47 @@ proc is_Cygwin {} {
 	return $_iscygwin
 }
 
+######################################################################
+##
+## Windows: Use utf-8 streams by default, because git outputs/reads UTF-8
+##
+## Streams are affected by the tcl system encoding, but System APIs on Windows
+## are normally not, with unclear exceptions. So instead of changing the global
+## tcl encoding once and live in fear, change the encoding to utf-8 right
+## before commands which create streams, and restore it right afterwards.
+## The streams remain configured as utf-8 even after the encoding is restored,
+## but fconfigure can still change that normally.
+
+# Set system encoding to $enc, run $cmd list, restore system encoding.
+# $cmd runs at uplevel $level relative to the caller (don't use #VAL).
+# performance impact is typically at most negligible.
+# Note: not thread-safe: system encoding is per-process, not per thread.
+proc run_enc {cmd enc {level 0}} {
+	set orig_enc [encoding system]
+	encoding system $enc
+	set r [uplevel [expr {$level+1}] $cmd]
+	encoding system $orig_enc
+	return $r
+}
+
+# make command $cmd_name execute with system encoding $enc from here on
+# e.g.
+#   wrap_cmd_enc foo utf-8
+# does:
+#   rename foo enc_wrap_orig_foo
+#   proc foo {args} {run_enc [concat enc_wrap_orig_foo $args] utf-8 1}
+proc wrap_cmd_enc {cmd_name enc} {
+	set orig_name enc_wrap_orig_$cmd_name
+	rename $cmd_name $orig_name
+	proc $cmd_name args "run_enc \[concat $orig_name \$args\] $enc 1"
+}
+
+# we need utf-8 with exec's <<... and open's fd-ret-val and <<...
+if {[is_Windows]} {
+	wrap_cmd_enc open utf-8
+	wrap_cmd_enc exec utf-8
+}
+
 ######################################################################
 ##
 ## PATH lookup

@avih
Copy link
Contributor

avih commented Mar 2, 2025

I did a quick test with a non-BMP emoji, and it shows correctly at the logs/diff when gitk is invoked without arguments, but, with the file name as argument then, like before, if bumps into the "sending junk to git's stdin" issue, and the global utf8 encoding fixes that too.

There might be some inaccuracies below, as I only learned of this recently.

TL;DR: since tcl 8.6.11 there's some support for non-BMP codepoints, in the sense that they most probably don't break when processed as whole strings (and maybe they did always break in 8.6.10).

But string length $x on a non-BMP codepoint curently returns 2, because internally it translates it to a UTF16 surrogates pair. But when this string is used with system APIs or streams, then it's translated back to the correct non-BMP UTF-8.

However, iterating a string "chars" using split $x "" should work, because split knows to not break these pairs in the middle (so a single non-BMP codepoint would result in one element from the split, despite its length being 2).

There are likely several things broken or unexpected beyond the length, and I don't know the exact scope of it. I'd imagine tcl doesn't make too many promises on what should or shouldn't work in 8.6.11+.

This is roughly analogous to how an old C program deals with UTF-8 input. It can work with the strings, it can tokenize based on ASCII chars (like space, etc), but not everything works as expected when trying to process individual "chars".

(but I'm guessing the 8.6.11 developers had to put in considerable effort to make that analogy generally transparent)

Tcl 9 supposedly doesn't have these issues, but If I understand correctly, the 8.6.11+ versions should still deal with them reasonably as long as one doesn't look too closely into these strings. I'd guess tokenization based on ASCII chars (and maybe any BMP codepoint) would still work.

This is the same on unix/linux and Windows, with the normal difference that on windows that final utf8 string is translated to a wchar_t string when using system APIs, but stays UTF-8 in streams, assuming the stream is configured as utf-8.

@avih
Copy link
Contributor

avih commented Mar 2, 2025

(and maybe they did always break in 8.6.10)

So I found an old 8.6.7 build of mine for windows, and did some experiments. My estimation based on these is that non-BMP codepoints in utf-8 input are completely broken. They probably translate to junk internally (judging by their bytelength - 8, and string/split length - 4), and observably don't translate back to the original non-BMP utf-8 input in utf-8 output streams, and likely the same with system APIs. This would effectively make 8.6.10 BMP-only as far as utf-8 is concerned. If I understand correctly, this is also the same on unix/windows, but I didn't try 8.6.10 on linux.

In gitk, non-BMP filename argument are broken with this 8.6.7 tcl/tk build (with my open/exec utf8 patch), with the emoji in the utf8 stream to git being 8 bytes instead of 4. This is most probably unfixable. Without specific arguments to gitk, the diff and filename do show at the bottom diff part, but the name is displayed incorrectly, again, probably unfixable with this version. I'm guessing it shows without argument because gitk only reads git's output in this case, without needing to communicate specific non-BMP names (but which it does need to do with a filename argument).

The gitk -- -- NAME trick doesn't work either. In this case, the emoji translates to 6 bytes in the utf-8 stream to git, which are the windows UTF16 surrogates pair, translated naively to two utf-8 "codepoints", and which I presume git does not identify as matching this file name.

The gitk -- -- NAME trick also doesn't work when changing gitk to use this name as direct argument to git, rather than in its stdin stream (removing --stdin etc, and confirming that it does work with plain names).

So I'm guessing this makes non-BMP codepoints broken in 8.6.10 also when used as arguments to the script even on windows (and then using them with system APIs, etc), and not only with streams.

It does feel unexpected to me, because I would have guessed that it should translate from windows UTF16 to tcl's BMP-only utf-8 without data-loss, and same in the other way back into windows wchar_t, but I didn't explore this further. The first part (win utf16 to tcl's utf8) presumably does work, judging by the 6-bytes utf8 "pair" mentioned above, so it's not obvious to me why it breaks the other way (when using it as direct argument to git). But IMO it's not worth exploring this version further.

EDIT: the arguments-to-script not working seems to be specific in gitk, and indeed probably not in general. Because gitk uses the file name first as direct argument to git - which does seem to work, but then grabs it from the utf8 output from git's reply, and here it blows up because tcl 8.6.7 can't digest non-BMP in utf8 inputs. Then, when it tries to use it again (even as argument), it's junk value which git (or windows APIs) can't restore to the original utf16 or utf8 value.

So on windows, non-BMP in CLI arguments to scripts should work with windows APIs (and probably file names, etc), but still break when trying to digest or output correct utf-8 in streams.

@avih
Copy link
Contributor

avih commented Mar 4, 2025

Can you take a look at the following commit code?

kkato233@1e72512

This is what escape_filter_paths does in gitk master for each element at the list:

lappend escaped [string map {\\ \\\\ "\ " "\\\ "} $path]

And this is the gist of the patch suggestd above (which also touches escape_filter_paths , and which seems OK, other than bringing back the CLI limit described below):

-                        --parents --boundary $args --stdin \
-                        "<<[join [concat $revs "--" \
-                                [escape_filter_paths $files]] "\\n"]"] r]
+                        --parents --boundary $args \
+                        [list $revs "--" [escape_filter_paths $files]]] r]
-                        --parents --boundary $args --stdin \
-                        "<<[join [concat $revs "--" \
-                                [escape_filter_paths \
-                                        $vfilelimit($view)]] "\\n"]"] r]
+                        --parents --boundary $args \
+                        [list $revs "--" [escape_filter_paths $vfilelimit($view)]] r]]

But I misunderstood what escape_filter_paths does, and also why.

It's definitely not to address quoting issues in cmd.exe (and most likely there are no such issues), and also git doesn't have issues with backslash in filename, neither as arguments nor in its stdin.

Initially revs/files were passed as normal (CLI) arguments to git - which is also what the patch wanted to bring back. This was 100% good code.

But it apparently bumped into command line length limit with many files and/or revision arguments, and therefore this was changed to send the revs/files to git via its stdin using the <<... argument to open (at bb5cb23), and escaped using escape_filter_paths at 7dd272e (apparently in the same patchset), which is one reason why the encoding now needs care on windows - because it becomes a stream instead of arguments (and which this patch "fixes", but brings back the CLI limit).

Regardless of encoding issues, this stdin now becomes another value at the argument to open, and as noted at #7 (comment) , the arguments to open need special care and attention, because internally open performs eval, and all sorts of bad things can happen if the arguments are not quoted correctly. This is the same reason that eval is generally discouraged also in places which are not tcl.

However, these commits didn't take everything into account. Instead, they only took care of 3 things:

  • Newlines at the value, because the stdin value to git is newline-separated. This is addressed in the first commit by joining the items using \\n (literal tcl escape sequence) instead of normal newline (\n), so that after eval it becomes normal \n as part of the value, which tcl indeed applies as newline embedded at the value.
  • Backslashes at the value. This is addressed at the second commit at escape_filter_paths. Similar to the literal \\n escape, it replaces \ with \\, because we need it literally, and otherwise the eval in open would apply it as an escape sequence. For instance a file name of foo\bar would be applied exactly like the C-string foo\bar where \b is the backspace control character, so it becomes literal backslash by doubling it before eval.
  • Spaces at the values, also addressed at the second commit at escape_filter_paths by changing space into "\\ " (literal-backslash and space, which also has a redundant \ in both "\ " and "\\\ ", but it doesn't change the result). This makes it a normal \ after the eval, which is interpreted as space embedded at the value, instead of splitting the arguments. For instance, if $revs is one item FOO, and $files is one file "BAR BAZ", then it should do the equivalent (in shell) of printf "FOO\n--\nBAR BAZ" | git ... --stdin, but without escaping the space it would do the equivalent of printf "FOO\n--\nBAR" | git ... --stdin BAZ.

Hopefully needless to say, there are many more ways to escape an eval other than newlines, backslashes, and spaces, and which these commits don't take into account. At the very least also tab (which has the same effect as space, but it's currently not escaped), and indeed a filename with tab added to git, and then running gitk with this name as argument results in an error (only tried on linux)

But it likely doesn't end there. Think what various other things can do at a value which open eval's (e.g. the file name), and which I'm refraining from mentioning explicitly. On linux, while I was not able to make gitk master do something I wanted via a file in git, because it broke gitk elsewhere before it reached this open, and it's also not really my thing and I didn't try too hard, I'd guess it is possible to reach this open, and then the sky is the limit.

This is also the reason why $args print differently before and after the sanitation which was added to open and exec, as mentioned at #4 (comment) . That PR probably didn't intoruce a bug, and it effectively probably doesn't touch anything other the command name (but I did not review it), but that difference did indicate that something fishy might be going on with the arguments value which it processes (and why I thought it might break with space, but it worked, and I missed that it's addressed at escape_filter_paths, and left it for another time).

The solution is quite simple, and was explained at #7 (comment) : any value which we want to preserve unmodified after eval (in this case which open does) should be encapculated as a list item.

(or all the command + arguments as one flat list - but which we can't do in this case, and also, tcl 8.5 introduced the {*} arguments expansion to avoid some evals, but here we can't avoid the eval anyway because this is how open works, and also presumably gitk supports 8.4 too).

So there's no need for escape_filter_paths at all. Instead, the code should simply encapsulate the stdin argument as a list element ($revs and $files are lists):

[list <<[join [concat $revs -- $files] \n]]

instead of the current code which does this (and a similarly at the other place with <<...):

"<<[join [concat $revs "--" [escape_filter_paths $files]] "\\n"]"

@j6t
Copy link
Owner

j6t commented Mar 4, 2025

@avih

So there's no need for escape_filter_paths at all. Instead, the code should simply encapsulate the stdin argument as a list element ($revs and $files are lists):

[list <<[join [concat $revs -- $files] \n]]

instead of the current code which does this (and a similarly at the other place with <<...):

"<<[join [concat $revs "--" [escape_filter_paths $files]] "\\n"]"

Wow, nice catch! Do you have a patch that fixes this? It is unrelated to the topic of the current issue and would have to be a new PR.

@avih
Copy link
Contributor

avih commented Mar 4, 2025

Do you have a patch that fixes this?

Well, this is already half of the patch. The other half is the other line where <<... is used with file names.

I also searched the code for << and there's another line which does the same trick with \\n before it's used with open:

set cmd [concat $cmd --stdin "<<[join $ids "\\n"]"]

But assuming $ids are trusted git hashes, then it's not broken, even if not elegant.

With file names it's different, because they're arbitrary and come from git itself, or as CLI arguments to gitk, and it's very not-elegant to try to escape all the possible things, when a list is way simpler and should be bullet proof.

It is unrelated to the topic of the current issue and would have to be a new PR.

Yeah, not really related. If you won't push a fix yourself, I can open a PR at some stage.

@avih
Copy link
Contributor

avih commented Mar 5, 2025

(about the move from elements as arguments to stdin)

IIRC It was added originally at git-for-windows, so I can only imagine this limitation was observed on windows. No idea how much it holds true today

So, I checked where the limit comes from. This is a reproducer, from here:

# start in an empty directory
git init
for b in `seq 1 1000`; do git commit --allow-empty -m c$b; git branch b$b; done
gitk --all &
# wait until it loads
git commit --allow-empty -m another
# press F5 in gitk

In a nutshell, when running gitk --all and there are many references, it needs to pass all of them to git, and this previously hit the command length limit.

On windows the limit is 32K, on Debian it's 128K, so 1000 references (hashes) is ~ 40K, which hits the limit on windows.

These are the 3 places which do that:

        set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
                        --parents --boundary $args --stdin \
                        "<<[join [concat $revs "--" \
                                [escape_filter_paths $files]] "\\n"]"] r]

(...)

        set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
                        --parents --boundary $args --stdin \
                        "<<[join [concat $revs "--" \
                                [escape_filter_paths \
                                        $vfilelimit($view)]] "\\n"]"] r]

(...)

        if {$ids eq "--all"} {
            set cmd [concat $cmd "--all"]
        } else {
            set cmd [concat $cmd --stdin "<<[join $ids "\\n"]"]
        }
        set fd [open $cmd r]

Only $revs or $ids (the hashes) can be big, and the files are used to limit the view and are typically few.

In the first case the files come only from the gitk CLI arguments and/or from the "view editor" files field as far as I can tell, so these are few.

In the 2nd case I'm not sure, but $vfilelimit(...) are already used as argument to git in many other places in gitk master (rather than as stdin like at the 2nd case), so this is also unlikely to hit the limit.

At which case we can restore the file names to become arguments to git, but keep the hashes as stdin to git.

E.g. for the first case (confirmed working):

        set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
                        --parents --boundary $args --stdin -- $files \
                        [list <<[join $revs \n]]] r]

And the 2nd case (not confirmed because I don't know how to reach this line, but it's identical form and command like the first case, maybe up to different files, so it should work):

        set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
                        --parents --boundary $args --stdin -- $vfilelimit($view) \
                        [list <<[join $revs \n]]] r]

So, if we take this approach and restore the files to become arguments but keep the hashes as stdin to git, then:

  • This doesn't bring back the arguments limit issue, because the (maybe many) hashes stay as stdin to git, and the files are few.
  • No longer need to escape the paths and no need for escape_filter_paths, because this is only needed when the paths are at stdin to git (though we can also avoid the need to escape using the cleaner form of stdin argument encapsulation - even if we did keep the paths at stdin).
  • No need to control the encoding of stdin to git (it now only includes hashes and not paths), which means we no longer need the global utf-8 encoding (which IIRC was needed specifically for paths-in-stdin, while for the other cases we did find an alternative solution, like changing exec to open and changing the fd encoding).
  • it probably would still be better to apply some generic solution to make streams utf-8 on windows by default instead of changing specific ones utf-8, because this is what git speaks, but not gitk on windows (and then exec output also becomes utf-8, which otherwise we can't control), e.g. like in this suggestion.
  • While at it, we can create the hash-list at stdin more cleanly as [list <<[join $revs \n]] (in all 3 cases) instead of the hacky-ish <<[join $revs \\n], where the cleaner form is also more correct in a generic way if for whatever reasons $revs or $ids end up containind spaces or other weird chars.

I think it's viable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants