-
Notifications
You must be signed in to change notification settings - Fork 9
gitk: Fixing file name encoding issues. #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Kazuhiro Kato <[email protected]>
Signed-off-by: Kazuhiro Kato <[email protected]>
When resolving merge conflicts, Japanese file names become garbled. Branch A: Change from 1 to 2
invalid filename fix then. |
link git/git#1886 |
I cannot reproduce this behavior. The file changes are displayed in both cases. My locale here is en_US.utf8. The file name is encoded in UTF-8 (I assume), and I am running the commands from an xterm. What is different in our setup? |
On windows, this doesn't have an issue for me. All 3 places marked in red at the image show the same text here (I can't read it, but it looks the same text like at the browser in this issue).
This does happen for me too. Specifically, the diff at the bottom pane is missing only when using the Japanese file name. |
So, this is Windows-specific? What is the explanation for this behavior? Is it that the diff command issued by Gitk does not produce output, because it did not pass a file name to |
I don't know which platform OP uses. I only did a quick test on windows, and not with gitk master. I'll test again on windows and linux with several combinations of shell and gitk, and report back, hopefully soon. |
This review system is b... cr.... unusable. I'll paste the commented diff here:
This part populates I would prefer to have the raw name in Please explain why this change is needed:
It is about |
Please look at the history of this repository and the commit titles. Use the prefix Generally, write the status quo (where the bug exists) in present tense, and then state what has to change in imperative mood. |
Part of this change may become unnecessary once #4 is merged. If you want, you can base your changes on that branch. Notice that the encoding commands now call for |
I don't know, but if you tell me which git commands to run at the command line (posix shell), I can tell you what the output is. Feel free to request as many commands as you think would help you get a better understanding. So far I confirmed that gitk master has the same result as my report above ( I also applied the commit "gitk: Unicode file name support" from #4, but that did not change the result - the diff with the jap filename argument is still missing. (I'm not seeing other commits in that PR which could affect encoding issues). EDIT: If I apply the two commits in this PR instead of the commit from #4, then it DOES fix the missing diff issue with the jap file name arg (I can't judge the correctness of this PR, but the result here seems as intended). EDIT2: the first commit in this PR alone (only adding So far all the above are on Windows using the unicode variant of busybox-w32 shell+env (and "git" itself at the PATH is from the git-for-windows project). tcl/tk are 8.6.16 which I cross-compiled on Linux for windows. Specifically, I also didn't yet test with the git-for-windows bash environment (I don't expect differences, but I'll report once I test that and linux too). |
Yeah, as expected, git-bash.exe from the git-for-windows project (with tcl/tk which comes with it) behaves identically to as described above, including the two "EDIT" comments. That's git 2.46.2, same version which was used with the tests above using the busybox-w32 shell. The missing diff issue also happens with gitk which is shipped with that git-for-windows (which I presume is slightly behind gitk master currently). But as noted, same missing diff with gitk master, and same fixed using the 1st commit in this PR. On linux (Ubuntu, git 2.34.1, all the non-empty I didn't test the merge conflict case with any system, and I don't have further data points to share, unless requested specifically. |
@avih Thank you for doing the tests. I patched Gitk to tell the commands it runs. Here are the interesting ones:
(Replace the commit ID by The issue is that the Japanese text enters Gitk as a command line argument. I don't know whether Tcl picks the Unicode version of the arguments on Windows and keeps them, or if it picks the "ANSI" version and translates them to Unicode. You are in big trouble if it is the latter, but the "ANSI" codepage is unable to hold the Unicode characters typed into the terminal. I am trying to understand if that is the problem. But I still do not understand why As long as there is no clear reasoning why we must dismiss such a warning, a change that overrides the system encoding is not acceptable. |
Based on the information from https://www.tcl.tk/man/tcl/TclCmd/encoding.htm
In the Gitk code around line 12000, add
the following before
This ensures that command line parameters passed when calling Git commands are treated as UTF-8 instead of SJIS (or ANSI) . |
How so? Furthermore, Gitk itself does not make system calls where the names of files in the repository are used. All interaction with the repository goes via Ah, indeed, things may go wrong at this stage. The commands may still look correct on the Tcl side (like those I posted above). How do these strings arrive in Git? Are they corrupted without
Now I am confused. Does it fix the problem that we are trying to solve here? |
So, I started with this sequence, which the the first reproduce sequence above, but more complete to ensure initial state (that's also how I tested before, but adding now for completeness. I don't know whether the fact that only 1 commit exists affects this issue): rm -rf tmp && mkdir tmp && cd tmp
git init
echo "ABC" > ABC.txt
echo "あいう" > あいう.txt
git add *
git commit -m "ABC あいう"
gitk And it does show the two modified files (both at the files pane, and at the diff pane), with correct names as far as I can tell. I.e. both files were added correctly to the repo. For reference, I then ran Then I used your commands, but with HEAD instead of an explicit hash (exactly as shown below, at the shell prompt '$', let me know if you think explicit hash is needed instead):
So it "succeeded", but assuming the last command is supposed to show the diff of that file, it's empty. Just as a reference, I tried also the other file name: $ git diff-index --cached --ignore-submodules=dirty HEAD -- ABC.txt
$ echo $?
0 So this also shows an empty diff?! I presumed this command is supposed to show the diff (which gitk would then display at the diff pane), right? If not, can you tell me which command gitk executes in order to get the dif part when running (or better yet, share your patch to add debug messages of the invoked git commands). The same complete sequence and identical result (up to the hash printed from
So this probably rules out differences in environments which use git and gitk, as well as different tcl/tk builds, at least as far as the missing-diff issue is concerned. |
Also for reference: |
@avih Appologies, the commands that actually produce the patch text are:
I dismissed them on my first reading because I usually have the option Limit diff to listed paths disabled, and then there was no file name. The diff is simply
and then you invoke However, the interesting question is how the command arguments actually arrive in |
Assuming this hash is HEAD (and Not sure if relevant, but it works for me also without
Thanks. I applied it and indeed I see the used commands printed at the terminal while running For future reference, using Anyway, the output with the japanese name argument is interesting. With current gitk master (4a6cc6a), this name appears in gibberish form, same in all places:
Note that there's no line-wrapping of the first command. These seem like actual two newlines inside the output of that command. This is the bug use case, and the diff is indeed missing. Anyway, next. If I also add the first commit in this PR (that one line system encoding), then it shows correctly at the console:
And now the diff does show at the gitk windows. Another interesting data point: If I use gitk master, where the filename at the output appears gibberish, but pipe it into FYI, So it looks like that (obviously we don't strictly care about the console output, but it may indicate some difference with that patch applied, either with the value or representation of the arguments to I'm not sure how to interpret that, but hopefully you would have some insights. EDIT: another data point: with the encoding patch applied where the console output shows correctly without
Yes. I can test more things if you want me, but I think we can rule out a bug in git where it fails to accept unicode names as arguments, and I think the evidence suggest some encoding mis-handling (or a tcl bug) between gitk and tcl/tk. |
Also, not sure if applicable here, but regarding the newlines at the first command, windows does not consider newline as argument separator when parsing command line (you can't test it from cmd.exe, but you can with the So maybe it's worth removing these newlines, if they are indeed included inside one or more of the argumets which gitk uses when invoking git. |
Unrelated concretely to the code here, here's a roughly accurate overview of encoding subjects with windows:
So from all the above, if someone sticks to the Unicode APIs, which I hope is what tcl/tk do internally on Windows, then there should not be any encoding issues as far as CLI arguments to tcl (and gitk) go, and similarly no issues by default when invoking git with unicode arguments. No encoding parameters should be observed, used, or modified. It just works. I'm very weak with tcl/tk, so if someone could post a 1-2 lines test tcl program which "forwards" it's CLI arguments to git (without touching or specifying any specific encoding), then I can try it and confirm that it works out of the box with Unicode arguments, or that it doesn't work, and then we'd need to figure out what's missing. But let's assume that it works out of the box without any encoding issues. What remains to handle is the encoding of streams. Specifically in our context, at the very least stdout from git which is captured by gitk and processed further. Streams are an issue as far as encoding go, because the internal windows UTF16-LE is highly uncommon to write to stdout, or read from stdin. The obvious alternative encoding to use, if we wish to retain portability between applications without data loss (assuming the string is representable in Unicode), is UTF-8, which is very effective in all inter-process streams. These are stdin/out/err, but also content of text files on disk, etc, and so First problem with UTF-8, however, is that typically this is not the system ANSI codepage, which Windows has for whatever reasons (at least legacy, DOS compat, but maybe more). So now the application faces an issue: it won't use UTF16-LE because no one expects that. It could use the system ANSI codepage encoding for this stream, knowing that data-loss is guaranteed in some cases, or it could decide to ignore the system ANSI codepage and use UTF-8 anyway. Many modern portable applications do just that - use UTF8 stdin/out even if it doesn't match the system ANSI codepage. I think this is what git does too, at the very least when the respective stream is not a TTY. But TTY is the other issue with encoding, if one is not careful. Console Unicode APIs do exist, and using those would guarantee correct unicode display of unicode strings regardless of the console codepage. Same goes for input typed into the console. (up to missing glyphs which affect the display, but the unicode values still pass correctly via the APIs regardless). So the right thing which a modern portable application should do on windows when printing to stdout is to use the console Unicode APIs if stdout is a TTY, or print UTF8 otherwise. Same with stdin - use the Unicode API if it's a tty, else assume it's UTF8. Whether the application uses I believe "the other way around" is more common with portable applications, because then internally it can keep using C strings everywhere, which remains compatible on all platforms, but at the windows APIs edges it's converted to/from windows Lesser applications can do other things. Ranging from using only ANSI Windows APIs in general (so unicode CLI arguments are not guaranteed to work, not even with The question is where do git, gitk, and tcl stand in this landscape. Hopefully everything works out of the box between gitk and git with any Unicode arguments, and that the only non-obvious thing which gitk should do is know that the output from git's stdout is UTF8, and take actions accordingly. EDIT: there's another option to remain portable, and that's to specify that the ANSI APIs are UTF8. This is a magical solution for almost everything. |
pull request move to #9 |
see git file name encoding UTF-8 documents. |
I think there's still an issue which my unicode test script didn't excercise, and it fails due to the quoted comment subject, and so far the only solution I have for it is via the system encoding, and setting the fd encoding is not enough. This is an issue after the "fix" above which configures the fd as utf8. The test script at #7 (comment) tested, among others, that it can read UTF-8 from git's stdout, and concluded that setting either the system encoding or the fd encoding to utf-8 makes it read the correct unicode value (and then it can open it as file etc). What it didn't test, however, is writing unicode to git's stdin, and I think it fails when the stdin string has correct unicode value. What that EDIT comment says is that with default (ANSI) system encoding, if we read git's stdout into a variable, and then feed this variable back to git's stdin ("full circle use case"), then it seems to work, but the variable has incorrect unicode value. That's because any arbitrary stream of bytes (maybe except byte value 0) is a valid string in the single-byte encodings (in my case cp1252). So if we don't set the system encoding, and read git's stdin, then it's interpreted incorrectly as cp1252, but it's still a valid cp1252 byte stream, so no information is lost when it becomes (wrong) unicode string in tcl. Then, when we feed it back into git's stdin using Now, if we fix the first read part by configuring the fd to utf-8, so that tcl has a correct unicode value, and then we try to write it to git's stdin, then, like before, it translates it into a cp1252 stream for git's stdin. But now, because the unicode string is correct, but the default stream encoding is still cp1252, then it no longer translates to the original byte-stream which we read, and now git doesn't understand the value. What we need here is to tell The problem is, I don't think there's a way to do that. We can do it while reading because we have an fd which we can configure, but we can't configure anything at the output stream which I said previously that the fixed log now looks correct, and this is true
But what it doesn't show is the byte stream which git sees, and whether git was able to digest it correctly, and I think it doesn't, or at least not always. I'm not clear exactly about the details, but if I add to git a new file I know, it looks like a spaces-in-filename issue and not an encoding issue, but if I do the same with a file Additionally, if I set the system encoding to utf-8, then the I'm baffled, but it might be a weird combo of incorrect arguments to There might be some relation to #4 (comment) but I don't know currently what's noise and what's related. Still investigating, and any insight others have might help. I should probably get back to the test script and add this operation too, to test it in isolations and understand whether there are new factors in play. |
So this is correct. With default (ANSI) encoding and correct unicode variable (because we configured fd to utf8), when using open or exec with So for
This is also correct, but for a weird reason, and gitk doesn't do the right thing with the unicode name either way, as described above. But somehow git finds a commit with one of the names, but not with the other, despite both names in its stdin being equally garbage. From shell, with correct names, same arguments which gitk uses, and correct utf8 stdin for git: $ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "あいう.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit f3431b1756435a4bddb3627064c9ab34fbd54dd9
$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "spaces あいう.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit 4baac406841cb30dcdbb05acccc0556f5052c154 From shell, with identical garbage names which gitk passes as stdin to git when the unicode variable has correct name: $ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "???.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
commit f3431b1756435a4bddb3627064c9ab34fbd54dd9
$ # why did that "work"? with the same commit-hash result like with the correct name...
$ printf %s\\n 02aa13611f2a068bbce53e336d4a5bc29fac679f -- "spaces ???.txt" \
| git log --no-color -z --pretty=raw --show-notes --parents --boundary --stdin | head -1
# but this didn't ?! Anyway, gitk doesn't do the right thing, and I also confirmed it by adding an identical So this is problem which needs to be fixed, and at least one solution is global utf8 system encoding, which makes everything just work. It might be possible to use the same approach like the patch fixes git's stdout, but also for stdin, because I believe the fd from open is bidirectional, so we can run the I didn't test that, and I also suspect this may lead to deadlock if the stdin value is very big, unless we starts doing alternate non-blocking reads/write with that fd (e.g. if git processes each input line as it reads it, and has some output which it expects us to read, but we're currently writing, then git is blocked because we're not reading it, and we're not writing because git is blocked and doesn't read). The deadlock scenario might be possible with |
By the way, I know that at least one of the new
and line 8183 is a bit different from gitk master due to the additional debug code I added. In master it's at I didn't try to follow where this value come from, or how it works, but I do find it unexpected, because in my test script such conversion didn't work, at least not when the value was read from git's stdout. For reference, here's my wrapper and proc dbg_lnum {{lvl 0}} {
set line 1
while {$line == 1} {
incr lvl -1
set line [dict get [info frame $lvl] line]
}
return $line
}
proc dbg {data {linelev 0} {title ""}} {
if {$title == ""} {set title [dict get [info frame -1] proc]}
set line [dbg_lnum [expr $linelev - 1]]
puts stderr "$title\([format %5s $line]) $data"
}
rename encoding real_encoding
proc encoding {opt {enc ""} {data ""}} {
if {$enc == ""} {
set r [real_encoding $opt]
# dbg [list $opt --> $r] -1
return $r
} else {
set r [real_encoding $opt $enc $data]
dbg [list $opt $enc $data --> $r] -1
return $r
}
} |
Can you take a look at the following commit code? With this,
All of them now work as expected. |
Interesting, where does this come from? It's from today, and the author is not you or me... EDIT: oops, it is you! I thought this comment was by @j6t . sorry! I did look at those exact places too, and even traced the escape thing back to the original reported bugs (not right now, yesterday), so I have ready opinions about these already: - lappend escaped [string map {\\ \\\\ "\ " "\\\ "} $path]
+ lappend escaped [string map {\\ "/" } $path] This was the obvious alternative fix IMO too (are the quotes required for But the issue which this is supposed to fix, not sure if 100% correctly, is when invoked from a cmd.exe shell, where two things can happen:
The examples you posted of the backslash-in-path don't excercise the second map pair, which I have to presume is there for a reason (not enough info at the bug report or commit message which added those. I don't have the link to the report, but it wasn't too hard to track it, starting with git blame). One way to generate inputs which supposedly this tries to fix is to tab-complete in cmd.exe names with spaces I know that this code is only used with paths, but the same issue might also happen with arbitrary user input (but I don't know whether gitk expects arbitrary CLI args which are not paths). - --parents --boundary $args --stdin \
- "<<[join [concat $revs "--" \
- [escape_filter_paths $files]] "\\n"]"] r]
+ --parents --boundary $args \
+ [list $revs "--" [escape_filter_paths $files]]] r]
[...]
- --parents --boundary $args --stdin \
- "<<[join [concat $revs "--" \
- [escape_filter_paths \
- $vfilelimit($view)]] "\\n"]"] r]
+ --parents --boundary $args \
+ [list $revs "--" [escape_filter_paths $vfilelimit($view)]] r]] I didn't look into it very carefully, but superficially it looks to me a much better construction than the original of adding newlines escape sequences as a method to encode it in a way which the "eval" (or equivalent) at "open" would then "reverse". It also looks similar to how I constructed the arguments to open and exec in my experiments, which programatically handles the preparation instead of needing to care and think about how the construction would then be interpreted with "eval". These are also the exact places which made me think how this handles spaces in names (which started when I noticed the args diff before/after the sanitation), and the reason I tested it, and that whole line of remaining issues with Unicode and stdin to git. (empirically it seems like spaces are handled correctly, or at least seem to work, but I was highly distracted by the stdin-to-git issue that I never got back to analyzing the handling of spaces). The rest look fine. Somewhat familiar. Correct me if I'm wrong, but I'm not seeing here a fix for stdin-to-git, right? I.e. this part is still broken also with this patch, and possibly even more broken than before in some cases, due to this "full circle case, value to git's stdin is correct despite unicode value being incorrect", because now it's never correct (with unicode values), but previously it was correct in some or all cases (but of course issues elsewhere due to incorrect unicode). |
Oops, I was wrong...
Actually, this looks wrong, and same for the other similar diff. It might avoid the stdin-to-git issue altogether, which I noticed just now, but I did check that too yesterday, and it was added for a reason - because it ran into limitations of command-line length, so this patch would bring back that limitation. IIRC It was added originally at git-for-windows, so I can only imagine this limitation was observed on windows. No idea how much it holds true today, or what the limit is on supported platforms, e.g. it's possible that it was added to address a Windows XP limitation, but now only win10+ is supported... Like the other commit message or bug report, IIRC there are no test cases to reproduce the issue which this is supposed to solve... |
check next commit kkato233@c43098b This is a pattern where the temporary file is explicitly created in UTF-8. |
I think it can work. Just ensure that the file is not deleted until after we're read all the output we need from Personally I'm not fond of creating temporary files routinely only to bypass the encoding issue, which became a problem only when the patch landed to workaround command line limitation issues. First, we don't know whether the original command length limit still exists today. If it doesn't, then we can revert the stdin patch and this problem goes away (but we still need to address the stdout-from-git issue). Also, because setting the system encoding solves both these problems much more elegantly. As I noted earlier, modern portable programs on windows use utf8 input/output by default, because assuming streams are whatever default ANSI codepage is lossy-by-definition. It's not strictly a bug, but I think it was a wrong design choice in tcl. Luckily, it seems there's a solution with one line at the begining of the script. This is even more true with gitk, because git is documented officially to use utf8 input/output, and that's the main process which gitk interacts with. But another issue is that the tcl docs seems generally against changing the system encoding. From my few experiments, I couldn't identify an issue on Windows when the default encoding is utf8. This includes things which the tcl docs says are affected by this, but I don't pretend to know or having tested the full potential scope of it. So my assessment is that changing it to utf8 by default on windows is 100% net win, and the docs "warning" shouldn't actually include window, or maybe even say the oposite - that it's recommended on windows, but I can't find any data sources to confirm or disprove that, and I'm not going to start reading tcl source code for that. Even so, IMO we should change the default encoding (windows only), maybe remove all the explicit utf8 conversions (though they might also be useful on other platforms), and wait for relevant bug reports. I think git-for-windows has enough gitk users that if this causes problems then it will come up hopefully sooner rather than later, and then we can re-assess it. But my hunch is that, up to leftover bugs, this would be a net win. |
Few things:
Here's the part of the log which contains the conversions (extracted as a full block, so not all lines are encoding-related). The first line number is in my modified code, the second number is in gitk upstream (8183 in my code is 8125 in gitk master, etc). This is with global encoding set to utf8, so supposedly places which convert from bad form to good form are indeed required even in this case, like the first line:
EDIT: opps, the file EDIT2: also accidentally Also note that some of the conversions are apparently of content where encoding doesn't matter, but maybe there can be other content at these lines which does require conversion. Unrelated, it might be useful to add some debug option to gitk, along the lines of what I posted earlier, but only enabled with some CLI option, like Also unrelated, one thing which global utf8 does not fix is literal utf8 strings at the script source file, which most editors would display correctly when editing the source code. These appear to always be interpreted as the system ANSI codepage, even if setting the global encoding comes before these literal strings at the source code. So there's that too... (gitk probably unaffected, but tcl scripts in general probably are) |
So this is a problem on several levels: Confirmed potential deadlock with this approach and big stdin: set x [exec cat gitk]
set fd [open "| cat" r+]
puts $fd $x
# unreachable due to deadlock in puts, but writing a short value does work Another issue is that
Confirmed. Both issues are avoided using the set x [exec cat gitk]
set y [exec cat <<$x]
puts "(($y))" or set x [exec cat gitk]
set fd [open [list | cat <<$x] r]
set y [read $fd]
puts "(($y))" which prints the gitk source code inside |
So, first, this is what tcl 8.6 docs say about changing the system encoding, which is not much:
Then, there's more info at the wiki here, adding a "glob" example, the fact that it's a per-process thing, and that it's not recommended:
Then, I talked to a very kind person at the tcl IRC channel, which seems very knowledgable (I'd guess he's a tcl developer), which said few interesting things:
For me, also inportant, that despite the apparent knowledge, and except IPC streams which he knew of (and which for us get fixed by switching globally to utf8), there was nothing obvious he could think of which would break due to this. But again, that's not a promise, just off the top of his head. So I think this is largely encouraging about changing the default on windows, and we should do that, and see what happens, and re-assess it if we get bug reports. |
This ( As noted initially, this conversion doesn't work if the value was read with the ANSI (default) encoding. If the entirety of these 3 binary streams is always utf8, then it feels more correct to configure the fd as utf8 instead of binary, and then the values are unicode-correct at an earlier stage of the data flow, and the later conversion to utf8 can be dropped - possibly in more than one place which deals with data from these streams. But this is a real "if". I think it's also possible that the paths parts are decoded as utf8, while other parts are decoded as something else (
I did a quick test with a non-BMP emoji, and it shows correctly at the logs/diff when |
from
This is the text file's content encoding. |
As the "very knowledgeable person" alluded to above (and on behalf of @tcltk, of whom I'm an admin) it's pretty accurate. Tcl's been strict about using the Possible actionsI advise either:
I also advise putting all this machinery inside a procedure so that the rest of your code can pretend it doesn't exist. You might already be doing this, and if you are then I approve strongly. (I've not looked. 😉) Other notesIf you have problems with emoji, the only supported fix will be upgrading to Tcl 9.0. While fixes were proposed for the 8.* series, we've decided to not release an 8.7 to allow us to focus our development efforts. (The changes were large enough that there was no reasonable prospect of backporting to 8.6.) Don't forget to seek supportSpecific problems should be raised either as tickets (or for tk) or reported to the mailing list. When doing so, please mention that this is in relation to gitk; it's a known software project that we want to support. |
Thanks for the comment. These suggestions are valuable. I know it's not the only suggestion, but changing encoding right before/after is obvious with exec, but less so with open, because the stream is processed by the calling code, and there's not obvious place (to me) to restore the encoding "after" while keeping it encapsulated in a way or place which the caller would not "see". But even if there is, the caller does work before the stream is closed, and this work would happen with the encoding still utf8 - which is what this solution was hoping to avoid. It could of course be refactored to not do any work before the whole stream is read and the encoding restored, effectively becomes exec, and it needs a refactor... But more than everything, all of these are potential solutions which do look potent, but IMO only after we know there is a problem with the seemingly perfect solution of changing the global encoding once to utf8 on windows. Obviously we don't know whether there are issues or not, and unfortunately it's also not easy to find an official answer other than source code analysis of the tcl itself. But so far, in all my testing (which I don't presume cover everything), I couldn't identify even a single issue... EDIT: or to be more specific, I couldn't yet identify a thing in tcl which behaves differently depending on global encoding on windows, other than streams - which is exactly the thing which we want to change to utf8 by default. |
The suggestion:
My reply:
Hmm, it might actually be as simple as that, because apparently an fd which I also tested that I also tested reading ~ 100K output from such fd after the system encoding was restored, and it remained utf8 for the whole read, so it's not that the first BUFSIZE happened to be interpreted correctly before the encoding was restored. It actually stays like that for as long as we have use for that (but can still be modified with Here's a patch for that, on top of gitk master (without any other patch which was suggested previously), which currently modifies only open and exec on Windows, and apparently fixes everything... (are there other commands in gitk which use streams?): diff --git a/gitk b/gitk
index bc9efa1..0aef523 100755
--- a/gitk
+++ b/gitk
@@ -40,6 +40,19 @@ proc is_Cygwin {} {
return $_iscygwin
}
+# run $cmd list with a specific global encoding set for the duration of $cmd.
+# useful for using with exec or open etc to get utf8 streams with git when the
+# system encoding is not utf-8, like windows, but possibly elsewhere too.
+# an fd which "open" returns remains configured as $enc even after restoring
+# the global encoding to something else, but fconfigure can still modify it.
+proc run_enc {cmd enc} {
+ set orig [encoding system]
+ encoding system $enc
+ set r [uplevel 1 $cmd]
+ encoding system $orig
+ return $r
+}
+
######################################################################
##
## PATH lookup
@@ -126,7 +139,11 @@ proc exec {args} {
}
}
set args [sanitize_command_line $args $i]
- uplevel 1 real_exec $args
+ if {[is_Windows]} {
+ run_enc [list uplevel 1 real_exec $args] utf-8
+ } else {
+ uplevel 1 real_exec $args
+ }
}
# Override `open` to avoid unsafe PATH lookup
@@ -139,7 +156,11 @@ proc open {args} {
set command_line [string trim [string range $arg0 1 end]]
lset args 0 "| [sanitize_command_line $command_line 0]"
}
- uplevel 1 real_open $args
+ if {[is_Windows]} {
+ run_enc [list uplevel 1 real_open $args] utf-8
+ } else {
+ uplevel 1 real_open $args
+ }
}
# End of safe PATH lookup stuff I also tested inside gitk itself to run the same stdin into One important thing though, is that original suggestion also included:
And I don't know whether gitk is multi-threaded... @dkfellows many thanks yet again. As far as I can tell this is ideal... Would still be nice to know whether global utf8 encoding on windows changes anything other than streams, but at least we now hopefully have a good generic solution in case the "here be dragons" documentation for system encoding is not wrong... (but my assessment for windows is still that it's wrong). |
@dkfellows Many thanks for taking the time to propose solutions. @avih thank you again for the extensive testing. If I understand the concept of the system encoding correctly, the warning not to change it is based on the assumption that the system encoding is the "language" in which Tcl communicates with the environment. This "language" is dictated by the locale on Unix and the ANSI codepage on Windows. If it is changed, the Tcl script talks in the wrong way to the world. Is this reasonable or am I way off the truth? In the case of Gitk, I have to wonder: Whom do we actually talk to and in what language? We talk to Or is this the wrong conclusion? |
A slightly different alternative to this implementation, is to grab the default encoding once on startup, and use that when restoring the original encoding instead of capturing this value at the begining of this function. Because this implementation can result in permanently changing the default encoding if this functions is called from two threads concurrently (thread A changed to utf8 and runs exec. before it completes, thread B calls this function and incorretly sets The suggested alternative is still not thread safe (and maybe that can be addressed too), but at most it would break during concurrent calls from two threads, but without a lasting side effect like the original implementation quoted above. |
Please refer to commit
|
I didn't try it, but I'm guessing it works. The hard part here IMO is/was identifying the exact problems and which approaches can address them, and then taking a good decision. I think by now we've identified many issues, which are hopefully all the issues in the context of unicode on windows, as well as few potential approaches and specific solutions. Currently IMO the decision seems between global utf8 encoding once for the whole script, or not. If the global approach decision is taken, then the patch is one line. If not, then there are several alternatives which were suggested here in terms of patches and data points, starting from handling explicitly only the cases which need a solution at one end, to the other end which makes exec and open utf8 by default without changing the global encoding or code in any other parts of gitk. As much as I do have a preferred solution, which I've already stated, the decision is by @j6t so let's wait for that, and take it from there once a decision is made. Either way, I don't think writing the actual code should be particularly challenging. It's just implementation details. |
Just as another data point, tcl 9 apparently switched to utf-8 system encoding by default on windows, instead of the ANSI codepage, at least on win10+. This doesn't solve any streams interoperability issues which the default ANSI encoding supposedly solves, and talking to an app which expects some ANSI codepage in its stdin would be broken (because tcl would output utf8, but the other app didn't suddenly start understanding utf8). It effectively pretends that the global windows system encoding is utf-8, and runs with that (it uses the utf8 manifest, which I mentioned at the bottom of this post) It still uses the Unicode windows APIs like before, and not the Note that it's possible that tcl uses internally the I've been looking at the tcl source code, but so far I couldn't identify what they are (the Personally I think it's another argument to change the global tcl encoding ourselves once and for all, just like tcl 9 decided to fake the windows global encoding. utf8 manifest in tcl9: https://github.com/tcltk/tcl/blob/7532f625dd115d13de41c9e696cdf6f1124b270c/win/tclsh.exe.manifest.in#L40 The init code which decides what the system emcoding is, using The init code has not changed between tcl8 and 9. only the utf8 manifest which changes the reply from GetACP, and which makes the Note that the utf8 manifest can be attached to tcl 8.6 pre-built binaries, and doesn't require re-compile, with identical effect (the char APIs become utf8, and GetACP returns CP_UTF8, and tcl streams would be utf8 by default out of the box on Windows). And same with any other application. It's a cheap method to make existing legacy binaries suddenly talk utf8. It's just that the default The manifest works on win10 1903+, has no effect in win7/8, and makes the app not run at all on winxp (assuming it did run on xp before the manifest was attached). This is also the main "trick" which busybox-w32 does in its unicode build (in addition to other things, like custom TTY interaction for unicode, enabling utf8-aware line editing, and few more, but the main thing is the manifest, because it does use the char APIs for the most part internally). |
Thank you. But the patch mixes two issues. The undisputed issue is that lines containing file names are not read as UTF-8 from The other part of your patch is a good solution if we do not switch to UTF-8 globally. |
Yeah, that's an interesting twist. I begin to like setting the encoding globally. |
This is a refinement of the patch here, still in case the decision is not to change the global encoding once and for all, and still on top of gitk master, with the following differences:
For clearer code construction, the additionl wrapper actually adds two levels of wrappers and not one, because the wrapper is generated programatically, and it's clearer when the generated code is small, and simply calls an existing "normal" procedure which does the actual work (same So overall, open/exec now have 3 levels of wrappers. the previous patch above has 2, gitk master has 1 wrapper level, and a week ago it had 0 wrappers. Performance impact is within the noise level as far as I can tell, and over a single gitk run with few dozens open/exec, I couldn't measure a difference between gitk master and with this patch (all runs with/without the patch took ~ 2.4 s +/- ~ 100 ms). diff --git a/gitk b/gitk
index bc9efa1..97116db 100755
--- a/gitk
+++ b/gitk
@@ -40,6 +40,47 @@ proc is_Cygwin {} {
return $_iscygwin
}
+######################################################################
+##
+## Windows: Use utf-8 streams by default, because git outputs/reads UTF-8
+##
+## Streams are affected by the tcl system encoding, but System APIs on Windows
+## are normally not, with unclear exceptions. So instead of changing the global
+## tcl encoding once and live in fear, change the encoding to utf-8 right
+## before commands which create streams, and restore it right afterwards.
+## The streams remain configured as utf-8 even after the encoding is restored,
+## but fconfigure can still change that normally.
+
+# Set system encoding to $enc, run $cmd list, restore system encoding.
+# $cmd runs at uplevel $level relative to the caller (don't use #VAL).
+# performance impact is typically at most negligible.
+# Note: not thread-safe: system encoding is per-process, not per thread.
+proc run_enc {cmd enc {level 0}} {
+ set orig_enc [encoding system]
+ encoding system $enc
+ set r [uplevel [expr {$level+1}] $cmd]
+ encoding system $orig_enc
+ return $r
+}
+
+# make command $cmd_name execute with system encoding $enc from here on
+# e.g.
+# wrap_cmd_enc foo utf-8
+# does:
+# rename foo enc_wrap_orig_foo
+# proc foo {args} {run_enc [concat enc_wrap_orig_foo $args] utf-8 1}
+proc wrap_cmd_enc {cmd_name enc} {
+ set orig_name enc_wrap_orig_$cmd_name
+ rename $cmd_name $orig_name
+ proc $cmd_name args "run_enc \[concat $orig_name \$args\] $enc 1"
+}
+
+# we need utf-8 with exec's <<... and open's fd-ret-val and <<...
+if {[is_Windows]} {
+ wrap_cmd_enc open utf-8
+ wrap_cmd_enc exec utf-8
+}
+
######################################################################
##
## PATH lookup |
There might be some inaccuracies below, as I only learned of this recently. TL;DR: since tcl 8.6.11 there's some support for non-BMP codepoints, in the sense that they most probably don't break when processed as whole strings (and maybe they did always break in 8.6.10). But However, iterating a string "chars" using There are likely several things broken or unexpected beyond the length, and I don't know the exact scope of it. I'd imagine tcl doesn't make too many promises on what should or shouldn't work in 8.6.11+. This is roughly analogous to how an old C program deals with UTF-8 input. It can work with the strings, it can tokenize based on ASCII chars (like space, etc), but not everything works as expected when trying to process individual "chars". (but I'm guessing the 8.6.11 developers had to put in considerable effort to make that analogy generally transparent) Tcl 9 supposedly doesn't have these issues, but If I understand correctly, the 8.6.11+ versions should still deal with them reasonably as long as one doesn't look too closely into these strings. I'd guess tokenization based on ASCII chars (and maybe any BMP codepoint) would still work. This is the same on unix/linux and Windows, with the normal difference that on windows that final utf8 string is translated to a |
So I found an old 8.6.7 build of mine for windows, and did some experiments. My estimation based on these is that non-BMP codepoints in utf-8 input are completely broken. They probably translate to junk internally (judging by their bytelength - 8, and string/split length - 4), and observably don't translate back to the original non-BMP utf-8 input in utf-8 output streams, and likely the same with system APIs. This would effectively make 8.6.10 BMP-only as far as utf-8 is concerned. If I understand correctly, this is also the same on unix/windows, but I didn't try 8.6.10 on linux. In gitk, non-BMP filename argument are broken with this 8.6.7 tcl/tk build (with my open/exec utf8 patch), with the emoji in the utf8 stream to git being 8 bytes instead of 4. This is most probably unfixable. Without specific arguments to gitk, the diff and filename do show at the bottom diff part, but the name is displayed incorrectly, again, probably unfixable with this version. I'm guessing it shows without argument because gitk only reads git's output in this case, without needing to communicate specific non-BMP names (but which it does need to do with a filename argument). The The So I'm guessing this makes non-BMP codepoints broken in 8.6.10 also when used as arguments to the script even on windows (and then using them with system APIs, etc), and not only with streams. It does feel unexpected to me, because I would have guessed that it should translate from windows UTF16 to tcl's BMP-only utf-8 without data-loss, and same in the other way back into windows wchar_t, but I didn't explore this further. The first part (win utf16 to tcl's utf8) presumably does work, judging by the 6-bytes utf8 "pair" mentioned above, so it's not obvious to me why it breaks the other way (when using it as direct argument to git). But IMO it's not worth exploring this version further. EDIT: the arguments-to-script not working seems to be specific in gitk, and indeed probably not in general. Because gitk uses the file name first as direct argument to git - which does seem to work, but then grabs it from the utf8 output from git's reply, and here it blows up because tcl 8.6.7 can't digest non-BMP in utf8 inputs. Then, when it tries to use it again (even as argument), it's junk value which git (or windows APIs) can't restore to the original utf16 or utf8 value. So on windows, non-BMP in CLI arguments to scripts should work with windows APIs (and probably file names, etc), but still break when trying to digest or output correct utf-8 in streams. |
This is what lappend escaped [string map {\\ \\\\ "\ " "\\\ "} $path] And this is the gist of the patch suggestd above (which also touches
But I misunderstood what It's definitely not to address quoting issues in cmd.exe (and most likely there are no such issues), and also git doesn't have issues with backslash in filename, neither as arguments nor in its stdin. Initially revs/files were passed as normal (CLI) arguments to git - which is also what the patch wanted to bring back. This was 100% good code. But it apparently bumped into command line length limit with many files and/or revision arguments, and therefore this was changed to send the revs/files to git via its stdin using the Regardless of encoding issues, this stdin now becomes another value at the argument to However, these commits didn't take everything into account. Instead, they only took care of 3 things:
Hopefully needless to say, there are many more ways to escape an But it likely doesn't end there. Think what various other things can do at a value which This is also the reason why The solution is quite simple, and was explained at #7 (comment) : any value which we want to preserve unmodified after (or all the command + arguments as one flat list - but which we can't do in this case, and also, tcl 8.5 introduced the So there's no need for [list <<[join [concat $revs -- $files] \n]] instead of the current code which does this (and a similarly at the other place with "<<[join [concat $revs "--" [escape_filter_paths $files]] "\\n"]" |
Wow, nice catch! Do you have a patch that fixes this? It is unrelated to the topic of the current issue and would have to be a new PR. |
Well, this is already half of the patch. The other half is the other line where I also searched the code for set cmd [concat $cmd --stdin "<<[join $ids "\\n"]"] But assuming With file names it's different, because they're arbitrary and come from git itself, or as CLI arguments to gitk, and it's very not-elegant to try to escape all the possible things, when a list is way simpler and should be bullet proof.
Yeah, not really related. If you won't push a fix yourself, I can open a PR at some stage. |
(about the move from elements as arguments to stdin)
So, I checked where the limit comes from. This is a reproducer, from here: # start in an empty directory
git init
for b in `seq 1 1000`; do git commit --allow-empty -m c$b; git branch b$b; done
gitk --all &
# wait until it loads
git commit --allow-empty -m another
# press F5 in gitk In a nutshell, when running On windows the limit is 32K, on Debian it's 128K, so 1000 references (hashes) is ~ 40K, which hits the limit on windows. These are the 3 places which do that: set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
--parents --boundary $args --stdin \
"<<[join [concat $revs "--" \
[escape_filter_paths $files]] "\\n"]"] r]
(...)
set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
--parents --boundary $args --stdin \
"<<[join [concat $revs "--" \
[escape_filter_paths \
$vfilelimit($view)]] "\\n"]"] r]
(...)
if {$ids eq "--all"} {
set cmd [concat $cmd "--all"]
} else {
set cmd [concat $cmd --stdin "<<[join $ids "\\n"]"]
}
set fd [open $cmd r] Only In the first case the files come only from the gitk CLI arguments and/or from the "view editor" files field as far as I can tell, so these are few. In the 2nd case I'm not sure, but At which case we can restore the file names to become arguments to git, but keep the hashes as stdin to git. E.g. for the first case (confirmed working): set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
--parents --boundary $args --stdin -- $files \
[list <<[join $revs \n]]] r] And the 2nd case (not confirmed because I don't know how to reach this line, but it's identical form and command like the first case, maybe up to different files, so it should work): set fd [open [concat | git log --no-color -z --pretty=raw $show_notes \
--parents --boundary $args --stdin -- $vfilelimit($view) \
[list <<[join $revs \n]]] r] So, if we take this approach and restore the files to become arguments but keep the hashes as stdin to git, then:
I think it's viable. |
fix: file name encoding issues.
fix: when resolving merge conflicts, japanese file names become garbled.
Cc: Johannes Sixt [email protected]