archive/zip: inconsistent non-ascii filename decoding #67878

Closed
ZeinabAshjaei opened this issue Jun 7, 2024 · 7 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@ZeinabAshjaei
Go version

go version go1.21.1 linux/amd64

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='amd64'
GOBIN=''
GOCACHE='/xxx/.cache/go-build'
GOENV='/xxx/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOMODCACHE='/xxx/go/pkg/mod'
GOOS='linux'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVERSION='go1.21.1'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build789774400=/tmp/go-build -gno-record-gcc-switches'

What did you do?

When iterating over the files in a zip archive with the Go standard library's archive/zip package, filenames are decoded inconsistently. When a file is at the root level of the archive, its name is returned with invalid encoding, showing characters such as question marks in place of the original characters. When the same file is inside a folder in the archive, its name is decoded correctly.

What did you see happen?

Steps to Reproduce:
Create a zip archive containing files with filenames that include non-ASCII characters, such as "·".
Iterate over the files in the zip archive using the zip package in Go.
Observe the filenames retrieved when files are located at the root level versus within a folder structure.

Actual Behavior:
Filenames retrieved from files at the root level of the zip archive exhibit incorrect encoding, displaying invalid characters such as question marks. Filenames within folders in the zip archive are correctly encoded.

What did you expect to see?

Filenames retrieved during iteration should maintain consistent encoding regardless of their location within the zip archive. The original characters in the filenames, including non-ASCII characters, should be preserved.

@seankhliao
Member

can you provide an example and code for a reproducer?

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 7, 2024
@seankhliao seankhliao changed the title archive/zip : Inconsistent Filename Encoding Behavior in Go Zip Package Iteration archive/zip : inconsistent non-ascii filename decoding Jun 7, 2024
@ZeinabAshjaei
Author

@seankhliao

  1. Create a zip archive containing a file named file·name.xml at the root level.
  2. Iterate over the files in the zip archive using the zip package.
  3. Observe the retrieved filename for file·name.xml.
  4. Create another zip archive, but this time place the file file·name.xml inside a folder, e.g., test/file·name.xml.
  5. Iterate over the files in the new zip archive using the zip package in Go.
  6. Observe the retrieved filename for test/file·name.xml.
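// Needs imports: archive/zip, fmt, os.
// readZipFile prints the name of every entry in the zip archive backing file.
func readZipFile(file *os.File) error {
	zipFile, err := zip.OpenReader(file.Name()) // open the archive by path
	if err != nil {
		return err
	}
	defer zipFile.Close()
	for _, fileEntry := range zipFile.File { // iterate over the zip file entries
		fmt.Println(fileEntry.Name)
	}
	return nil
}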

@rsc
Contributor

rsc commented Jun 8, 2024

@ZeinabAshjaei Here is a Go program that creates Unicode files in the root and subdirectories and it seems to work fine: https://go.dev/play/p/T6tNxT1HH8M?v=gotip.

What program are you using to create the zip file? My guess is that program is writing bad zip file entries, or at least entries that are incompatible with Go's zip package. If you can attach a small example of a zip file that Go does not handle correctly, that would be helpful. Thanks.
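
For readers who cannot open the playground link, the following is a minimal sketch of that kind of round-trip check. It is not necessarily the linked program: the helper name roundTrip and the entry type are made up for illustration, and the file names are taken from the reproduction steps earlier in the thread. It writes empty files at the root and in a subdirectory of an in-memory archive, reads the archive back, and reports each name along with the reader's NonUTF8 flag.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
)

// entry records what the reader reports for one archive member.
// (Hypothetical helper type, not part of archive/zip.)
type entry struct {
	Name    string
	NonUTF8 bool
}

// roundTrip writes empty files with the given names into an in-memory zip
// archive, reads the archive back, and returns the entries in archive order.
func roundTrip(names []string) ([]entry, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, name := range names {
		if _, err := zw.Create(name); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return nil, err
	}
	var out []entry
	for _, f := range zr.File {
		out = append(out, entry{Name: f.Name, NonUTF8: f.NonUTF8})
	}
	return out, nil
}

func main() {
	entries, err := roundTrip([]string{"file·name.xml", "test/file·name.xml"})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%q NonUTF8=%v\n", e.Name, e.NonUTF8)
	}
}
```

If both names come back unchanged, the archive that shows the problem was most likely produced by a tool that stores names in some other encoding.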

@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 8, 2024
@ZeinabAshjaei
Author

@rsc Thanks for the investigation. I agree: it seems that only the zip file I tested produces incorrect file names. The attached zip file includes 3 PNG files, generated by AI.

GHTest.zip

@ianlancetaylor
Contributor

Thanks. In the zip file you provided, I see the same results using Go's archive/zip package and using unzip -l on a Linux system. In both cases I see DALL�E, where the non-UTF-8 byte is \372 (octal, i.e. 0xFA). Do you see different results that suggest an inconsistency in archive/zip rather than in whatever is generating the zip file?

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jun 11, 2024
@rsc rsc changed the title archive/zip : inconsistent non-ascii filename decoding archive/zip: inconsistent non-ascii filename decoding Jun 28, 2024
@rsc
Contributor

rsc commented Jun 28, 2024

This is working as intended. The archive/zip reader never attempts to translate the names found in the zip file to valid UTF-8. It simply presents the bytes in the zip file, which in the test file are "DALL\x{fa}E" as Ian said.

% hexdump -C GHTest.zip |grep DAL
0012eed0  00 00 44 00 09 00 44 41  4c 4c fa 45 20 32 30 32  |..D...DALL.E 202|
0042f6b0  44 41 4c 4c fa 45 20 32  30 32 33 2d 30 37 2d 31  |DALL.E 2023-07-1|
0042f810  00 00 00 00 00 d0 f1 29  00 44 41 4c 4c fa 45 20  |.......).DALL.E |
% 

The zip reader does set f.NonUTF8 for these names as a signal to client code that they might need to be careful.
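
A minimal sketch of how client code might act on that signal, assuming it only wants to detect and report such names. The helper names buildSampleZip and suspectNames and the sample file names are hypothetical. The sample archive stores a name containing the raw byte 0xFA, mirroring the bytes in the hexdump above.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
	"unicode/utf8"
)

// buildSampleZip returns an in-memory archive with one name containing the
// raw byte 0xFA (not valid UTF-8) and one plain ASCII name.
// Both names are made up for this example.
func buildSampleZip() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	headers := []*zip.FileHeader{
		{Name: "DALL\xfaE sample.png", NonUTF8: true, Method: zip.Store},
		{Name: "plain.txt", Method: zip.Store},
	}
	for _, h := range headers {
		if _, err := zw.CreateHeader(h); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// suspectNames returns the names of entries that the reader flags as NonUTF8
// or whose bytes are not valid UTF-8. Names are returned as raw bytes; no
// re-encoding is attempted.
func suspectNames(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var out []string
	for _, f := range zr.File {
		if f.NonUTF8 || !utf8.ValidString(f.Name) {
			out = append(out, f.Name)
		}
	}
	return out, nil
}

func main() {
	data, err := buildSampleZip()
	if err != nil {
		log.Fatal(err)
	}
	names, err := suspectNames(data)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range names {
		// %q escapes invalid bytes, so the raw value stays visible.
		fmt.Printf("needs care: %q\n", n)
	}
}
```

Converting such a name to UTF-8 requires knowing which encoding the producing tool used, which the archive does not state. If it was IBM code page 437, byte 0xFA would correspond to the middle dot "·", but that is an assumption about the producer. The standard library has no code page decoder, so a conversion would need an external package such as golang.org/x/text/encoding/charmap.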

@rsc rsc closed this as completed Jun 28, 2024