German umlauts in labels are decoded incorrectly #48
Example:

    {'kidgeb': {
        -6: '[-6] Fragebogenversion mit geänderter Filterführung',
        -5: '[-5] nicht im Fragebogen enthalten',
        -4: '[-4] unzulässige Mehrfachantwort',
        -3: '[-3] nicht valide',
        -2: '[-2] trifft nicht zu',
        -1: '[-1] keine Angabe'
    }}

It seems pandas decodes this incorrectly: `"FilterfÃ¼hrung".encode("latin1").decode("utf8") == 'Filterführung'`
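The round trip in the expression above can be reproduced without pandas. This is a minimal sketch, assuming the label was stored as UTF-8 bytes and then read as Latin-1; the variable names are only for illustration.

```python
# Reproduce the garbling and its repair with plain str/bytes methods.
original = "Filterführung"

# UTF-8 bytes misread as Latin-1: the two bytes of "ü" become two characters.
garbled = original.encode("utf-8").decode("latin-1")

# Encoding back to Latin-1 restores the exact original bytes,
# which can then be decoded with the encoding they were written in.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

The garbled string is one character longer than the original because the single umlaut was stored as two bytes.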
Stata switched its file format to UTF-8 encoding with Stata 14 (dta format 118, cf. https://www.stata.com/help.cgi?dtaversion and https://youtu.be/o68ZLxjw-1o). We are using the file format from Stata 12 (dta 115). One would have to check which versions pandas can import.
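To see which format a given file uses, the version can be read from the file header. The following is a rough sketch, not taken from this project: `dta_format_version` is a hypothetical helper, and it assumes that older files (format 115 and below) store the format number in the first byte, while newer files (117 and up) start with an XML-like `<stata_dta><header><release>NNN</release>` header.

```python
import re


def dta_format_version(header: bytes) -> int:
    """Return the dta format number from the first bytes of a .dta file.

    Assumption: old formats keep the number in byte 0, new formats
    spell it out inside a <release> tag at the very start of the file.
    """
    if not header:
        raise ValueError("empty header")
    if header[:1] == b"<":
        match = re.match(rb"<stata_dta><header><release>(\d{3})</release>", header)
        if match is None:
            raise ValueError("unrecognized dta header")
        return int(match.group(1))
    # Old-style header: the first byte is the format number itself.
    return header[0]
```

For a real file, passing the first 64 bytes read in binary mode (`open(path, "rb").read(64)`) would be enough. A result of 118 or higher would suggest UTF-8 strings, anything lower a legacy single-byte encoding.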
The error is in write_json.py:

    with open(filename, "w") as json_file:
        json.dump(stat, json_file, indent=2, ensure_ascii=False)
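The comment does not say what exactly is wrong with this snippet. One possible contributing factor, stated here as an assumption, is that `open()` without an `encoding` argument falls back to the platform's preferred encoding, so with `ensure_ascii=False` the umlauts may not be written as UTF-8 on every system. A sketch that pins the output encoding, using a temporary file and a made-up `stat` dictionary:

```python
import json
import os
import tempfile

# Illustrative data only; the real `stat` object comes from the project.
stat = {"kidgeb": {"-6": "[-6] Fragebogenversion mit geänderter Filterführung"}}

filename = os.path.join(tempfile.mkdtemp(), "stat.json")

# Pin the file encoding so the result does not depend on the locale.
with open(filename, "w", encoding="utf-8") as json_file:
    json.dump(stat, json_file, indent=2, ensure_ascii=False)

# Read the raw bytes back to check what actually landed on disk.
with open(filename, "rb") as raw:
    raw_bytes = raw.read()
```

This only controls how the file is written. If the strings in `stat` are already garbled when they arrive, they still need the Latin-1 round trip first.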
Problem
-------

Pandas decodes data with Latin-1, which leads to problems with German umlauts.

Solution
--------

* Data is now encoded back to Latin-1 to minimize en/decoding errors.
* Add flag `--latin1` and `-l` to indicate that the source files are Latin-1 encoded.

Explanation
-----------

We have two expected cases, keeping in mind that UTF-8 output is desired.

1. Input files are UTF-8 encoded: decoding followed by encoding in Latin-1 will most likely leave the original UTF-8 encoding intact.
2. Input files are Latin-1 or Windows-1252 encoded: pandas decodes these files correctly, which means that we are working with correctly decoded strings in the pandas object. These can then safely be encoded to UTF-8.

We also have one case that cannot be controlled for: if the input is encoded in neither of the expected encodings, Latin-1 decoding and encoding will probably also leave this encoding intact.

Since, at the moment, pandas StataReader decoding is hard-coded to Latin-1, one Latin-1 decode/encode cycle is necessary. Offering more flexibility in the input encoding would require rereading the output in the correct encoding and writing it back to UTF-8, which just complicates the process and slows it down.

Input should generally be kept UTF-8 encoded.
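The cases described in the commit message can be sketched as follows. This is an illustration under assumptions, not the project's code: `normalize_label` and its `source_is_latin1` parameter are hypothetical stand-ins for whatever the `--latin1` flag controls, and the reader is simulated by decoding bytes as Latin-1.

```python
def normalize_label(label: str, source_is_latin1: bool) -> str:
    """Turn a string decoded as Latin-1 by the reader into proper text.

    Hypothetical helper: if the source really was Latin-1, the reader's
    result is already correct. Otherwise undo the Latin-1 decode and
    decode the recovered bytes as UTF-8.
    """
    if source_is_latin1:
        return label
    return label.encode("latin-1").decode("utf-8")


text = "geänderter Filterführung"

# Case 1: UTF-8 input, read by a reader that is hard-coded to Latin-1.
from_utf8 = normalize_label(text.encode("utf-8").decode("latin-1"), False)

# Case 2: Latin-1 input, read by the same reader.
from_latin1 = normalize_label(text.encode("latin-1").decode("latin-1"), True)

# Uncontrolled case: Latin-1 maps every byte value to one character,
# so the decode/encode cycle hands back the bytes unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
```

Because Latin-1 covers all 256 byte values, the cycle itself does not lose information. Only the final interpretation of the recovered bytes can still be wrong.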
Maybe this is related: pandas-dev/pandas#21244