German umlauts in labels are decoded incorrectly #48

hansendx · 2019-09-24T15:04:11Z

Maybe this is related: pandas-dev/pandas#21244

hansendx · 2019-09-24T15:15:47Z

Example:

{'kidgeb': {
    -6: '[-6] Fragebogenversion mit geÃ¤nderter FilterfÃ¼hrung',
    -5: '[-5] nicht im Fragebogen enthalten',
    -4: '[-4] unzulÃ¤ssige Mehrfachantwort',
    -3: '[-3] nicht valide',
    -2: '[-2] trifft nicht zu',
    -1: '[-1] keine Angabe'
  }
}

It seems pandas decodes the labels with the wrong encoding:

"FilterfÃ¼hrung".encode("latin1").decode("utf8") == 'Filterführung'
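The one-liner above generalizes to a small helper. The following is a minimal sketch, assuming the labels are UTF-8 bytes that were decoded as Latin-1; `fix_mojibake` and the `labels` dict are hypothetical names used for illustration and are not part of the project.

```python
def fix_mojibake(text: str) -> str:
    """Undo a Latin-1 decode of UTF-8 bytes; return text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (for example, already correct text): leave it alone.
        return text


# Simulate the bug: UTF-8 bytes read as Latin-1.
broken = "[-6] Fragebogenversion mit geänderter Filterführung".encode("utf-8").decode("latin-1")

# Hypothetical label mapping with one broken and one plain ASCII label.
labels = {-6: broken, -3: "[-3] nicht valide"}
fixed = {code: fix_mojibake(label) for code, label in labels.items()}

print(fixed[-6])
```

The `try`/`except` keeps the helper safe to call on text that is already correct: a correctly decoded umlaut encodes to a single Latin-1 byte that is not valid UTF-8, so the original string is returned.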

kwenzig · 2019-09-25T07:19:43Z

Stata has used UTF-8 in its file format since Stata 14 (which is dta format 118; cf. https://www.stata.com/help.cgi?dtaversion and https://youtu.be/o68ZLxjw-1o). We are using the file format from Stata 12 (dta 115). One would have to check which versions pandas can import.

mpahl · 2019-10-01T14:40:47Z

The error is in write_json.py.
If you add "ensure_ascii=False" in line 292 of write_json.py, the output should be correct.

    with open(filename, "w") as json_file:
        json.dump(stat, json_file, indent=2, ensure_ascii=False)
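To see what `ensure_ascii=False` changes, here is a small sketch using only the standard library. The `stat` dict is a hypothetical stand-in for the real object, and the temporary file path is made up for the example; passing `encoding="utf-8"` to `open` is an addition of this sketch so that the result does not depend on the platform default.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the real `stat` object.
stat = {"label": "geänderter Filterführung"}

escaped = json.dumps(stat)                       # default ensure_ascii=True: umlauts become \uXXXX escapes
readable = json.dumps(stat, ensure_ascii=False)  # umlauts are written literally

# Write the file with an explicit encoding, then read it back.
path = os.path.join(tempfile.mkdtemp(), "stat.json")
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(stat, json_file, indent=2, ensure_ascii=False)

with open(path, encoding="utf-8") as json_file:
    reloaded = json.load(json_file)

print(escaped)
print(readable)
```

Both forms parse back to the same strings, so the flag affects how the JSON file looks on disk rather than what a JSON parser reads from it.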

Problem
-------
Pandas decodes data as Latin-1, which leads to problems with German umlauts.

Solution
--------
* Data is now encoded back to Latin-1 to minimize encoding and decoding errors.
* Add the flags `--latin1` and `-l` to mark the source files as Latin-1 encoded.

Explanation
-----------
We have two expected cases, keeping in mind that UTF-8 output is desired.

1. Input files are UTF-8 encoded: decoding and then re-encoding with Latin-1 will most likely leave the original UTF-8 encoding intact.
2. Input files are Latin-1 or Windows-1252 encoded: pandas decodes these files correctly, which means that the strings in the pandas object are already correct. They can then safely be encoded to UTF-8.

We also have one case that cannot be controlled for: if the input is encoded in neither of the expected encodings, Latin-1 decoding and encoding will probably also leave this encoding intact.

Since, at the moment, pandas StataReader decoding is hard-coded to Latin-1, one Latin-1 decode/encode cycle is necessary. Offering more flexibility in the input encoding would require rereading the output in the correct encoding and writing it back to UTF-8, which just complicates the process and slows it down. Input should generally be kept UTF-8 encoded.
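The two expected cases can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `normalize_label` and its `source_is_latin1` parameter are hypothetical names standing in for whatever the `--latin1` flag controls, and the reader is simulated with plain `bytes.decode("latin-1")`.

```python
def normalize_label(text: str, source_is_latin1: bool = False) -> str:
    """Return a correct string from text that was decoded as Latin-1 (hypothetical helper)."""
    if source_is_latin1:
        # Case 2: the Latin-1 decode was already correct, nothing to undo.
        return text
    # Case 1: recover the original bytes, then decode them as UTF-8.
    return text.encode("latin-1").decode("utf-8")


original = "unzulässige Mehrfachantwort"

# Case 1: a UTF-8 source read with a Latin-1 decoder.
read_from_utf8 = original.encode("utf-8").decode("latin-1")

# Case 2: a Latin-1 source read with a Latin-1 decoder.
read_from_latin1 = original.encode("latin-1").decode("latin-1")

# The Latin-1 cycle is lossless because every byte value maps to exactly one code point.
every_byte = bytes(range(256))
round_tripped = every_byte.decode("latin-1").encode("latin-1")

print(normalize_label(read_from_utf8))
print(normalize_label(read_from_latin1, source_is_latin1=True))
```

Note that in this sketch case 1 raises `UnicodeDecodeError` if the recovered bytes are not valid UTF-8, which is what would happen if a Latin-1 file were processed without the flag.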


hansendx assigned mpahl Sep 24, 2019

hansendx added the bug label Sep 24, 2019

hansendx mentioned this issue Sep 24, 2019

Table "Label translations" does not show labels from the dataset ddionrails/ddionrails#68

Closed

hansendx mentioned this issue Sep 26, 2019

Large rewrite might be necessary #52

Open

mpahl mentioned this issue Oct 1, 2019

Categories are read incorrectly #51

Open

mpahl added a commit that referenced this issue Oct 1, 2019

fix issue for german umlauts, see #48

e97dbf5

mpahl mentioned this issue Nov 18, 2019

Fix issue for german umlauts #53

Merged

hansendx pushed a commit that referenced this issue Nov 19, 2019

fix issue for german umlauts, see #48

d9aecd1

hansendx pushed a commit that referenced this issue Nov 19, 2019

fix issue for german umlauts, see #48

74e9f62

hansendx closed this as completed in 0cdef1b Dec 6, 2019
