German umlauts in labels are decoded incorrectly #48

Closed
hansendx opened this issue Sep 24, 2019 · 3 comments
Labels
bug Something isn't working

Comments

@hansendx
Collaborator

Maybe this is related: pandas-dev/pandas#21244

@hansendx
Collaborator Author

hansendx commented Sep 24, 2019

Example:

{'kidgeb': {
    -6: '[-6] Fragebogenversion mit geänderter Filterführung',
    -5: '[-5] nicht im Fragebogen enthalten',
    -4: '[-4] unzulässige Mehrfachantwort',
    -3: '[-3] nicht valide',
    -2: '[-2] trifft nicht zu',
    -1: '[-1] keine Angabe'
  }
}

It seems the labels are decoded incorrectly by pandas (UTF-8 bytes read as Latin-1):

"FilterfÃ¼hrung".encode("latin1").decode("utf8") == 'Filterführung'
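The round trip can be checked in isolation. The following is a minimal sketch, not code from this project: it assumes the labels were stored as UTF-8 bytes and decoded as Latin-1 somewhere along the way, and the variable names are purely illustrative.

```python
# Illustrative only: reproduce the garbling and undo it.
original = "Filterführung"

# UTF-8 bytes interpreted as Latin-1 yield one wrong character per byte,
# so the two-byte sequence for "ü" becomes two separate characters.
garbled = original.encode("utf-8").decode("latin-1")

# Latin-1 maps every byte value to exactly one code point,
# which is why the wrong decode can be reversed without loss.
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(restored == original)
```

Because the garbled string is one character longer than the original, a length check is a cheap way to notice that a label went through this wrong decode.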

@kwenzig
Member

kwenzig commented Sep 25, 2019

Stata switched its file format to UTF-8 encoding starting with Stata 14 (which is dta format 118; cf. https://www.stata.com/help.cgi?dtaversion and https://youtu.be/o68ZLxjw-1o). We are using the file format from Stata 12 (dta 115). One would have to check which versions can be imported by pandas.
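To see which format a given file uses, one could inspect its first bytes. This is a rough sketch under stated assumptions, not part of this project or of pandas: it assumes that older files carry the format number in their first byte and that newer files start with an XML-like `<stata_dta><header><release>NNN</release>` header. Both function names are hypothetical, and the layout should be verified against Stata's own dta documentation.

```python
import re


def dta_format_version(header: bytes) -> int:
    """Return the dta format number found in the first bytes of a file."""
    if not header:
        raise ValueError("empty header")
    if header.startswith(b"<stata_dta>"):
        # Assumed newer layout: <stata_dta><header><release>118</release>...
        match = re.search(rb"<release>(\d+)</release>", header)
        if match is None:
            raise ValueError("release tag not found in header")
        return int(match.group(1))
    # Assumed older layout: the first byte is the format number (e.g. 115).
    return header[0]


def strings_are_utf8(version: int) -> bool:
    # Per the comment above, UTF-8 starts with dta format 118 (Stata 14).
    return version >= 118
```

A caller would pass the beginning of the file, for example `dta_format_version(open(path, "rb").read(64))`, where `path` is whatever dta file is being converted.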

@mpahl
Contributor

mpahl commented Oct 1, 2019

The error is in write_json.py.
If you add `ensure_ascii=False` to the json.dump call on line 292 of write_json.py, the output should be correct.

    with open(filename, "w") as json_file:
        json.dump(stat, json_file, indent=2, ensure_ascii=False)
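The effect of the flag can be seen without touching any file. The snippet below is a small sketch with made-up data modelled on the example above; it is not taken from write_json.py. With the default `ensure_ascii=True`, non-ASCII characters are written as `\uXXXX` escapes, which is still valid JSON but hard to read.

```python
import json

# Sample data modelled on the labels in this issue (illustrative only).
stat = {"kidgeb": {"-6": "[-6] Fragebogenversion mit geänderter Filterführung"}}

escaped = json.dumps(stat, indent=2)                       # default: ensure_ascii=True
readable = json.dumps(stat, indent=2, ensure_ascii=False)  # keeps umlauts as-is

print("\\u00e4" in escaped)   # "ä" appears as an escape sequence
print("ä" in readable)        # "ä" appears literally

# Both variants parse back to the same data.
print(json.loads(escaped) == json.loads(readable))
```

When writing with `ensure_ascii=False`, it may be worth passing `encoding="utf-8"` to `open()` as well, so the result does not depend on the platform's default encoding.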

mpahl added a commit that referenced this issue Oct 1, 2019
hansendx pushed a commit that referenced this issue Nov 19, 2019
hansendx added a commit that referenced this issue Nov 19, 2019
Problem
-------

Pandas decodes data with Latin-1,
which leads to problems with German umlauts.

Solution
--------

* Data is now encoded back to Latin-1
  to minimize encoding and decoding errors.
* Add the flags `--latin1` and `-l` to indicate that the source files are
  Latin-1 encoded.
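How such a flag might be declared is sketched below with argparse. Only the flag names `--latin1` and `-l` come from this commit message; the parser setup, the description text and the function name `build_parser` are assumptions made for illustration, not the project's actual command line code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the two flag names are taken from the commit.
    parser = argparse.ArgumentParser(description="Hypothetical converter CLI")
    parser.add_argument(
        "-l",
        "--latin1",
        action="store_true",
        help="source files are Latin-1 encoded",
    )
    return parser


args = build_parser().parse_args(["--latin1"])
print(args.latin1)
```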

Explanation
-----------

We have two expected cases, keeping in mind
that UTF-8 output is desired.

1. Input files are UTF-8 encoded: decoding with subsequent encoding
   in Latin-1 will most likely leave the original UTF-8 bytes intact.
2. Input files are Latin-1 or Windows-1252 encoded: pandas decodes these
   files correctly, which means that we are working with correctly decoded
   strings in the pandas object. These can then safely be encoded to UTF-8.
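The two cases can be expressed as a small helper. This is a sketch under the assumptions stated in the list, not the code from the commit: the function name `normalize_label` and the parameter `source_is_latin1` are hypothetical, and falling back to the unchanged text when the round trip fails is a choice made for this example.

```python
def normalize_label(text: str, source_is_latin1: bool = False) -> str:
    """Return the label as it should look before writing UTF-8 output."""
    if source_is_latin1:
        # Case 2: the Latin-1 decode was already correct, nothing to undo.
        return text
    try:
        # Case 1: undo the Latin-1 decode and read the bytes as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The bytes are not valid UTF-8 (or not Latin-1 representable);
        # this sketch keeps the text unchanged in that situation.
        return text
```

Pure ASCII labels such as "keine Angabe" pass through both branches unchanged, since ASCII bytes are identical in Latin-1 and UTF-8.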

We also have one case that cannot be controlled for:

If the input is encoded in neither of the expected encodings,
Latin-1 decoding and encoding will probably also leave this encoding intact.
Since, at the moment, pandas StataReader decoding is
hard-coded to Latin-1, one Latin-1 decode/encode cycle is necessary.
Offering more flexibility in the input encoding would necessitate
rereading the output in the correct encoding and writing it back to
UTF-8, which just complicates the process and slows it down.
Input should generally be kept UTF-8 encoded.
3 participants