Skip to content

Unit test Distribution.Utils.ShortText BinaryId fails #4644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
fgaz opened this issue Jul 29, 2017 · 6 comments
Closed

Unit test Distribution.Utils.ShortText BinaryId fails #4644

fgaz opened this issue Jul 29, 2017 · 6 comments

Comments

@fgaz
Copy link
Member

fgaz commented Jul 29, 2017

https://travis-ci.org/haskell-pushbot/cabal-binaries/builds/258926423

$ Cabal/unit-tests
[snip]
  Distribution.Utils.ShortText
    ShortText Id:                                       OK
      +++ OK, passed 100 tests.
    ShortText Ord:                                      OK
      +++ OK, passed 100 tests.
    ShortText Monoid:                                   OK
      +++ OK, passed 100 tests.
    ShortText BinaryId:                                 FAIL
      *** Failed! Falsifiable (after 40 tests and 5 shrinks): 
      "\65534"
      Use --quickcheck-replay=551066 to reproduce.

ping @ezyang

@ezyang
Copy link
Contributor

ezyang commented Jul 29, 2017

CC @hvr who introduced this test in 993d20a

@fgaz
Copy link
Member Author

fgaz commented Jul 29, 2017

Wikipedia says \65534 is U+FFFE <noncharacter-FFFE> not a character.

FFFE and FFFF are not unassigned in the usual sense, but guaranteed not to be a Unicode character at all. They can be used to guess a text's encoding scheme, since any text containing these is by definition not a correctly encoded Unicode text. Unicode's U+FEFF Byte order mark character can be inserted at the beginning of a Unicode text to signal its endianness: a program reading such a text and encountering 0xFFFE would then know that it should switch the byte order for all the following characters.

Whoa.

@hvr
Copy link
Member

hvr commented Jul 30, 2017

I need to look into why a BOM (which btw makes no sense whatsoever for UTF8 encodings) doesn't round-trip properly. Iirc I specifically tested such corner-cases in the implementation of http://hackage.haskell.org/package/text-short

PS: I just noticed this is with the GHC 7.6.3 configuration, so this may be a problem with the legacy fallback...

@hvr
Copy link
Member

hvr commented Jul 30, 2017

After some investigation, the issue is in fact for the String-backed legacy fallback, whose Binary instance relies on the roundtrip property of Distribution.Utils.String.{encode,decode}StringUtf8, which fails for the BOM:

> decodeStringUtf8 ( encodeStringUtf8 "\65534")
"\65533"

because decodeStringUtf8 (imho rightfully) considers a BOM invalid in an UTF8 stream, and maps it to the replacement-character.

@hvr
Copy link
Member

hvr commented Dec 3, 2017

I'll take a stab at harmonizing the decodeStringUtf8 semantics with the more round-tripping friendly ones from text and text-short.

hvr added a commit to hvr/cabal that referenced this issue Dec 3, 2017
This changes `decodeStringUtf8` to not replace U+FFFE and U+FFFF into
U+FFFD, while `encodeStringUtf8` now replaces surrogate pairs
(i.e. code-points U+D800 through U+DFFF which are invalid in UTF-8)
with U+FFFD.

Consequently, `decodeStringUtf8 . encodeStringUtf8` can now properly
round-trip all scalar code-points
(i.e. [U+0000..U+D7FF] ∪ [U+E000..U+10FFFF]).

This should finally address haskell#4644
@hvr
Copy link
Member

hvr commented Dec 4, 2017

I'm confident this one's been fixed via #4928; I ran unit-tests -p BinaryId --quickcheck-tests 999999 compiled for GHC 7.6.3 a few times; and also tried the replay value; everything passed so far.

@hvr hvr closed this as completed Dec 4, 2017
@hvr hvr added this to the 2.2 milestone Dec 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants