perfect_float feature for deserialization accuracy #541


Closed
wants to merge 1 commit into from
8 changes: 8 additions & 0 deletions .travis.yml
@@ -9,10 +9,17 @@ matrix:
- cargo test --features arbitrary_precision
- cargo test --features raw_value
- cargo test --features unbounded_depth
- cargo test --features perfect_float

- rust: stable
- rust: beta

- rust: 1.31.0
script:
- cargo build
- cargo build --features preserve_order
- cargo build --features arbitrary_precision

- rust: 1.36.0
script:
- cargo check --manifest-path tests/crate/Cargo.toml --no-default-features --features alloc
@@ -27,3 +34,4 @@ script:
- cargo build
- cargo build --features preserve_order
- cargo build --features arbitrary_precision
- cargo build --features perfect_float
8 changes: 8 additions & 0 deletions Cargo.toml
@@ -20,6 +20,7 @@ appveyor = { repository = "serde-rs/json" }
serde = { version = "1.0.60", default-features = false }
indexmap = { version = "1.2", optional = true }
itoa = { version = "0.4.3", default-features = false }
lexical-core = { version = "0.7", optional = true, features = ["format"] }
ryu = "1.0"

[dev-dependencies]
@@ -63,6 +64,13 @@ preserve_order = ["indexmap"]
# written back to a JSON string without loss of precision.
arbitrary_precision = []

# Deserialize floats with perfect accuracy and consistent rounding. This
# changes the implementation for parsing both floats and integers. This
# implementation is slower than the default and contains a lookup table that
# increases artifact size. It does not affect serialization. This
# feature has no effect if arbitrary_precision is enabled.
perfect_float = ["lexical-core"]

I would recommend using perfect_float by default. @dtolnay will ultimately decide, however. Correctness should always be favored by default, and performance at the cost of correctness should be opt-in, not opt-out.
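To make the trade-off concrete: parsers that evaluate `significand * 10^exponent` in f64 arithmetic round twice, and the combined error can select the wrong neighboring float. A minimal Python sketch of the effect (illustrative only, not the serde_json code path):

```python
# Naive parse of "0.3": significand 3, decimal exponent -1,
# evaluated in double arithmetic. Each operation rounds, and the
# combined error lands one ulp above the correctly rounded value.
naive = 3.0 * (10.0 ** -1)

# Correctly rounded parse: the nearest double to the decimal 0.3.
correct = float("0.3")

assert naive != correct
assert naive == 0.30000000000000004
```

A correctly rounded parser always returns the nearest representable float to the decimal input, so results are identical across platforms and parser versions.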


# Provide a RawValue type that can hold unprocessed JSON during deserialization.
raw_value = []

120 changes: 120 additions & 0 deletions src/de.rs
@@ -90,6 +90,7 @@ impl<'a> Deserializer<read::StrRead<'a>> {
}
}

#[cfg(not(feature = "perfect_float"))]
macro_rules! overflow {
($a:ident * 10 + $b:ident, $c:expr) => {
$a >= $c / 10 && ($a > $c / 10 || $b > $c % 10)
@@ -304,6 +305,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
self.fix_position(err)
}

#[cfg(not(feature = "perfect_float"))]
fn deserialize_prim_number<V>(&mut self, visitor: V) -> Result<V::Value>
where
V: de::Visitor<'de>,
@@ -330,6 +332,33 @@ }
}
}

#[cfg(feature = "perfect_float")]
fn deserialize_prim_number<V>(&mut self, visitor: V) -> Result<V::Value>
where
V: de::Visitor<'de>,
{
let peek = match tri!(self.parse_whitespace()) {
Some(b) => b,
None => {
return Err(self.peek_error(ErrorCode::EofWhileParsingValue));
}
};

let value = match peek {
b'-' => {
self.eat_char();
tri!(self.parse_any_number(false)).visit(visitor)
}
b'0'..=b'9' => tri!(self.parse_any_number(true)).visit(visitor),
_ => Err(self.peek_invalid_type(&visitor)),
};

match value {
Ok(value) => Ok(value),
Err(err) => Err(self.fix_position(err)),
}
}

serde_if_integer128! {
fn scan_integer128(&mut self, buf: &mut String) -> Result<()> {
match tri!(self.next_char_or_null()) {
@@ -380,6 +409,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
Ok(())
}

#[cfg(not(feature = "perfect_float"))]
fn parse_integer(&mut self, positive: bool) -> Result<ParserNumber> {
let next = match tri!(self.next_char()) {
Some(b) => b,
@@ -428,6 +458,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
}
}

#[cfg(not(feature = "perfect_float"))]
fn parse_long_integer(
&mut self,
positive: bool,
@@ -455,6 +486,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
}
}

#[cfg(not(feature = "perfect_float"))]
fn parse_number(&mut self, positive: bool, significand: u64) -> Result<ParserNumber> {
Ok(match tri!(self.peek_or_null()) {
b'.' => ParserNumber::F64(tri!(self.parse_decimal(positive, significand, 0))),
@@ -476,6 +508,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
})
}

#[cfg(not(feature = "perfect_float"))]
fn parse_decimal(
&mut self,
positive: bool,
@@ -516,6 +549,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
}
}

#[cfg(not(feature = "perfect_float"))]
fn parse_exponent(
&mut self,
positive: bool,
@@ -573,6 +607,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {

// This cold code should not be inlined into the middle of the hot
// exponent-parsing loop above.
#[cfg(not(feature = "perfect_float"))]
#[cold]
#[inline(never)]
fn parse_exponent_overflow(
@@ -626,6 +661,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
}

#[cfg(not(feature = "arbitrary_precision"))]
#[cfg(not(feature = "perfect_float"))]
fn parse_any_number(&mut self, positive: bool) -> Result<ParserNumber> {
self.parse_integer(positive)
}
@@ -640,6 +676,88 @@ impl<'de, R: Read<'de>> Deserializer<R> {
Ok(ParserNumber::String(buf))
}

#[cfg(feature = "perfect_float")]
fn parse_any_number(&mut self, positive: bool) -> Result<ParserNumber> {
let mut float = false;
self.scratch.clear();
if !positive {
self.scratch.push(b'-');
}

loop {
match self.peek_or_null()? {
c @ b'0'..=b'9' => {
self.eat_char();
self.scratch.push(c);
}
c @ b'e' | c @ b'E' | c @ b'+' | c @ b'-' | c @ b'.' => {
self.eat_char();
self.scratch.push(c);
float = true;
}
_ => break,
}
}
if self.scratch.len() > 20 {
float = true;
}
match (float, positive) {
(true, _) => {
match lexical_core::parse_format::<f64>(
&self.scratch[..],
lexical_core::NumberFormat::JSON,
) {
Ok(val) => {
if val.is_infinite() {
Err(self.error(ErrorCode::NumberOutOfRange))
} else {
Ok(ParserNumber::F64(val))
}
}
Err(err) => match err.code {
lexical_core::ErrorCode::ExponentWithoutFraction
| lexical_core::ErrorCode::MissingExponentSign
| lexical_core::ErrorCode::EmptyFraction
| lexical_core::ErrorCode::EmptyMantissa
| lexical_core::ErrorCode::EmptyExponent
| lexical_core::ErrorCode::MissingMantissaSign => {
Err(self.error(ErrorCode::EofWhileParsingValue))
}
_ => Err(self.error(ErrorCode::InvalidNumber)),
},
}
}
(false, false) => {
match lexical_core::parse_format::<i64>(
&self.scratch[..],
lexical_core::NumberFormat::JSON,
) {
Ok(val) => Ok(ParserNumber::I64(val)),
Err(err) => match err.code {
lexical_core::ErrorCode::Empty => {
Err(self.error(ErrorCode::EofWhileParsingValue))
}
_ => Err(self.error(ErrorCode::InvalidNumber)),
},
}
}
(false, true) => {
match lexical_core::parse_format::<u64>(
&self.scratch[..],
lexical_core::NumberFormat::JSON,
) {
Ok(val) => Ok(ParserNumber::U64(val)),
Err(err) => match err.code {
lexical_core::ErrorCode::Empty => {
Err(self.error(ErrorCode::EofWhileParsingValue))
}
_ => Err(self.error(ErrorCode::InvalidNumber)),
},
}
}
}
}

#[cfg(feature = "arbitrary_precision")]
fn scan_or_eof(&mut self, buf: &mut String) -> Result<u8> {
match tri!(self.next_char()) {
@@ -742,6 +860,7 @@ impl<'de, R: Read<'de>> Deserializer<R> {
Ok(())
}

#[cfg(not(feature = "perfect_float"))]
fn f64_from_parts(
&mut self,
positive: bool,
@@ -1022,6 +1141,7 @@ impl FromStr for Number {
}
}

#[cfg(not(feature = "perfect_float"))]
static POW10: [f64; 309] = [
1e000, 1e001, 1e002, 1e003, 1e004, 1e005, 1e006, 1e007, 1e008, 1e009, //
1e010, 1e011, 1e012, 1e013, 1e014, 1e015, 1e016, 1e017, 1e018, 1e019, //
76 changes: 58 additions & 18 deletions tests/test.rs
@@ -732,19 +732,9 @@ fn test_parse_number_errors() {
("0x80", "trailing characters at line 1 column 2"),
("\\0", "expected value at line 1 column 1"),
("1.", "EOF while parsing a value at line 1 column 2"),
("1.a", "invalid number at line 1 column 3"),
("1.e1", "invalid number at line 1 column 3"),
("1e", "EOF while parsing a value at line 1 column 2"),
("1e+", "EOF while parsing a value at line 1 column 3"),
("1a", "trailing characters at line 1 column 2"),
(
"100e777777777777777777777777777",
"number out of range at line 1 column 14",
),
(
"-100e777777777777777777777777777",
"number out of range at line 1 column 15",
),
(
"1000000000000000000000000000000000000000000000000000000000000\
000000000000000000000000000000000000000000000000000000000000\
@@ -773,6 +763,38 @@ fn test_parse_number_errors() {
"number out of range at line 1 column 303",
),
]);

// the error messages here vary by implementation
test_parse_err::<f64>(&[
(
"1.a",
#[cfg(not(feature = "perfect_float"))]
"invalid number at line 1 column 3",
#[cfg(feature = "perfect_float")]
"EOF while parsing a value at line 1 column 2",
),
(
"1.e1",
#[cfg(not(feature = "perfect_float"))]
"invalid number at line 1 column 3",
#[cfg(feature = "perfect_float")]
"EOF while parsing a value at line 1 column 4",
),
(
"100e777777777777777777777777777",
#[cfg(not(feature = "perfect_float"))]
"number out of range at line 1 column 14",
#[cfg(feature = "perfect_float")]
"number out of range at line 1 column 31",
),
(
"-100e777777777777777777777777777",
#[cfg(not(feature = "perfect_float"))]
"number out of range at line 1 column 15",
#[cfg(feature = "perfect_float")]
"number out of range at line 1 column 32",
),
]);
}

#[test]
@@ -840,14 +862,18 @@ fn test_parse_f64() {
("0.00e+00", 0.0),
("0.00e-00", 0.0),
("3.5E-2147483647", 0.0),
(
&format!("{}", (i64::MIN as f64) - 1.0),
(i64::MIN as f64) - 1.0,
),
(
&format!("{}", (u64::MAX as f64) + 1.0),
(u64::MAX as f64) + 1.0,
),
// (i64::MIN as f64) - 1.0
("-9223372036854776000.0", -9223372036854776000.0),
// (u64::MAX as f64) + 1.0
("18446744073709552000.0", 18446744073709552000.0),
(&format!("{}", f64::EPSILON), f64::EPSILON),
(
"0.0000000000000000000000000000000000000000000000000123e50",
@@ -900,6 +918,28 @@ fn test_parse_f64() {
1e308,
),
]);

#[cfg(not(feature = "perfect_float"))]
How come these don't pass with perfect_float?


@dtolnay The issue is that the value in the test isn't the correctly rounded result. It is a close-to-halfway representation, and the correct bit pattern is actually:

0100001111110000000000000000000000000000000000000000000000000000

The annotated bit pattern is:

Sign Exponent Significand
0 10000111111 0000000000000000000000000000000000000000000000000000

As you can see, the least significant bit of the significand is 0, so for anything at or below the halfway point (a bit set just past the significand), the default IEEE 754 round-nearest-ties-even scheme rounds down to the even neighbor. The correct value is 18446744073709551616.0.
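The bit pattern above can be checked without numpy; a small sketch using only the standard `struct` module, assuming IEEE 754 binary64 doubles:

```python
import struct

def to_bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# "18446744073709552000.0" sits below the halfway point between
# 2^64 and the next double up, so round-nearest-ties-even maps it
# down to exactly 2^64 = 18446744073709551616.0.
parsed = float("18446744073709552000.0")
assert parsed == 2.0 ** 64

# Sign 0, biased exponent 0b10000111111 (1087), significand all zeros:
assert to_bits(parsed) == 0x43F0000000000000
```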


To experiment with halfway representations, the following Python snippet (using numpy) can be used:

import numpy as np

class FloatMixin:
    '''Mixin for floating-point methods.'''

    def __init__(self, value):
        self.value = self.float_type(value)

    def to_bits(self):
        '''Extract bitwise representation of float.'''
        return np.frombuffer(self.value.tobytes(), dtype=self.int_type)[0]

    @classmethod
    def from_bits(cls, value):
        '''Construct a float from its bitwise representation.'''
        return cls(np.frombuffer(value.tobytes(), dtype=cls.float_type)[0])

    def to_hex(self):
        '''Convert double to hex.'''
        return '{0:0{1}x}'.format(self.to_bits(), self.value.itemsize * 2)

    def is_denormal(self):
        '''Returns true if the float is a denormal.'''
        return self.to_bits() & self.EXPONENT_MASK == 0

    def is_special(self):
        '''Returns true if the float is NaN or Infinite.'''
        return self.to_bits() & self.EXPONENT_MASK == self.EXPONENT_MASK

    def is_nan(self):
        '''Returns true if the float is NaN.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK != 0

    def is_inf(self):
        '''Returns true if the float is Infinite.'''
        return self.is_special() and self.to_bits() & self.MANTISSA_MASK == 0

    def exponent(self):
        '''Get exponent component from the float.'''

        if self.is_denormal():
            return self.DENORMAL_EXPONENT
        bits = self.to_bits()
        exp_bits = bits & self.EXPONENT_MASK
        biased_e = np.int32(exp_bits >> self.int_type(self.MANTISSA_SIZE))
        return biased_e - self.EXPONENT_BIAS

    def mantissa(self):
        '''Get mantissa component from the float.'''

        bits = self.to_bits()
        s = bits & self.MANTISSA_MASK
        if not self.is_denormal():
            return s + self.HIDDEN_BIT_MASK
        return s


class Float32(FloatMixin):
    '''Wrapper around a 32-bit floating point value.'''

    SIGN_MASK           = np.uint32(0x80000000)
    EXPONENT_MASK       = np.uint32(0x7F800000)
    HIDDEN_BIT_MASK     = np.uint32(0x00800000)
    MANTISSA_MASK       = np.uint32(0x007FFFFF)
    MANTISSA_SIZE       = np.int32(23)
    EXPONENT_BIAS       = np.int32(127 + MANTISSA_SIZE)
    DENORMAL_EXPONENT   = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float32
    int_type = np.uint32


class Float64(FloatMixin):
    '''Wrapper around a 64-bit floating point value.'''

    SIGN_MASK           = np.uint64(0x8000000000000000)
    EXPONENT_MASK       = np.uint64(0x7FF0000000000000)
    HIDDEN_BIT_MASK     = np.uint64(0x0010000000000000)
    MANTISSA_MASK       = np.uint64(0x000FFFFFFFFFFFFF)
    MANTISSA_SIZE       = np.int32(52)
    EXPONENT_BIAS       = np.int32(1023 + MANTISSA_SIZE)
    DENORMAL_EXPONENT   = np.int32(1 - EXPONENT_BIAS)
    float_type = np.float64
    int_type = np.uint64

Now, to calculate b+1, the next float up from a given float b, and b+h, the halfway point, do:

b = Float64(18446744073709551616.0)
b1 = Float64.from_bits(b.to_bits() + Float64.int_type(1))
bh = (int(b.value) + int(b1.value)) // 2

In this case, for b of 18446744073709551616.0, b+1 is 18446744073709555712.0, and b+h is 18446744073709553664.0. b+h cannot be represented as an exact floating-point number, but we can confirm it is accurate with a few tests:

>>> float("18446744073709553664.0")  # b+h
18446744073709551616.0
>>> float("18446744073709553665.0")  # above b+h
18446744073709555712.0


@dtolnay An accurate float parser does not mean every decimal representation can be represented exactly as a floating-point number; it does mean that the closest representation is always used, and that these conversions are stable.

I hope my explanation makes sense?


I understand that in general not every decimal representation can be exactly represented as a floating-point number; that is obvious to me. The part I don't think you answered is why these particular values in the test no longer round trip.

My expectation for f64::to_string is that it produces a decimal representation (hopefully with as few significant digits as possible, but this is not necessary) which is mathematically closer to the original input float than to any other float.

My expectation for f64::from_str is that it produces the float which is mathematically closest to the input digits.

This test was supposed to test that pair of expectations. Of the two expectations, is one or both untrue of the standard library? Is one or both untrue of the implementation in this PR?

For the standard library my understanding is that the following line succeeds for any input float n, and that's certainly true for the two specific floats in this test.

assert_eq!(n.to_string().parse::<f64>().unwrap(), n);
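The same round-trip expectation can be sketched in Python, whose `repr` gives the shortest decimal string that parses back to the bit-identical float (an illustration of the property, not the serde_json code):

```python
# Floats from the discussion above; the long literals round to
# (i64::MIN as f64) - 1.0 and (u64::MAX as f64) + 1.0 respectively.
values = [
    -9223372036854775809.0,  # rounds to -2^63
    18446744073709551617.0,  # rounds to 2^64
    31.245270191439438,
    121.48791951161945,
]

for n in values:
    # The shortest decimal form must parse back to the same float.
    assert float(repr(n)) == n
```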

@Alexhuszagh Alexhuszagh Feb 7, 2020

@dtolnay Oh I see what you mean, my mistake. However, the actual results seem to work after adding the test back in anyway:

    #[cfg(not(feature = "arbitrary_precision"))]
    #[cfg(feature = "perfect_float")]
    test_parse_ok(vec![
        ("18446744073709552000.0", 18446744073709552000.0),
        ("31.245270191439438", 31.245270191439438),
        ("121.48791951161945", 121.48791951161945),
    ]);

I'm not sure why it originally failed; I have extensive tests that ensure this works in my codebase (and I test on ~20 different architectures for each release).

#[cfg(not(feature = "arbitrary_precision"))]
test_parse_ok(vec![
(
// "-9223372036854776000", note it formats with no trailing ".0"
&format!("{}", (i64::MIN as f64) - 1.0),
(i64::MIN as f64) - 1.0,
),
(
// "18446744073709552000", note it formats with no trailing ".0"
&format!("{}", (u64::MAX as f64) + 1.0),
(u64::MAX as f64) + 1.0,
),
]);

#[cfg(not(feature = "arbitrary_precision"))]
#[cfg(feature = "perfect_float")]
test_parse_ok(vec![
("31.245270191439438", 31.245270191439438),
("121.48791951161945", 121.48791951161945),
]);
}

#[test]