
Rework quantile iteration logic #67


Merged: 4 commits, Oct 18, 2017. The diff below shows changes from 2 commits.
157 changes: 140 additions & 17 deletions examples/cli.rs
@@ -3,17 +3,20 @@
extern crate hdrsample;
extern crate clap;

use std::io::BufRead;
use std::io;
use std::io::{Write, BufRead};
use std::fmt::Display;

use clap::{App, Arg, SubCommand};

use hdrsample::Histogram;
use hdrsample::serialization::{V2Serializer, V2DeflateSerializer};
use hdrsample::{Histogram, RecordError};
use hdrsample::serialization::{V2Serializer, V2SerializeError, V2DeflateSerializer, V2DeflateSerializeError, Deserializer, DeserializeError};

fn main() {
let default_max = format!("{}", u64::max_value());
let matches = App::new("hdrsample cli")
.subcommand(SubCommand::with_name("serialize")
.about("Transform number-per-line input from stdin into a serialized histogram on stdout")
.arg(Arg::with_name("min")
.long("min")
.help("Minimum discernible value")
@@ -37,8 +40,26 @@ fn main() {
.short("r")
.long("resize")
.help("Enable auto resize")))
.subcommand(SubCommand::with_name("iter-quantiles")
.about("Display quantiles to stdout from serialized histogram stdin")
.arg(Arg::with_name("ticks")
.short("t")
.long("ticks-per-half")
.takes_value(true)
.required(true)
.help("Ticks per half distance"))
.arg(Arg::with_name("quantile-precision")
.long("quantile-precision")
.takes_value(true)
.default_value("5")))
.get_matches();

let stdin = std::io::stdin();
let stdin_handle = stdin.lock();
Review comment (collaborator): Any particular reason you don't just want `let stdin = std::io::stdin().lock();`?

Review comment (author): The compiler gets grumpy about the lifetime of stdin() if I remove the intermediate let. Perhaps there is a better way though?

Review comment (collaborator): Ah, this is at least a little nicer:

    let stdin = std::io::stdin();
    let stdin = stdin.lock();
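
A minimal standalone sketch of the lifetime issue discussed above (reflecting the compiler behavior at the time of this PR; newer toolchains may accept the one-liner):

    use std::io::BufRead;

    fn main() {
        // Fails to compile on the toolchain discussed above: the temporary `Stdin` returned
        // by `stdin()` is dropped at the end of the statement while the lock still borrows it.
        // let stdin = std::io::stdin().lock();

        // Works: the `Stdin` handle is bound to a named variable that outlives the lock.
        let stdin = std::io::stdin();
        let stdin = stdin.lock();

        for line in stdin.lines() {
            println!("{}", line.expect("line should be readable"));
        }
    }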


let stdout = std::io::stdout();
let stdout_handle = stdout.lock();

match matches.subcommand_name() {
Some("serialize") => {
let sub_matches = matches.subcommand_matches("serialize").unwrap();
@@ -52,28 +73,130 @@ fn main() {
h.auto(true);
}

serialize(h, sub_matches.is_present("compression"));
},
serialize(stdin_handle, stdout_handle, h, sub_matches.is_present("compression"))
}
Some("iter-quantiles") => {
let sub_matches = matches.subcommand_matches("iter-quantiles").unwrap();
let ticks_per_half = sub_matches.value_of("ticks").unwrap().parse().unwrap();
let quantile_precision = sub_matches.value_of("quantile-precision").unwrap().parse().unwrap();
quantiles(stdin_handle, stdout_handle, quantile_precision, ticks_per_half)
}
_ => unreachable!()
}
}.expect("Subcommand failed")
}

fn serialize(mut h: Histogram<u64>, compression: bool) {
let stdin = std::io::stdin();
let stdin_handle = stdin.lock();

for num in stdin_handle.lines()
/// Read numbers, one from each line, from stdin and output the resulting serialized histogram.
fn serialize<R: BufRead, W: Write>(reader: R, mut writer: W, mut h: Histogram<u64>, compression: bool) -> Result<(), CliError> {
for num in reader.lines()
.map(|l| l.expect("Should be able to read stdin"))
.map(|s| s.parse().expect("Each line must be a u64")) {
h.record(num).unwrap();
h.record(num)?;
}

let stdout = std::io::stdout();
let mut stdout_handle = stdout.lock();

if compression {
V2DeflateSerializer::new().serialize(&h, &mut stdout_handle).unwrap();
V2DeflateSerializer::new().serialize(&h, &mut writer)?;
} else {
V2Serializer::new().serialize(&h, &mut stdout_handle).unwrap();
V2Serializer::new().serialize(&h, &mut writer)?;
}

Ok(())
}
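
One motivation for threading `reader` and `writer` through these helpers rather than locking stdin/stdout inside them is that they become testable with in-memory I/O. A hedged sketch of such a use (this helper is hypothetical, not part of the PR):

    // Hypothetical usage, not part of the PR: run `serialize` over an in-memory reader and
    // writer instead of stdin/stdout.
    fn serialize_in_memory() -> Result<Vec<u8>, CliError> {
        let input: &[u8] = b"1\n2\n3\n"; // &[u8] implements BufRead
        let mut out = Vec::new();        // Vec<u8> implements Write
        let h = Histogram::<u64>::new_with_bounds(1, u64::max_value(), 3)
            .expect("histogram bounds should be valid");
        serialize(input, &mut out, h, false)?;
        Ok(out)
    }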

/// Output histogram data in a format similar to the Java impl's
/// `AbstractHistogram#outputPercentileDistribution`.
fn quantiles<R: BufRead, W: Write>(mut reader: R, mut writer: W, quantile_precision: usize, ticks_per_half: u32) -> Result<(), CliError> {
let hist: Histogram<u64> = Deserializer::new().deserialize(&mut reader)?;

writer.write_all(
format!(
"{:>12} {:>quantile_precision$} {:>10} {:>14}\n\n",
"Value",
"Quantile",
"TotalCount",
"1/(1-Quantile)",
quantile_precision = quantile_precision + 2 // + 2 from leading "0." for numbers
).as_ref(),
)?;
let mut sum = 0;
for v in hist.iter_quantiles(ticks_per_half) {
sum += v.count_since_last_iteration();
if v.quantile() < 1.0 {
writer.write_all(
format!(
"{:12} {:1.*} {:10} {:14.2}\n",
v.value(),
quantile_precision,
v.quantile(),
sum,
1_f64 / (1_f64 - v.quantile())
).as_ref(),
)?;
} else {
writer.write_all(
format!(
"{:12} {:1.*} {:10} {:>14}\n",
v.value(),
quantile_precision,
v.quantile(),
sum,
"∞"
).as_ref(),
)?;
}
}

fn write_extra_data<T1: Display, T2: Display, W: Write>(
writer: &mut W, label1: &str, data1: T1, label2: &str, data2: T2) -> Result<(), io::Error> {
writer.write_all(format!("#[{:10} = {:12.2}, {:14} = {:12.2}]\n",
label1, data1, label2, data2).as_ref())
}

write_extra_data(&mut writer, "Mean", hist.mean(), "StdDeviation", hist.stdev())?;
write_extra_data(&mut writer, "Max", hist.max(), "Total count", hist.count())?;
write_extra_data(&mut writer, "Buckets", hist.buckets(), "SubBuckets", hist.len())?;

Ok(())
}


// A handy way to enable ? use in subcommands by mapping common errors.
// Normally I frown on excessive use of From as it's too "magic", but in the limited confines of
// subcommands, the convenience seems worth it.
#[derive(Debug)]
enum CliError {
IoError(io::Error),
HistogramSerializeError(V2SerializeError),
HistogramSerializeCompressedError(V2DeflateSerializeError),
HistogramDeserializeError(DeserializeError),
HistogramRecordError(RecordError)
}

impl From<io::Error> for CliError {
fn from(e: io::Error) -> Self {
CliError::IoError(e)
}
}

impl From<V2SerializeError> for CliError {
fn from(e: V2SerializeError) -> Self {
CliError::HistogramSerializeError(e)
}
}

impl From<V2DeflateSerializeError> for CliError {
fn from(e: V2DeflateSerializeError) -> Self {
CliError::HistogramSerializeCompressedError(e)
}
}

impl From<RecordError> for CliError {
fn from(e: RecordError) -> Self {
CliError::HistogramRecordError(e)
}
}

impl From<DeserializeError> for CliError {
fn from(e: DeserializeError) -> Self {
CliError::HistogramDeserializeError(e)
}
}
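
For illustration, a sketch of what these From impls buy in practice: a single function can mix serialization and deserialization calls, and `?` converts each library error into CliError. The helper below is hypothetical, not code from this PR:

    // Hypothetical helper, not part of the PR: serialize then deserialize a histogram,
    // letting `?` map V2SerializeError and DeserializeError into CliError via From.
    fn roundtrip(h: &Histogram<u64>) -> Result<Histogram<u64>, CliError> {
        let mut buf = Vec::new();
        V2Serializer::new().serialize(h, &mut buf)?;
        let restored = Deserializer::new().deserialize(&mut &buf[..])?;
        Ok(restored)
    }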
7 changes: 5 additions & 2 deletions src/iterators/mod.rs
@@ -142,7 +142,9 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
return None;
}

// have we yielded all non-zeros in the histogram?
// TODO should check if we've reached max, not count, to avoid early termination
// on histograms with very large counts whose total would exceed u64::max_value()
Review comment (collaborator): Yeah, that's a good point.

Review comment (collaborator): Though doesn't running_total have the same issue?

Review comment (author): It absolutely does, but we could limit the damage to only saturating total_count_to_index and still continue to iterate until we've reached the max. I've got another branch I'm working on that does this.
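
A minimal sketch of the saturating approach mentioned in that comment (illustrative only; the names below are assumptions, not this crate's fields):

    // Illustrative only: accumulate with saturating_add so a histogram whose total count
    // exceeds u64::MAX clamps instead of wrapping, and iteration can continue until the
    // maximum recorded value is reached.
    struct RunningTotal {
        total_count_to_index: u64,
    }

    impl RunningTotal {
        fn add_count(&mut self, count: u64) {
            self.total_count_to_index = self.total_count_to_index.saturating_add(count);
        }
    }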

// Have we yielded all non-zeros in the histogram?
let total = self.hist.count();
if self.prev_total_count == total {
// is the picker done?
@@ -163,7 +165,7 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
// if we've seen all counts, no other counts should be non-zero
if self.total_count_to_index == total {
// TODO this can fail when total count overflows
assert!(count == T::zero());
assert_eq!(count, T::zero());
Review comment (author): Got tired of IntelliJ warning me that this could be `assert_eq!`.

}

// TODO overflow
@@ -182,6 +184,7 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
// exposed to the same value again after yielding. not sure why this is the
// behavior we want, but it's what the original Java implementation dictates.

// TODO count starting at 0 each time we emit a value to be less prone to overflow
self.prev_total_count = self.total_count_to_index;
return Some(val);
}
31 changes: 17 additions & 14 deletions src/iterators/quantile.rs
@@ -7,8 +7,7 @@ pub struct Iter<'a, T: 'a + Counter> {
hist: &'a Histogram<T>,

ticks_per_half_distance: u32,
quantile_to_iterate_to: f64,
reached_last_recorded_value: bool,
quantile_to_iterate_to: f64
}

impl<'a, T: 'a + Counter> Iter<'a, T> {
@@ -21,8 +20,7 @@ impl<'a, T: 'a + Counter> Iter<'a, T> {
Iter {
hist,
ticks_per_half_distance,
quantile_to_iterate_to: 0.0,
reached_last_recorded_value: false,
quantile_to_iterate_to: 0.0
})
}
}
@@ -42,6 +40,13 @@ impl<'a, T: 'a + Counter> PickyIterator<T> for Iter<'a, T> {
return false;
}

if self.quantile_to_iterate_to == 1.0 {
// We incremented to 1.0 just at the point where we finally got to the last non-zero
// bucket. We want to pick this value but not do the math below because it doesn't work
// when quantile >= 1.0.
return true;
}

// The choice to maintain fixed-sized "ticks" in each half-distance to 100% [starting from
// 0%], as opposed to a "tick" size that varies with each interval, was made to make the
// steps easily comprehensible and readable to humans. The resulting quantile steps are
@@ -61,29 +66,27 @@
// slice, traverse half to get to 50%. Then traverse half of the last (second) slice to get
// to 75%, etc.
// Minimum of 0 (1.0/1.0 = 1, log 2 of which is 0) so unsigned cast is safe.
// Won't hit the `inf` case because quantile < 1.0, so this should yield an actual number.
let num_halvings = (1.0 / (1.0 - self.quantile_to_iterate_to)).log2() as u32;
// Calculate the total number of ticks in 0-1 given that half of each slice is tick'd.
// The number of slices is 2 ^ num_halvings, and each slice has two "half distances" to
// tick, so we add an extra power of two to get ticks per whole distance.
// Use u64 math so that there's less risk of overflow with large numbers of ticks and data
// that ends up needing large numbers of halvings.
// TODO calculate the worst case total_ticks and make sure we can't ever overflow here
let total_ticks = (self.ticks_per_half_distance as u64)
.checked_mul(1_u64.checked_shl(num_halvings + 1).expect("too many halvings"))
.expect("too many total ticks");
let increment_size = 1.0 / total_ticks as f64;
self.quantile_to_iterate_to += increment_size;
// Unclear if it's possible for adding a very small increment to 0.999999... to yield > 1.0
// but let's just be safe since FP is weird and this code is not likely to be very hot.
self.quantile_to_iterate_to = 1.0_f64.min(self.quantile_to_iterate_to + increment_size);
true
}

fn more(&mut self, _: usize) -> bool {
// We want one additional last step to 100%
if !self.reached_last_recorded_value && self.hist.count() != 0 {
self.quantile_to_iterate_to = 1.0;
self.reached_last_recorded_value = true;
true
} else {
false
}
Review comment (author): In the Java impl, self.quantile_to_iterate_to ends up being what is exposed as the quantile iterated to, rather than calculating (accumulated count) / (total count) at each iteration point as the Rust impl does (and as the Java impl does for all iterators other than percentile). Thus, in the Java impl, you would end up at quantile 0.998... or similar when you reached the last non-zero bucket, and it was aesthetically pleasing to nudge the iterator one more slot forward to get to 1.0.

Review comment (author): Hmm... maybe the Java way is better and we should expose that as well? Java uses getPercentile() vs getPercentileIteratedTo() to let consumers differentiate. We could expose an extra field in IterationValue, or apply type-system shenanigans to have per-iterator value types (<V extends IterationValue>?) or bolted-on extra data (IterationValue<T, V> for some per-iterator associated type V). The upside is that it lets quantile iteration, as in the dump-to-stdout example, show the internal quantile value's small changes, making it clear that forward progress is happening even while the reported value stays the same for a while.

Review comment (collaborator): I think having another value in IterationValue sounds perfectly sensible. Not sure it'll be accessed much, but it doesn't hurt to have it there.

Review comment (author): OK, I'll go ahead and add it. It will make the quantile iteration output a little easier to visually comprehend.
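
A rough sketch of the extra field under discussion (the names are illustrative assumptions, not the API that was ultimately added):

    // Illustrative shape only, not the crate's actual type: carry both the quantile computed
    // from accumulated counts and the quantile the iterator was stepping toward, analogous to
    // Java's getPercentile() vs getPercentileIteratedTo().
    pub struct IterationValueSketch<T> {
        pub value: u64,
        pub quantile: f64,              // count_to_index / total_count
        pub quantile_iterated_to: f64,  // the iterator's internal target
        pub count_since_last_iteration: T,
    }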

// No need to go past the point where cumulative count == total count, because that's
// quantile 1.0 and will be reported as such in the IterationValue, even if
// `quantile_to_iterate_to` is somewhere below 1.0 -- we still got to the final bucket.
false
}
}
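
To make the tick arithmetic concrete, a standalone sketch of the same increment logic (a worked example, not code from this PR): at quantile 0.75, 1/(1 - 0.75) = 4, so num_halvings = 2; with ticks_per_half_distance = 5 there are 5 * 2^(2+1) = 40 ticks across [0, 1], and the next target quantile is 0.75 + 1/40 = 0.775.

    // Standalone sketch mirroring the iterator's increment computation above.
    fn next_quantile_target(ticks_per_half_distance: u32, current: f64) -> f64 {
        if current >= 1.0 {
            return 1.0;
        }
        let num_halvings = (1.0 / (1.0 - current)).log2() as u32;
        let total_ticks = (ticks_per_half_distance as u64)
            .checked_mul(1_u64.checked_shl(num_halvings + 1).expect("too many halvings"))
            .expect("too many total ticks");
        // Clamp at 1.0, matching the iterator's defensive min().
        1.0_f64.min(current + 1.0 / total_ticks as f64)
    }

    fn main() {
        let next = next_quantile_target(5, 0.75);
        assert!((next - 0.775).abs() < 1e-12);
    }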
3 changes: 2 additions & 1 deletion src/lib.rs
@@ -200,6 +200,7 @@ extern crate num_traits as num;

use std::borrow::Borrow;
use std::cmp;
use std::fmt;
use std::ops::{AddAssign, SubAssign};
use num::ToPrimitive;

@@ -218,7 +219,7 @@ const ORIGINAL_MAX: u64 = 0;
/// Partial ordering is used for thresholding, also usually in the context of quantiles.
pub trait Counter
: num::Num + num::ToPrimitive + num::FromPrimitive + num::Saturating + num::CheckedSub
+ num::CheckedAdd + Copy + PartialOrd<Self> {
+ num::CheckedAdd + Copy + PartialOrd<Self> + fmt::Debug {
Review comment (author): This is to allow the use of assert_eq! with a T. Seemed pretty harmless.

/// Counter as a f64.
fn as_f64(&self) -> f64;