Communicate information about filtered data points #493

Open
yurivish opened this issue Aug 10, 2021 · 4 comments
yurivish commented Aug 10, 2021

It would be useful if exploratory plots came with a visual indicator of “discarded data”.

This would improve Plot's capacity for exploratory data analysis by enabling users to become aware of anomalous values that violate their assumptions about the data.

For example, I changed a scale from log to symlog and discovered a bunch of negative values where I wasn’t expecting any.

The data was supposed to be strictly positive and the negative values indicated a processing error, but since the default log scale filtered those data points out I only noticed because I went out of my way to do additional spot checks.

Plot could have made it evident immediately, e.g. with a legend saying something like “100 datapoints not shown”. Even more useful (maybe) would be being able to see a "data pipeline" and how many points are filtered out at each stage.

@Fil observes that some filters use the discarding as a basic mechanism to do their work as intended, so there are subtle questions about what to communicate for this to be a useful signal.

For the exploratory use case I think it makes sense for this to be on by default, since spot-checking every individual assumption manually can get onerous (e.g. checking for null/undefined, zeros where there shouldn’t be any, negative numbers where there shouldn’t be any, values outside of the x/y/color domain, NaN, etc.)
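As a rough sketch of the kind of tally this implies, the snippet below counts why values would be dropped before reaching a log scale. The names `summarizeDiscarded` and `describeDiscarded` are hypothetical helpers made up for illustration; they are not part of Plot, and the set of reasons checked is an assumption.

```javascript
// Hypothetical helper (not part of Plot): tallies why each value would be
// dropped before it reaches a log scale.
function summarizeDiscarded(values) {
  const counts = {nullish: 0, nan: 0, nonPositive: 0, kept: 0};
  for (const v of values) {
    if (v == null) counts.nullish++; // null or undefined
    else if (Number.isNaN(+v)) counts.nan++; // NaN after numeric coercion
    else if (+v <= 0) counts.nonPositive++; // zero or negative: invalid for log
    else counts.kept++;
  }
  return counts;
}

// Hypothetical helper: turns the tally into a legend-style message,
// or an empty string when nothing was dropped.
function describeDiscarded(counts) {
  const dropped = counts.nullish + counts.nan + counts.nonPositive;
  return dropped > 0 ? `${dropped} datapoints not shown` : "";
}

const counts = summarizeDiscarded([1, 10, -5, 0, null, NaN, undefined, 100]);
console.log(counts);
console.log(describeDiscarded(counts));
```

Keeping the reasons separate, rather than reporting a single total, would let a message point at the likely cause, for example negative values under a log scale.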

A separate tool such as a summary table could be used to learn about missing/pathological data in a dataset, but it would still be useful for Plot to flag these issues since they can creep in during downstream processing and plot transformations.

@mbostock mbostock added the enhancement New feature or request label Aug 10, 2021
@Fil Fil self-assigned this Aug 12, 2021
Fil commented Sep 27, 2021

The scale.unknown option can be used to this effect (see examples).

Fil commented Jul 18, 2022

This would now happen, I guess, in the default filter:

index = index.filter(i => filter(value[i]));

However, with each warning we need to indicate a way to fix the situation, and in this case I wouldn't know what to say, in particular because in many charts some data is ignored on purpose.
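One way the count could be captured at that point is sketched below. This assumes a wrapper around the same `index.filter` call; `filterIndex` and the `defined` predicate are hypothetical names for illustration, not Plot internals, and the warning text is only a placeholder since the open question of what advice to give remains.

```javascript
// Hypothetical wrapper around the kind of index filter quoted above:
// returns the kept indexes together with the number that were discarded.
function filterIndex(index, value, filter) {
  const kept = index.filter((i) => filter(value[i]));
  return {index: kept, discarded: index.length - kept.length};
}

// Assumed predicate for illustration: keep values that are neither
// null/undefined nor NaN.
const defined = (x) => x != null && !Number.isNaN(x);

const value = [3, null, NaN, 7, undefined];
const index = [0, 1, 2, 3, 4];
const result = filterIndex(index, value, defined);

if (result.discarded > 0) {
  console.warn(`${result.discarded} of ${index.length} values were filtered out`);
}
```

Returning the count instead of warning inside the filter leaves it to the caller to decide whether the discarding was intentional.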

@eagereyes

As an additional twist on this, it would be great if we could provide informative error messages for two seemingly common cases:

  • somebody gets the capitalization of a key wrong, e.g., city instead of City
  • somebody misspells the name of a key, e.g., delivert instead of delivery

Likely candidates for capitalization errors could be found by comparing the key provided to all the keys in the input object in a way that ignores case (i.e., converting both to lowercase before comparing).

Misspellings are more complex than that, possibly using Levenshtein distance and a threshold (or finding the closest match and suggesting that).

The latter is an expensive operation, but it would only have to be run when there's an error (or a presumed error), and it would mostly delay the error message, not interfere with normal Plot operation.
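The two checks described above could be sketched roughly as follows. `suggestKey` and `levenshtein` are hypothetical names, and the default threshold of 2 edits is an arbitrary assumption; nothing here reflects how Plot actually reports errors.

```javascript
// Levenshtein distance using a single rolling row of the DP table.
function levenshtein(a, b) {
  const row = Array.from({length: b.length + 1}, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value of d[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of d[i-1][j]
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Hypothetical helper: returns a suggested key, or null if the key already
// exists or nothing is close enough. maxDistance is an assumed threshold.
function suggestKey(key, keys, maxDistance = 2) {
  if (keys.includes(key)) return null; // exact match, nothing to suggest
  const lower = key.toLowerCase();
  // Cheap check first: same key, different capitalization.
  const caseMatch = keys.find((k) => k.toLowerCase() === lower);
  if (caseMatch !== undefined) return caseMatch;
  // More expensive check: closest key within the edit-distance threshold.
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const k of keys) {
    const d = levenshtein(lower, k.toLowerCase());
    if (d < bestDistance) {
      best = k;
      bestDistance = d;
    }
  }
  return best;
}

const keys = ["City", "delivery"];
console.log(suggestKey("city", keys)); // capitalization error
console.log(suggestKey("delivert", keys)); // misspelling
```

Because `suggestKey` would only run once a key lookup has already failed, its cost is paid on the error path only.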

@mbostock mbostock mentioned this issue Feb 7, 2023
13 tasks
Fil added a commit that referenced this issue Mar 14, 2023
…string except when the value is NaN)

Closes #1334

related to #493
This was referenced Mar 14, 2023
mbostock added a commit that referenced this issue Mar 15, 2023
* Guard against formatDefault returning undefined (it always returns a string except when the value is NaN)

Closes #1334

related to #493

* coalesce null to empty string

* DRY

---------

Co-authored-by: Mike Bostock <[email protected]>
mstade commented Nov 20, 2023

It would be useful to also generate a warning when the given data as a whole is nullish, e.g. Plot.lineY(undefined, { x: 'date', y: 'population' }). I've been going a little bit nuts trying to figure out which one of the 7-8 plots on my dashboard was throwing a seemingly random Error: missing scale: y.

Not a fault of Plot that my data is broken of course, but a message like Error: lineY data series is undefined or something more to the point would at least have helped narrow it down.

The documentation does state that "Missing and invalid data are handled specifically for each mark type and channel," but this seems in my (admittedly limited) testing to only hold true for datums, not the series as a whole. Simply replacing undefined with [] in my case did the trick for the mark throwing errors.
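A guard along these lines might look like the sketch below. `checkData` is a hypothetical function, not something Plot provides, and it assumes the mark in question requires data; the message wording is only an example.

```javascript
// Hypothetical guard (not part of Plot): fails early with a message that
// names the mark when the data as a whole is null or undefined.
function checkData(data, markName) {
  if (data == null) {
    const kind = data === null ? "null" : "undefined";
    throw new Error(`${markName} data is ${kind}; pass an array (possibly empty) instead`);
  }
  return data;
}

try {
  checkData(undefined, "lineY");
} catch (error) {
  console.log(error.message);
}

// An empty array passes through unchanged, matching the workaround above.
console.log(checkData([], "lineY").length);
```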

chaichontat pushed a commit to chaichontat/plot that referenced this issue Jan 14, 2024