Skip to content

layouts: dodge, hexbin #775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

layouts: dodge, hexbin #775

wants to merge 7 commits into from

Conversation

Fil
Copy link
Contributor

@Fil Fil commented Feb 23, 2022

supersedes #648 #711 #713 #740

There were too many conflicts to rebase on #648. The main difference is that instead of acting only on the values of one facet, a layout now receives the indices of all facets, and must return the index, (scaled) values, and possibly new channels that will be scaled in a second pass. It is allowed to hint at the scale defaults it wants (but they are always overrided if the scale already exists because of another mark).

Fixes #164.

@mbostock
Copy link
Member

mbostock commented Feb 24, 2022

There were too many conflicts to rebase

Here’s a pattern that I just used, if it helps. First, checkout the main branch (make sure you’re up-to-date):

git checkout main

Then create a new, temporary branch:

git checkout -b mbostock/feature-squash

Then squash merge the existing branch mbostock/feature into this temporary branch:

git merge --squash mbostock/feature

Commit after resolving differences:

git commit

Then switch back to the feature branch:

git checkout mbostock/feature

Reset it to the squashed merge:

git reset --hard mbostock/feature-squash

Push it up to GitHub:

git push -f

And lastly clean up the temporary branch:

git branch -D mbostock/feature-squash

This workflow will squash any branch down to a single commit, minimizing the number of conflicts you need to resolve.

@Fil
Copy link
Contributor Author

Fil commented Feb 25, 2022

Thanks! fwiw the relevant diff with #648 is the following:

diff --git a/src/layouts/dodge.js b/src/layouts/dodge.js
index 5e737151..862a9c9d 100644
--- a/src/layouts/dodge.js
+++ b/src/layouts/dodge.js
@@ -1,7 +1,7 @@
 import {max} from "d3";
 import IntervalTree from "interval-tree-1d";
-import {maybeNumberChannel} from "../options.js";
 import {layout} from "./index.js";
+import {finite, positive} from "../defined.js";
 
 const anchorXLeft = ({marginLeft}) => [1, marginLeft];
 const anchorXRight = ({width, marginRight}) => [-1, width - marginRight];
@@ -39,43 +39,48 @@ export function dodgeY(dodgeOptions = {}, options = {}) {
 }
 
 function dodge(y, x, anchor, padding, options) {
-  const [, r] = maybeNumberChannel(options.r, 3);
-  return layout(options, (I, scales, {[x]: X, r: R}, dimensions) => {
+  return layout(options, function(index, scales, values, dimensions) {
+    let {[x]: X, [y]: Y, r: R} = values;
+    const r = R ? undefined : this.r !== undefined ? this.r : options.r !== undefined ? +options.r : 3;
     if (X == null) throw new Error(`missing channel: ${x}`);
     let [ky, ty] = anchor(dimensions);
     const compare = ky ? compareAscending : compareSymmetric;
-    if (ky) ty += ky * ((R ? max(I, i => R[i]) : r) + padding); else ky = 1;
-    if (!R) R = new Float64Array(X.length).fill(r);
-    const Y = new Float64Array(X.length);
-    const tree = IntervalTree();
-    for (const i of I) {
-      const intervals = [];
-      const l = X[i] - R[i];
-      const r = X[i] + R[i];
-
-      // For any previously placed circles that may overlap this circle, compute
-      // the y-positions that place this circle tangent to these other circles.
-      // https://observablehq.com/@mbostock/circle-offset-along-line
-      tree.queryInterval(l - padding, r + padding, ([,, j]) => {
-        const yj = Y[j];
-        const dx = X[i] - X[j];
-        const dr = R[i] + padding + R[j];
-        const dy = Math.sqrt(dr * dr - dx * dx);
-        intervals.push([yj - dy, yj + dy]);
-      });
-
-      // Find the best y-value where this circle can fit.
-      for (let y of intervals.flat().sort(compare)) {
-        if (intervals.every(([lo, hi]) => y <= lo || y >= hi)) {
-          Y[i] = y;
-          break;
+    if (ky) ty += ky * ((R ? max(index.flat(), i => R[i]) : r) + padding); else ky = 1;
+    if (!R) R = values.r = new Float64Array(X.length).fill(r);
+    if (!Y) Y = values[y] = new Float64Array(X.length);
+    for (let I of index) {
+      const tree = IntervalTree();
+      I = I.filter(i => finite(X[i]) && positive(R[i]));
+      for (const i of I) {
+        const intervals = [];
+        const l = X[i] - R[i];
+        const r = X[i] + R[i];
+  
+        // For any previously placed circles that may overlap this circle, compute
+        // the y-positions that place this circle tangent to these other circles.
+        // https://observablehq.com/@mbostock/circle-offset-along-line
+        tree.queryInterval(l - padding, r + padding, ([,, j]) => {
+          const yj = Y[j];
+          const dx = X[i] - X[j];
+          const dr = R[i] + padding + R[j];
+          const dy = Math.sqrt(dr * dr - dx * dx);
+          intervals.push([yj - dy, yj + dy]);
+        });
+  
+        // Find the best y-value where this circle can fit.
+        for (let y of intervals.flat().sort(compare)) {
+          if (intervals.every(([lo, hi]) => y <= lo || y >= hi)) {
+            Y[i] = y;
+            break;
+          }
         }
+  
+        // Insert the placed circle into the interval tree.
+        tree.insert([l, r, i]);
       }
-
-      // Insert the placed circle into the interval tree.
-      tree.insert([l, r, i]);
+      for (const i of I) Y[i] = Y[i] * ky + ty;
     }
-    return {[y]: Y.map(y => y * ky + ty)};
+    return {index, values};
   });
 }

@mbostock mbostock force-pushed the fil/transform-layout branch 2 times, most recently from 09d164d to 3971461 Compare March 2, 2022 19:14
@mbostock mbostock force-pushed the fil/transform-layout branch from 3971461 to d6f750f Compare March 2, 2022 19:18
let [ky, ty] = anchor(dimensions);
const compare = ky ? compareAscending : compareSymmetric;
if (ky) ty += ky * ((R ? max(index.flat(), i => R[i]) : r) + padding); else ky = 1;
if (!R) R = values.r = new Float64Array(X.length).fill(r);
Copy link
Member

@mbostock mbostock Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mutating an input argument (values). Watch out for this! (I’ll fix.)

scaleDescriptors = {...newScaleDescriptors, ...scaleDescriptors};
scales = ScaleFunctions(scaleDescriptors);
for (const layout of layouts) {
if (layout.channels) layout.values = {...layout.values, ...applyScales(layout.channels, scales)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly this is mutating layout, which was returned by mark.layout; we shouldn’t mutate an object we didn’t create (e.g. the mark might use memoization).

Comment on lines +177 to +185
for (const [, channel] of channels) {
const {scale} = channel;
if (scale !== undefined) {
const {percent, transform = percent ? x => x * 100 : undefined} = options[scale] || {};
if (transform != null) channel.value = Array.from(channel.value, transform);
if (scaleChannels.has(scale)) scaleChannels.get(scale).push(channel);
else scaleChannels.set(scale, [channel]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicating code from Plot.plot:

plot/src/plot.js

Lines 41 to 49 in 352d086

for (const [, channel] of channels) {
const {scale} = channel;
if (scale !== undefined) {
const {percent, transform = percent ? x => x * 100 : undefined} = options[scale] || {};
if (transform != null) channel.value = Array.from(channel.value, transform);
if (scaleChannels.has(scale)) scaleChannels.get(scale).push(channel);
else scaleChannels.set(scale, [channel]);
}
}

const newOptions = {};
for (let i = 0; i < marks.length; ++i) {
const mark = marks[i];
const {channels} = layouts[i] = applyLayout.call(mark, mark.layout, [markIndex.get(mark)], scaleDescriptors, markChannels.get(mark), dimensions, options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is ignoring the index and values returned by mark.layout. I think I follow how the values are being ignored because the code currently relies on mutation (https://github.com/observablehq/plot/pull/775/files#r818901600). I’m a little surprised that it’s ignoring the index though, here. We don’t ignore the index in the faceting case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, never mind. The layout is stored in layouts and then layout.index is referenced later.

@mbostock
Copy link
Member

mbostock commented Mar 3, 2022

(I’m mostly making notes to myself as I go through this code. You don’t need to respond to any of the comments above but we can talk about them soon!)

let index = markIndex.get(mark);
if (mark.filter != null) index = mark.filter(index, channels, values);
const node = mark.render(index, scales, values, dimensions, axes);
let {index: [I], values} = layouts[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that layout.index is nested, we should probably call it layout.facets instead of layout.index. We make a similar distinction in that mark.transform returns {data, facets} (rather than {data, index}), and when the mark is not faceted, then facets is [index].

Copy link
Member

@mbostock mbostock Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it’s also the case, confusingly, that mark.initialize always returns {index, channels} and index may either be a flat array for non-faceted marks or a nested array for faceted marks. (Though markIndex will always contain a flat index.)

function applyLayout(layout, index, scaleDescriptors, channels, dimensions, options) {
const scales = ScaleFunctions(scaleDescriptors);
const values = channels ? applyScales(channels, scales) : {};
return layout ? layout.call(this, index, scaleDescriptors, values, dimensions, options) : {index, values};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is passing the entire options object to mark.layout. I don’t think we want to expose that much information to the layout—it risks making the layouts unwieldy because their behavior could be dependent on anything in the plot specification. It looks like this options is only needed for faceting (not by our current hexbin and dodge layouts), so, I suspect we’ll need to find a different way to implement faceted layouts to avoid exposing options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also passing scaleDescriptors to mark.layout, but I think we want scales (i.e., scale functions) for symmetry with mark.render. In any case the scales are not used by the current layouts.

const channels = [];
const newIndex = [];
for (const o of outputs) {
o.initialize(this.data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is referencing mark.data, rather than whatever the possible prior mark.transform returned as data during mark.initialize. Probably this means we’ll need to pass the (possibly transformed) data to the layout rather than expecting layouts to reference this.data.

const {channels} = layouts[i] = applyLayout.call(mark, mark.layout, [markIndex.get(mark)], scaleDescriptors, markChannels.get(mark), dimensions, options);
if (channels) {
for (const [, {scale, options: hint}] of channels) {
if (scale) newOptions[scale] = {...hint, ...options[scale]};
Copy link
Member

@mbostock mbostock Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have multiple layouts that define multiple channels for a given scale, or even a single layout that defines multiple channels for the same scale, this won’t compose the scale options; instead, it’s “last one wins”, i.e., the last channel definition for a given scale is respected and the scale options defined by other channels produced by layouts are ignored. Probably we need to accumulate into newOptions. But I’m also curious if we can sidestep this approach entirely by using channel hints rather than letting layouts inject scale options directly.

@@ -279,17 +327,57 @@ class Facet extends Mark {
}
marksChannels.push(markChannels);
}
this.layout = function(index, scaleDescriptors, values, dimensions, options) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn’t assign this method as a side-effect of calling facet.initialize; it should be a method on the Facet class.

Comment on lines +361 to +366
const newScaleDescriptors = Scales(newScaleChannels, newOptions);
Object.assign(scaleDescriptors, newScaleDescriptors, scaleDescriptors);
const rescales = {...ScaleFunctions(newScaleDescriptors), ...scales};
for (let i = 0; i < marks.length; ++i) {
marksValues[i] = {...marksValues[i], ...layouts[i].values, ...applyScales(layouts[i].channels || [], rescales)};
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code for facets is closely related to this non-faceted code, and it would be nice to unify them somehow:

plot/src/plot.js

Lines 117 to 122 in 352d086

const newScaleDescriptors = Scales(newScaleChannels, newOptions);
scaleDescriptors = {...newScaleDescriptors, ...scaleDescriptors};
scales = ScaleFunctions(scaleDescriptors);
for (const layout of layouts) {
if (layout.channels) layout.values = {...layout.values, ...applyScales(layout.channels, scales)};
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is the case for a lot of code, and I've wondered several times if we should not route everything through the faceted code, even when there is no faceting (the non-faceted case being just 1 facet with all the data, and null fx/fy). This might also help to clear up the difference between faceted and non-faceted indexes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe! I tried before and the current state in main is the best I could come up with, but there’s always room for improvement. 😁 A little code duplication isn’t harmful, especially if it’s not visible externally, and there are usually ways to extract shared logic into shared helper functions to minimize disruption. This was just a reminder to look for a way to reduce the duplication, if possible.

(I’ve broken faceting in my working branch, but I’ll get back to it soon.)

// * applying the scales, possibly pushing a new one (for legends)
// * returning a filtered index and values to render, possibly new channels to rescale
const newScaleChannels = new Map();
const layouts = [];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This array overlaps with markIndex: layouts[i].index is equal to markIndex.get(marks[i]) except in the case where the layout returned a new index. We should ideally normalize the state so there are no duplicate (and obsolete or inconsistent) representations.


function composeLayout(l1, l2) {
return function(index, scales, values, dimensions) {
values = l1.call(this, index, scales, values, dimensions);
Copy link
Member

@mbostock mbostock Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the redesigned signature, mark.layout returns {index, values, channels}, so we don’t want to assign to values here. Similarly the implementation of partialLayout will need to change.

import {maybeOutputs} from "../transforms/group.js";

const defaults = {
ariaLabel: "hex",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defaults here mean something different than they do for marks, where they are passed as a separate argument to the Mark constructor: these defaults mean default options. If you specify {ariaLabel: "hex"}, it means that you want an ariaLabel channel derived from the “hex” field (d => d.hex) of data. When I fixed mark filtering this was causing all the marks to be dropped because the ariaLabel channel values were [undefined, undefined, undefined, …]. Oops! 😄

const newIndex = [];
for (const o of outputs) {
o.initialize(this.data);
o.scope("data", index);
Copy link
Member

@mbostock mbostock Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call shouldn’t be necessary—calling output.initialize will invoke the reducer:

if (reducer.scope === "data") {
context = reducer.reduce(range(data), V);
}

Also here index is a nested array (like facets).

Comment on lines +117 to +118
const newScaleDescriptors = Scales(newScaleChannels, newOptions);
scaleDescriptors = {...newScaleDescriptors, ...scaleDescriptors};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we’re constructing new scales here, we have to be careful because it’s possible that a layout can define new channels for a scale that already existed and had existing channels. This code only considers the new channels when creating new scales, but there might be some existing channels, too—we need to combine the new and old channels together for each scale that needs to be reconstructed after computing the layouts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this question I took the view that applying a layout would not recompute the values for the older channels (they can operate on their mark only), and would not be allowed to modify an existing scale once it's been instantiated (in order to avoid any loop if one layout derives color from x, and the other derives x from color). Maybe overly simplistic?

Copy link
Member

@mbostock mbostock Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, it computes the new scale but then it gets overwritten by the old definition because ...scaleDescriptors is second. I missed that.

I’m proposing instead a new scale is derived from the union of old and new channels, just like if you added another mark to a plot, replacing the old scale definition.

Copy link
Member

@mbostock mbostock Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it’s not quite working like I expect. I wonder what I’m missing. For the hexbinDot example, there’s an old channel associated with the r scale that represents the input to the hexbin:

[3750, 3800, 3250, NaN, 3450, 3650, 3625, 4675, 3475, 4250, ]

And the new channel associated with the r scale represents the output of the hexbin:

[24, 25, 26, 4, 7, 4, 3, 10, 1, 1, 23, ]

The old channel probably shouldn’t be associated with the r scale since it’s unscaled. (The numbers represent the mass of penguins in grams, not radii in pixels.) I need to dig in a bit deeper to figure out what’s going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think the way to think about this is that the r channel associated with the dot mark is getting replaced by the hexbin layout. So we don’t want to union the two channels here—we just want the new channel. But we’ll need to carefully track which channels are getting replaced (same name, same mark) to determine the surviving set of channels associated with each scale.

values[o.name] = o.output.transform();
}
}
if (!channels.find(([key]) => key === "r")) values.r = Array.from(values.x).fill(radius);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than constructing an r channel when the radius is constant, we can instead populate the r option as a constant up above.

const bins = hbin(I, X, Y, radius);
for (const o of outputs) {
o.scope("facet", I);
for (const bin of bins) o.reduce(bin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are typically many more bins than outputs, so I suspect it’s more efficient to do another for (const o of outputs) inside the for (const bin of bins) loop below, rather than another for (const bin of bins) loop here.

@mbostock
Copy link
Member

mbostock commented Mar 3, 2022

The hexbin layout will probably need an optional group by z, like the bin transform does. This would let you have overlapping hexagons of different colors, for example. Related #197.


export function hexbin(outputs, options) {
([outputs, options] = mergeOptions(outputs, options));
const {radius, ...inputs} = options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both hexbin and hexgrid, I think we’ll want to use r instead of radius for consistency with dot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it’s complicated. We can’t just use r because you might want to use the r channel, e.g.,

Plot.hexbin({r: "sum"}, {x: "culmen_depth_mm", y: "culmen_length_mm", r: "body_mass_g"})

so you need a separate option to specify the size of the hexagonal grid.

At the same time, though, it would really be nice to say:

Plot.hexbin({fill: "count"}, {x: "culmen_depth_mm", y: "culmen_length_mm", r: 20})

Perhaps we can find a way to make both work.

}

// Allow hexbin options to be specified as part of outputs; merge them into options.
function mergeOptions({radius = 10, ...outputs}, options) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default radius of 10 is defined this way, then no default will be provided if options says something like {radius: undefined}.

@mbostock
Copy link
Member

mbostock commented Mar 4, 2022

Something to think about in the case of the hexbin transform: the scales get computed twice. For example when you say this:

Plot.dot(penguins, Plot.hexbin({r: "count"}, {x: "culmen_depth_mm", y: "culmen_length_mm", r: "body_mass_g"}))

The first r scale is computed given individual observations (e.g., 344 values representing body mass in grams of penguins). Then the hexbin layout produces a new r channel for the hexagonal bins (e.g., 43 values representing counts of observations in each bin). The r scale is then reconstructed using this new channel and the original scale definition is discarded.

It’d be nice if somehow we could avoid the construction of the first r scale since it won’t ever be used. And likewise there’s no reason to materialize the scaled r channel values using this first r scale. But we won’t know that the r channel will be replaced until the layout is invoked. And we don’t know what channels the layout might consume. So, I guess it’s unavoidable in our current design, but it’d be slightly better maybe if the layout could declare statically what channels it consumes and produces so that we could avoid unnecessary computation.

@mbostock
Copy link
Member

mbostock commented Mar 4, 2022

The hexbin layout will probably need an optional group by z, like the bin transform does.

Related to this, the hexbin layout should function more similarly to the bin transform in that it should define a transform option. For example if you look at the transformed data that is generated by the bin transform, you get this (i.e., nested arrays of the input data):

[Array(1), Array(7), Array(9), Array(4), Array(1), Array(5), Array(10), …]

But since the hexbin layout doesn’t similarly define the transform option, the data is unchanged (the 344 rows of penguin observations).

This distinction isn’t currently visible, but it will be when it comes to things like interaction, where we might want to e.g. allow brushing over hexbins.

@mbostock
Copy link
Member

mbostock commented Mar 4, 2022

But since the hexbin layout doesn’t similarly define the transform option…

I mean, we can’t use the existing transform option since that is invoked too soon—for hexbins we need to wait until after the scales have been constructed and applied to x and y. But it means that layouts in effect will need to be a superset of transforms, and that means they need to be able to return new data and not just a new index.

Comment on lines +93 to +101
for (const o of outputs) {
if (o.name in rescales) {
const {scale, options} = rescales[o.name];
const value = o.output.transform();
channels.push([o.name, {scale, value, options}]);
} else {
values[o.name] = o.output.transform();
}
}
Copy link
Member

@mbostock mbostock Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using the “materialized” representation of a channel, similar to that returned by the Channel constructor, rather than the channel “descriptors” that are used e.g. by the Mark constructor and mark.channels. Channel descriptors are materialized inside of the mark.initialize, given data. I wonder if it would be more appropriate to use the “channel descriptor” representation here, rather than having layouts return materialized channels, so as to avoid introducing a new public signature for channel objects.

This is the sort of thing that makes me wish we used TypeScript, since we’d be forced to be explicit about the shapes of these objects. We’ll get there eventually I guess. 😄

@mbostock mbostock mentioned this pull request Mar 11, 2022
16 tasks
@mbostock
Copy link
Member

Superseded by #801. 😅

@mbostock mbostock closed this Mar 11, 2022
@Fil Fil mentioned this pull request Jan 16, 2023
8 tasks
@Fil Fil deleted the fil/transform-layout branch April 5, 2023 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beeswarm plot?
2 participants