-
Notifications
You must be signed in to change notification settings - Fork 69
Description
Deepscatter Review
This is still subject to change and is not fully documented. The encoding portion of the API mimics Vega-Lite with some minor distinctions to avoid deeply-nested queries and to add animation and jitter parameters.
The team has been super supportive and responsive and I am writing this first and foremost out of a place of good faith as an end user and to give the developer team feedback which may be of interest to them.
TL;DR
4/5 + GitHub ⭐️
The good: Built for a niche use case and great for what it is. It's maintainers are active and supportive. Despite being under active development, it works (not always the case with in-development software) and examples work as advertised.
The bad: I am inpatient! I want it to be finished and fully polished already because the potential is amazing.
Requested features
3D
Resize-able
This is very important to my use case. Who likes an overflowing or underfilled div?
I know that my prayers are in the process of being answered (PR#81) @bschmidt = 😇.
Flip y-axis
In a discussion with @bschmidt we talked about why a plot I created was "flipped" when rendered. To the unfamilair this stems from how the web renders pages with
I firmly think the default behavior should be flipped. Why? Well they use and adopt many of d3's conventions. While this is also the default behavior of d3, the difference is exposure. The Scatterplot
object returned by new Deepscatter(htmlID, width, height)
does not have a convenient method exposing the y-axis's range e.g. plot.yAxis.range()
. Instead all of this must be specified fully in the plotAPI.encoding
(based on, but divergent from Vega-Lite).
The point being, that clearly deepscatter
, just by the name having "scatter" in it means that it is at a higher level of abstraction than d3 and yet, requiring the user to specify the range in the plotAPI.encoding.y
channel is still coupled to these lower levels. By default it should render the way most people expect that it should render as.
Calculate domain and range
Similar to flipping the y-axis since there is not a plot.yAxis.range()
method, there are not plot.getDomain(feature)
and plot.getRange(feature)
methods. This matters. As stated deepscatter
is at a higher level of abstraction. So without these methods we, the end user, must also schlep around meta-data that isn't even accessible under plot._iroot.schema
, plot._iroot.schema.fields
, etc. This is even more frustrating as an end user as to use deepscatter
we explictily have to use quadfeather
. Additionally, as stated above, how plot
handles the arguments of plotAPI
, especially plotAPI.encoding
changes depending on how much information is provided. So specifying the correct range but not the domain yields undesirable behavior and vis-versa (doubly so for colors).
Thankfully I am not completely alone in this opinion:
"The challenge here is that we can’t confidently calculate extent of metadata in deepscatter because there is not a guarantee that the browser has access to all the data in the dataset--if you have 100 million points, we won’t load all of them."
It might be possible and reasonable to do this in quadfeather though and add it as column metadata.
@bschmidt
Some other meta data for consideration:
- the total number of points in the dataset should also be included somewhere (to set a "max number of points" slider).
- categories of categorical features (or at least a method that loops over the current categories)
Fix encodings
By far the most aggrevating part of deepscatter
as an end user are the channels encoding
. While having the ability to specify these things is nice, per all my previously mentioned notes combined the number of channels, it aggreagates to be a very clunky feeling interface.
Additionally, per this demo (featherplot
(svelte) Pages) one can see that
encoding = {
color: {
field: "conditions",
domain: [0, 3] // NOTE: <--- categorical i.e. 0, 1, 2, 3, but values
// are "condition 1", "condition 2", "condition 3", etc
},
}
it doesn't seem easy to have a categorical colors, even though the exported column was categorical.
Likewise, if range: "viridis"
is not specified, even with domain: [-3.97, 3.25]
, you will not see a color gradient.
Under the section recommendations, I suggest some ways encoding
can be made more friendly for users.
Fix plotAPI
Sometimes plot.plotAPI(...opts)
can handle rapid changes in values. Othertimes, it can not. Going back to featherplot
(svelte) Pages, moving the number of points slider rapidly works.
[Error] Error: (regl) cannot cancel a frame twice
f — deepscatter.39d51a89.js:2:90
c — deepscatter.39d51a89.js:2:149
Be — deepscatter.39d51a89.js:8:52497
(anonymous function) — deepscatter.39d51a89.js:1688:31555
(anonymous function) (2.5a5494e2.js:32:8792)
Inconsistencies
Types vs Usage
plotAPI(...opts)
takes an APICall
object which has:
type APICall = {
...,
background_options?: BackgroundOptions
}
type BackgroundOptions = {
...,
color?: string
}
but... background_color
is directly read from prefs
:
ctx.fillStyle = prefs.background_color || 'rgba(133, 133, 111, .8)';
So setting prefs.background_options.color
does nothing (yes, see disclaimer at the top of this. API is subject to change).
Complex Encoding Types
Disclaimer: I make no claims at being an expert programmer, especially in TypeScript. This is the most complex project that I am not the primary author of using TypeScript that I have investigated.
It appears that there is muddled use of undefined
and null
. Consider the Encoding
type:
export type Encoding = {
x?: RootChannel;
y?: RootChannel;
color?: null | ColorChannel;
size?: null | RootChannel;
shape?: null | RootChannel;
filter?: null | FunctionalChannel;
filter2?: null | FunctionalChannel;
jitter_radius?: null | JitterChannel;
jitter_speed?: null | RootChannel;
x0?: null | RootChannel;
y0?: null | RootChannel;
position?: string;
position0?: string;
foreground?: null | FunctionalChannel;
};
So what is the difference between undefined
and null
undefined
: declared variable, but unassigned valuenull
: can be used to explicity indicate abscence sinceundefined
is "missing".
Yet these null
values are used for whether or not the value exists at all, and internally
if (prefs.zoom !== undefined) {
// ...
if (prefs.zoom === null) {
// ...
}
// ...
}
// ...
if (prefs.label !== undefined) {
// ...
if prefs.labels === null {
// ...
}
// ...
}
put these cases could be handled by checking argument input, or conditional operators
// option 1: check if null then coerce to defaults
if (prefs.zoom === null) {
// set to default values, set to undefined, or raise error
}
// ...
// with defaults set if zoom is provided
if (prefs?.zoom) {
}
// option 2: just add it as part of if / else
if (prefs.zoom === null) {
// ...
} else if (prefs?.zoom) {
// ...
}
// option 3: break channels into 4 cases:
// null, undefined, empty object, non-empty value
const isObj = (o) => o !== null && typeof o === 'object'
const isEmptyObj = (o) => (isObj(o) && Object.keys(0).length === 0)
switch (prefs.zoom) {
case null:
// handle explicilty missing
break;
case undefined:
// handle not set
break;
default:
if (isEmptyObj(o)) break;
// not empty obj
}
const handleArg = (a, doNull, doUndef, doArg, doEmpty) => {
switch (a) {
case null:
return doNull(a)
case undefined:
return doUndef(a)
default:
if (isEmptyObj(a)) return doEmpty(a)
return doArg(a)
}
}
Why basically complain about this? RootChannel
is too complex a type and it gives me squiggles when working with it. Encoding
is even worse because it encoding.x
can be null
, undefined
, string
, or basically an object with differnet fields. I don't like squiggles and type guards (at least the ones I wrote) don't handle it well.
Quadfeather
Remove required columns
Per this note:
Future codebase splits
The plotting components and the tiling components are logically quite separate; I may break the tiling strategy into a separate JS library called 'quadfeather'.
quadfeather
was in fact abstracted away. While we - the development team and I going back and forth with some reproducable MWEs - found some bugs with CSV
file formats (see MWE: missing points from CSV), as on their docs it shows:
quadfeather --files tmp.csv --tile_size 50_000 --destination tiles
"There was some old code hanging around, that was designed to aid a 3d version of quadfeather and so treated columns with the name
z
differently than other metadata columns (basically, it tried to start building an oct-tree instead of a quadtree, and had nowhere to put half the data.)"
I still have greviances with quadfeather
. Namely, it does not work if you do not have a x
and y
column in the DataFrame. Very annoying. Per my previous comments, deepscatter
is at a higher level of abstract than d3
, my data doesn't need an x
and y
column. I should only need to tell deepscatter
what x
and y
are. However, due to quadfeather
, I am going to have to have x
and y
.
Inefficient sidecars
In my discussions with @bschmidt I informed them of my use case. Namely, I am working with Single-Cell data. I have an embedding (like PCA, t-SNE, PHATE, etc) with
Using a yet published to quadfeather
script add_sidecars.py
(courtesy of @bschmidt) I was able to partition my gene features and add them to my dataset (see MWE: quadfeather
with sidecars, and live here).
However: when scaled to my actual data. A 2.56
GiB parquet file balloons to comical 474.31
GiB. So that is not really a practical option. Yes, I did double check to make sure I was using doubles (float32
).
After add_sidecars.py
with float32
:
After add_sidecars.py
with float16
:
This is especially bad as my original solution was quadfeather
.
Recommendations
It wouldn't be fair to just state my grievances when working the package without providing some ideas on how to perhaps tackle it, if not improve it.
In general my recommendations break down into 3 parts:
- create a new class (or augment the
Scatterplot
class) to handle the interface and tradeoff of state between the user and thedeepscatter
library, - expose methods for setting and getting individual aspects of
APICall
, and - develop individual svelte components for each argument of
APICall
.
So lets break down why I think this is the way to go moving forward and how it may help improve the deepscatter
library.
A case for a state class
I understand deepscatter
is meant to be framework agnostic. So I am not going to explicilty state that this should be a svelte store (see the next section contributions). However, I believe having an interface layer between the Scatterplot
and the end user may be of interest to the developers of deepscatter
, espeically as things (like the API) are still in flux. By having an intermediate layer you can change anything you want about deepscatter
's Scatterplot
without affecting the user endpoints. Additionally it will also make it easier for others to contribute in a non-breaking way. For example, if they want to add features, by working on this state middleware, the complex and core logic of reading the quadfeather
data remains safe and unchanged. In addition, it allows more control over when the middleware state changes actually trigger Scatterplot
changes (e.g. by debouncing / checking whether or not the plot is ready like its queue, etc).
Expose getters and setters
Much like an the state class, these will allow the user's endpoint to be unaffected by internal changes. It also makes deepscatter
more declaritive and d3
-like. For example one could imagine:
const plot = new Deepscatter(...args, width, height)
.x('PCA 1').xscale('log').xdomain('auto')
.y('PCA 2').xscale('linear').ydomain('auto')
.yrange([1, 0])
.c('index').color('viridis')
.autosize(true)
.bgColor('transparent')
Development via individual svelte components
A couple of reasons.
-
it is sort of like extra tests and auto documentation all in one. Want to work on animations for changing the x channel of
encoding
? An axis select lets you do this and you can point users there for how to deal with encodings. -
it makes it easier for users to build on top of. Rather than have to go through the code base to find out what has changed, etc they can start with that component.
-
it makes it easier for users to use your library. They can just take the components and run with it.
Contributions
Note: to the developers of deepscatter
, feel free to take any of the code, ideas, MWEs, etc from the links below if you feel like they help. These are how I found the bug regarding quadfeather
and CSVs, the discrepancies of APICall
and prefs
, the lack of scalability for add_sidecars.py
.
Given the discussions I have had with @bschmidt, my experience with deepscatter
(frustration and excitement), and this post I pasting some links which has
some notebooks for reproducing MWEs and demonstrating how I am using deepscatter
.
Of note, I made two libraries, both called featherplot
. One is in python, the other in svelte. The python
Python
This notebook from the python library show cases some of the frustrations I have had with quadfeather
with deepscatter
.
# load data
# ...
# Class for renaming embedding columns automatically for quadfeather
qfr = QuadFeatherRenamer(df_all)
# ...
# Grab metadata from dataframe and be ready to partition into main dataframe and sidecars
d2m = DataFrameToMetadata(
df_q,
include_index=True,
embedding='x y z'.split(),
alt_names={v:k for k,v in renamed.items()}
)
succ, fail = d2m.convert()
meta = d2m.to_meta()
# ...
!quadfeather --files {p_file} \
--tile_size {tile_size} \
--destination {qf_dir}
# ...
feather.write_feather(d2m.df.drop(columns=d2m.embedding), f_file)
!featherplot add-sidecars --tileset {qf_dir}\
--sidecar {f_file} --key {d2m.df.index.name};
Svelte
Likewise the svelte version does a few things of note:
- simplifies types
- implements store
- MWE playground for debugging
usage
On +page.svelte
we can see how a store can simplify the rendering of a deepscatter
object:
// imports
// ...
// for data from +page.ts
export let data
$: meta = data?.meta
onMount(async() => {
await meta
plotStore.url = `${base}${meta.tiles_dir}`
plotStore.totalPoints = meta.n_points // for max point slider
plotStore.embedding = meta.embedding // columns for showing on x / y
plotStore.sidecars = meta.sidecars // columns for coloring by
plotStore.columns = meta.columns_metadata // domain / range / categories of all columns
// NOTE: setters will auto create the RootChannels based on metadata :)
plotStore.xField = meta.embedding[0] // 'x'
plotStore.yField = meta.embedding[1] // 'y'
plotStore.cField = meta.sidecars[0] // 'conditions'
})
</script>
then for the HTML:
<div class="grid grid-cols-4 h-screen justify-content-center">
<div class="">
</div>
<div class="col-span-2">
<FeatherPlot class="border-2 border-sky-400"/>
<div class="bg-slate-100 p-4 max-h-92 overflow-scroll">
<MaxPointsSlider on:change={handleChange} />
<AxesSelect on:change={handleChange} />
<ColorSelect on:change={handleChange} />
<ZoomCall on:change={handleChange} />
</div>
</div>
<div class="">
</div>
</div>
yay, so simple.
types
We can also simplify types.ts:
// ...
export type BaseConstantChannel<T> = {
constant: T;
};
export type ConstantBool = BaseConstantChannel<boolean>;
export type ConstantNumber = BaseConstantChannel<number>;
export type ConstantColorChannel = BaseConstantChannel<string>;
export type ConstantChannel = ConstantBool | ConstantNumber | ConstantColorChannel;
export interface ChannelBase {
field: string;
[name: string]: any;
}
export interface ConditionalChannel extends ChannelBase {
a: number;
b?: number;
op: Conditional;
}
export interface LambdaChannel extends ChannelBase {
lambda: string;
domain: Domain;
range: Range;
}
export interface BasicChannel extends ChannelBase {
transform?: Transform;
domain?: MinMax;
range?: MinMax;
[name: string]: any;
}
export interface CategoricalChannel extends ChannelBase {}
Now all the channels have a common ancestor due to their requirement of the field
field.
state
and in PlotStore.ts
, we find a bunch of useful stuff
// ...
public async LoadDeepscatter() {
const Scatterplot = await import('deepscatter');
this.Deepscatter = Scatterplot.default;
this.debug({ status: 'Deepscatter imported'}, Scatterplot);
}
// ...
// the autofit stuff from ealier
// ...
private _schema: any | null = null;
get schema() {
return this._schema;
}
set schema(s: any | null) {
this._schema = s;
}
get fields() {
return (this.ready ? this.plot?._root?.schema?.fields : []);
}
get extents() {
return (this.ready ? this.plot?._root?._extents : {}) as Extents;
}
get xExtent() { return this.extents?.x || null; }
get yExtent() { return this.extents?.y || null; }
get zExtent() { return this.extents?.z || null; }
// ...
// ...
get xField() {
return this.getEncodingField('x');
}
// NOTE: setting store.xField automatically create the RootChannel!
set xField(x: string | undefined) {
this.setEncoding('x', this.getColumnAsEncoding(x));
}
// ...
// can easily check x, y, and color encodings
get x() {
if (!this.xField) return undefined;
if (!this.columns) return undefined;
return this.getColumnAsEncoding(this.xField as EncodingKey);
}
get y() {
if (!this.yField) return undefined;
if (!this.columns) return undefined;
return this.getColumnAsEncoding(this.yField as EncodingKey);
}
get color() {
if (!this.cField) return undefined;
if (!this.columns) return undefined;
return this.getColumnAsEncoding(this.cField as EncodingKey);
}
// ...
This is how in the usage we saw plotStore.xField = 'x'
works as the setter for xField
will automatically create the RootEncoding
from the metadata. Of course we could always manually set the x encoding if we needed more customization, but generally the domain of x
is constant, its range is depednent on the transformation, etc.
Looking at the demo, we see how much the components help with making deepscatter
more accessible so others can appreciate all of your hard work.
Conclusion
deepscatter
is a great library. I can't wait for the NOMIC.ai team to continue to improve upon it. Special thanks to @bschmidt.
Acknowledgements
I'd like to take a moment to express my heartfelt gratitude to @bschmidt for all of their assistance while working with their library, their support, feedback and even provided suggestions to improve my project based on a M.W.E. Truly a kind person and had no reason to be so patient and understanding. 😇
🙏🙏🙏🙏🙏🙏🙏🙏
Links
-
quadfeather