
Fetch and cache data at runtime #839

Closed
@net

Description

With Observable, we're able to connect an external database, then query that database on notebook load, often interpolating inputs from the notebook into our queries. Observable then proxies those queries through its database proxy service.

This provides three important benefits:

  • Notebooks can use live data.
  • Queries can use data from inputs.
  • Sensitive DB credentials or API keys don't have to be leaked to the browser.
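
For reference, a notebook cell doing this looks something like the sketch below; the db client, schema, and placeholder syntax are illustrative, and the exact query API depends on the database type:

// `userID` comes from a notebook input; the query is proxied through
// Observable's database proxy, so credentials never reach the browser.
const purchases = await db.query(
  "SELECT created_at, amount FROM purchases WHERE user_id = ?",
  [userID]
);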

As far as I can tell, Framework currently has no way to handle this. It looks like Framework once had DatabaseClient support #13, but it was removed for unclear reasons #594, #359.


Proposal

Data loaders are awesome. However, I believe one specific detail of their current implementation, namely that they only run at build time, has real drawbacks.

To illustrate, let's look at the three main types of notebooks (I'm using Observable parlance, but you could also call these pages or dashboards):

Point-In-Time Notebooks

These are notebooks where an analysis is built around a fixed snapshot of data. They often have accompanying prose that explains the data, the analysis, and the authors' conclusions. The conclusions of these reports are fixed to a point in time. For these reports, you never want the underlying data to change after it's been analyzed—that could break the notebook! Data in the original source could've since been deleted or modified; the schema in the original source could've changed; the original source may not exist at all!

The current default workflow of Framework does not support these notebooks well, since the cache is .gitignored. If the project is deployed from a different machine than the one that originally deployed the notebook, then the built data will change, which will change what's rendered in the notebook. For these notebooks, you'd ideally want to check the data into git. This means an override in .gitignore, or better, writing to a file that's not in cache/ and loading that in the notebook instead.
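
As a sketch of that workaround, the snapshot file (the name here is illustrative) is generated once, committed to git, and loaded directly, so it can never change out from under the analysis:

// data/revenue-snapshot-2024-06-01.csv is checked into git and is not
// produced by a data loader, so the numbers discussed in the prose stay valid.
const revenue = await FileAttachment("./data/revenue-snapshot-2024-06-01.csv").csv({typed: true});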

Live-But-Static Notebooks

These are notebooks where the data should reflect the current state of the underlying source, but where the data is the same for all viewers at a given time. You might want to see revenue over time, updated daily, but the query for that data only needs to be made once a day. The query, or processing of the query result, might be expensive, so you want to cache the results rather than re-run it on each page load.

The current default workflow of Framework also does not support these notebooks well.

The current way to handle these is to set up deployment on a cron, for example with a GitHub action. However, this has the drawback that you're then re-querying and re-processing all your data on the cadence you set—every minute, or hour, or day, etc.—even for notebooks that no one has accessed, or will access, for months. As projects grow to include lots of notebooks, many of which will be viewed infrequently, this could end up being very expensive and slow, both for the underlying source database and for the process building and deploying the project.

Imagine your deployment taking five minutes because of an expensive data loader for a notebook that someone views once a month, just because another notebook that gets viewed every day needs up-to-date data.

This also means you have to choose between always deploying from a CI/CD setup, like a GitHub action, and being able to deploy from your local machine with yarn deploy, because your local machine will likely have cached files that are older than what the last CI/CD run deployed.

Dynamic Notebooks

These notebooks make queries at runtime with values provided by the user. For example, a notebook that lets you paste in a user ID and get an analysis of that single user's data. A company might have millions of users, so it would be infeasible to build every user's data into the deployment ahead of time, and so a query—with parameters—at runtime is necessary.

As far as I can tell, the only way to do this with Framework is to expose your DB credentials or API keys by building them into the bundle that gets sent to the browser, then query the database or other source directly from the browser.
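
Concretely, that means something like the following hypothetical cells, where the key is inlined into the client bundle and visible to every viewer (and every extension they run):

const API_KEY = "..."; // hard-coded or inlined at build time, so it ships to the browser

const userID = view(Inputs.text({ label: "User ID" }));

const user = await fetch(`https://api.example.com/users/${userID}`, {
  headers: {Authorization: `Bearer ${API_KEY}`}
}).then((response) => response.json());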

This is obviously far from ideal. It lets any user with access to the project (and any malicious extensions they might have installed) make arbitrary queries and retain credentials even after they've been removed from the project. You lose visibility into the custody of your credentials; instead of giving credentials to only Observable, you're giving credentials to every user that views the project and anything they have installed, and after that you have no idea where they could end up.

This also means that there's no built-in way to cache queries. You could cache in the browser, but you have to build that yourself, and it would only be per-client. You could cache in some external store, but you'd have to build that yourself too (and build the credentials for it into the deployment).


These, I believe, are the three main types of notebooks, and, as I've illustrated, the current design of Framework data loaders is not ideal for any of them.

However, there are two complementary changes that would make data loaders perfect for all three types while staying true to the concept of data loaders.

1. Support passing arguments to data loaders in the file name.

const userID = view(Inputs.text({ label: "User ID" }));

const data = await FileAttachment(`./data/userRecordings.csv?id=${userID}`).csv();
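
On the loader side, the parameters would need to reach the data loader process somehow. Here's a minimal sketch assuming, hypothetically, that Framework forwarded ?id=123 from the FileAttachment URL as a --id=123 command-line flag to a userRecordings.csv.js loader (queryRecordings is a placeholder for the real database call):

// userRecordings.csv.js: a hypothetical parameterized data loader.
// Assumes Framework forwards ?id=123 from the requesting page as --id=123.
import {parseArgs} from "node:util";

const {values} = parseArgs({options: {id: {type: "string"}}});

const rows = await queryRecordings(values.id); // placeholder for the real query
process.stdout.write("start,duration\n");
for (const row of rows) process.stdout.write(`${row.start},${row.duration}\n`);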

2. Run data loaders at runtime instead of build time.

Keep the cache, but don't build it into the deployment. Instead, run each data loader for a deployed project on Observable's servers, and cache the data there.

For point-in-time notebooks this makes it more explicit to users authoring such notebooks that they cannot rely on their local cache to keep data fixed, and must instead write their data to a file that's checked into git. It lets users always deploy from their local machine, regardless of the requirements of their project's pages.

For live-but-static notebooks this, in combination with the first change, lets you cache and rebuild your data at arbitrary intervals by passing an argument that changes when you want to reload your data.

const currentHour = Math.floor(Date.now() / 3600000) * 3600000; // truncate to the start of the current hour

const data = await FileAttachment(`./data/topUsers.csv?ms=${currentHour}`).csv();

And for dynamic notebooks, this lets you load data on input change while still using Framework data loaders, and without having to leak sensitive credentials to the browser. The data loader would be run on Observable's servers, and would be able to access the team's secrets set in their Observable settings.
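
A sketch of what such a runtime loader could look like, assuming team secrets were exposed to it as environment variables (DATABASE_URL and the postgres driver are illustrative choices, not an existing Framework feature):

// topUsers.csv.js: runs server-side, so the connection string never reaches the browser.
import postgres from "postgres"; // illustrative driver choice

const sql = postgres(process.env.DATABASE_URL); // secret from the team's Observable settings
const rows = await sql`SELECT name, total FROM top_users ORDER BY total DESC LIMIT 100`;

process.stdout.write("name,total\n");
for (const row of rows) process.stdout.write(`${row.name},${row.total}\n`);
await sql.end();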


I'm loving the new direction Framework is taking Observable in. I believe these two changes are both essential and fit perfectly within Framework's philosophy. ❤️
