Data loader support #89

Merged · 9 commits · Nov 3, 2023
Conversation

@wiltsecarpenter (Contributor) commented Nov 2, 2023

This PR implements support for data loaders in the CLI. A data loader is a server-executed script that generates a file that can be referenced as a FileAttachment or static fetch() in a page. To set one up, create a script file named after its target data file, plus a .js or .ts extension; for example, to generate a file data.csv, you would create a data loader script at data.csv.js. When the server receives a request for data.csv and the file doesn't exist, it looks for data.csv.js or data.csv.ts; if one is found, it is run, its output is saved into a generated-files directory, and from then on that generated file is returned as the value for data.csv.
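A minimal sketch of such a loader (file name and data are hypothetical, not from this PR): a docs/data.csv.js that emits CSV on stdout, which the server would then cache and serve as data.csv.

```javascript
// docs/data.csv.js — a hypothetical minimal data loader.
// The CLI runs this script and saves whatever it writes to stdout
// as the contents of data.csv.
const rows = [
  ["id", "value"],
  [1, "alpha"],
  [2, "beta"]
];
const csv = rows.map((row) => row.join(",")).join("\n") + "\n";
process.stdout.write(csv);
```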

@wiltsecarpenter wiltsecarpenter linked an issue Nov 2, 2023 that may be closed by this pull request
@wiltsecarpenter wiltsecarpenter marked this pull request as ready for review November 3, 2023 00:14
```ts
const command = new Promise<void>((resolve, reject) => {
  const cacheTempPath = outputPath + ".tmp";
  open(cacheTempPath, "w").then((cacheFd) => {
    const cacheFileStream = cacheFd.createWriteStream({highWaterMark: 1024 * 1024});
```
Contributor:

I'm curious, what does highWaterMark do?

Member:

It's the internal buffer size for the stream. This lets Node automatically regulate backpressure as long as the Streams API is used in a compatible way.

Contributor:

Why do we need to set this to 1M instead of the default 16k? Higher throughput on modern machines?

Contributor (Author):

Yes, 16k seemed too small to me. We are not doing anything with the data, there's no streaming or parallel processing, so having a large buffer makes sense to me.
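A small sketch of the option under discussion (the path is made up; assumes an ESM context with top-level await): highWaterMark sets the writable stream's internal buffer size, which governs when writes start signaling backpressure.

```javascript
// Sketch (Node.js): highWaterMark sets the stream's internal buffer size.
// write() returns false, signaling backpressure, once this many bytes are
// buffered; the default for writable streams is 16 KiB.
import {open} from "node:fs/promises";
import {tmpdir} from "node:os";
import {join} from "node:path";

const fd = await open(join(tmpdir(), "cache-demo.tmp"), "w");
const stream = fd.createWriteStream({highWaterMark: 1024 * 1024});
console.log(stream.writableHighWaterMark); // 1048576
stream.end("hello\n");
```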

@Fil (Contributor) commented Nov 3, 2023

Awesome! Here's a friction log of my first tests.

I added a FileAttachment reference in my markdown file

```js
display(await FileAttachment("lol.csv").text());
```

I created _cache/docs/ manually

  • suggestion: create it automatically
  • _cache/ is currently a sibling of docs/, maybe it could be made configurable in the future (docs/ might be on an unwritable mount)
  • suggestion: add _cache to .gitignore

I created a loader script lol.csv.js in the docs/ directory

  • suggestion: document where it must be created (this issue's description only mentions its name).
  • suggestion: document that you can have a data loader in docs/data/happy.csv.ts if you reference FileAttachment("data/happy.csv")

I received a 404 HttpError: Data loader is not executable error; however, the preview page showed only "Not found".

  • suggestion: displaying the actual server error string in the page would be helpful

I fixed the error above by calling chmod +x docs/lol.csv.js, then received a new error: spawn Unknown system error -8. I knew this was because my data loader script did not yet begin with #! /bin/env node (it was a placeholder file), but a nicer error message would help folks here.

All the while, the dev server kept crashing on many of these errors. I think we should be very defensive about any file or process operation (i.e., not assume that a file that was there and executable 1 ms ago is still there now, and still executable); maybe wrap everything in a big try/catch block so we can 1. recover and 2. report any error to the user?
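A hedged sketch of that defensive pattern (function name and error messages are hypothetical, not the PR's actual implementation): funnel every spawn-time and runtime failure into a rejected promise the server can catch and report, instead of letting an uncaught exception crash the process.

```javascript
import {spawn} from "node:child_process";

// Hypothetical sketch: run a data loader defensively, so a missing or
// non-executable script rejects with a useful message rather than
// crashing the dev server.
function runLoader(path) {
  return new Promise((resolve, reject) => {
    const child = spawn(path, [], {stdio: ["ignore", "pipe", "inherit"]});
    const chunks = [];
    // Fires when the process cannot be spawned at all (ENOENT, EACCES, …).
    child.on("error", (error) => reject(new Error(`loader failed to start: ${error.message}`)));
    child.stdout.on("data", (chunk) => chunks.push(chunk));
    // Fires after a successful spawn; a nonzero code is still a failure.
    child.on("exit", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`loader exited with code ${code}`));
    });
  });
}
```

The server's request handler can then await runLoader(...) inside a try/catch and turn any rejection into an error response for the page.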

@Fil (Contributor) commented Nov 3, 2023

In this page I call the same FileAttachment in three places:

```js
display(await FileAttachment("data/insee-communes.json").url());
```

```js
const db = await DuckDBClient.of({communes: FileAttachment("data/insee-communes.json")});
display(Inputs.table(db.query(`FROM communes`)));
```

```js
display(await FileAttachment("data/insee-communes.json").json());
```

The loader itself is:

```js
#! /usr/bin/env node
import {spawn} from "node:child_process";

console.warn("starting data loader…");
spawn("/usr/bin/env", [
  "duckdb",
  "-json",
  ":memory:",
  `
SELECT code_commune_ref
     , SUM(nb_adresses)::INT n
     , COUNT(*)::INT c
  FROM read_parquet('https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet')
 GROUP BY 1
 ORDER BY 2 DESC
 LIMIT 100
  `
])
  .on("error", (error) => console.error(`data loader error: ${error.message}`))
  .on("exit", () => console.warn("ending data loader…"))
  .stdout.on("data", (data) => {
    console.warn("writing some data");
    process.stdout.write(data);
  });
```

This works well once the cache is populated; but when I call it with a cold cache, the server crashes and the client is full of errors, probably because the loader is called three times?

dev server log:

```
Server running at http://127.0.0.1:3000/
...
GET /_file/data/insee-communes.json
starting data loader…
HEAD /_file/data/insee-communes.json
starting data loader…
ending data loader…
ending data loader…
node:fs:1047
  handleErrorFromBinding(ctx);
  ^

Error: ENOENT: no such file or directory, rename '_cache/docs/data/insee-communes.json.tmp' -> '_cache/docs/data/insee-communes.json'
    at renameSync (node:fs:1047:3)
    at <anonymous> (/Users/fil/Source/cli/src/dataloader.ts:29:15) {
  errno: -2,
  syscall: 'rename',
  code: 'ENOENT',
  path: '_cache/docs/data/insee-communes.json.tmp',
  dest: '_cache/docs/data/insee-communes.json'
}
```

On the page I see:

```
Error: Opening file 'data/insee-communes.json' failed with error: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'http://127.0.0.1:3000/_file/data/insee-communes.json'.
```
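The ENOENT on rename is consistent with two concurrent requests running the same loader and racing on the same .tmp file. One possible fix, sketched with hypothetical names (not the PR's actual code): memoize in-flight runs per output path so that concurrent requests share a single promise and the .tmp file is written and renamed exactly once.

```javascript
// Hypothetical sketch: deduplicate concurrent runs of the same loader.
// All requests for the same output path await one in-flight promise.
const inflight = new Map();

function loadOnce(outputPath, run) {
  let promise = inflight.get(outputPath);
  if (!promise) {
    // First caller starts the loader; later callers reuse its promise.
    promise = run(outputPath).finally(() => inflight.delete(outputPath));
    inflight.set(outputPath, promise);
  }
  return promise;
}
```

With this in place, the GET and HEAD requests in the log above would both await the same run instead of each spawning the loader and racing to rename the .tmp file.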

@wiltsecarpenter wiltsecarpenter changed the title First cut and data loader support Data loader support Nov 3, 2023
@Fil (Contributor) left a comment

lgtm! In the future we'll want to add:

  • documentation
  • as much error handling as possible

but I'd rather merge it sooner so we can play with it more widely.

@wiltsecarpenter (Contributor, Author)

I fixed the creation of the cache directory, which is now .observablehq/cache. I added .sh as a valid extension, but did not remove .js and .ts as discussed, since that breaks the #! handling of scripts; that will need more discussion.

@wiltsecarpenter wiltsecarpenter merged commit e715830 into main Nov 3, 2023
@wiltsecarpenter wiltsecarpenter deleted the wiltse/loaders branch November 3, 2023 20:27
@cinxmo cinxmo mentioned this pull request Nov 3, 2023
Fil added a commit that referenced this pull request Nov 6, 2023
This was referenced Nov 6, 2023
mbostock added a commit that referenced this pull request Nov 9, 2023
* build filters files outside the root
reverts part of #89

fixes #99

* fix tests and test imports of non-existing files

* more tests but I'm not sure if I'm using this correctly

* Update test/input/bar.js

Co-authored-by: Mike Bostock <[email protected]>

* fix test

* fix test, align signatures

* don’t canReadSync in isLocalPath

* syntax error on non-local file path

---------

Co-authored-by: Mike Bostock <[email protected]>
Successfully merging this pull request may close these issues.

Data loaders