Data loader support #89

Merged · 9 commits · Nov 3, 2023
Conversation

@wiltsecarpenter (Contributor) commented Nov 2, 2023

This PR implements support for data loaders in the CLI. A data loader is a server-executed script that generates a file that can be referenced as a FileAttachment or static fetch() in a page. To set one up, create a script file named after its target data file, plus a .js or .ts extension; for example, to generate a file data.csv, you would create a data loader script at data.csv.js. When the server receives a request for data.csv and the file doesn't exist, it looks for data.csv.js or data.csv.ts; if one is found, it is run, its output is saved into a generated-files directory, and from then on that generated file is returned as the value for data.csv.
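A minimal sketch of such a loader (file name and data are hypothetical, not from this PR): a docs/data.csv.js that emits CSV on stdout, which the server would then cache and serve as data.csv.

```javascript
// docs/data.csv.js — a hypothetical minimal data loader.
// The CLI runs this script and saves whatever it writes to stdout
// as the contents of data.csv.
const rows = [
  ["id", "value"],
  [1, "alpha"],
  [2, "beta"]
];
const csv = rows.map((row) => row.join(",")).join("\n") + "\n";
process.stdout.write(csv);
```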

@wiltsecarpenter wiltsecarpenter linked an issue Nov 2, 2023 that may be closed by this pull request
@wiltsecarpenter wiltsecarpenter marked this pull request as ready for review November 3, 2023 00:14
```ts
const command = new Promise<void>((resolve, reject) => {
  const cacheTempPath = outputPath + ".tmp";
  open(cacheTempPath, "w").then((cacheFd) => {
    const cacheFileStream = cacheFd.createWriteStream({highWaterMark: 1024 * 1024});
```
Contributor:

I'm curious, what does highWaterMark do?

Member:

It's the internal buffer size for the stream. This lets Node automatically regulate backpressure as long as the Streams API is used in a compatible way.

Contributor:

Why do we need to set this to 1M instead of the default 16k? Higher throughput on modern machines?

Contributor (Author):

Yes, 16k seemed too small to me. We are not doing anything with the data, there's no streaming or parallel processing, so having a large buffer makes sense to me.
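A small sketch of the option under discussion (the path is made up; assumes an ESM context with top-level await): highWaterMark sets the writable stream's internal buffer size, which governs when writes start signaling backpressure.

```javascript
// Sketch (Node.js): highWaterMark sets the stream's internal buffer size.
// write() returns false, signaling backpressure, once this many bytes are
// buffered; the default for writable streams is 16 KiB.
import {open} from "node:fs/promises";
import {tmpdir} from "node:os";
import {join} from "node:path";

const fd = await open(join(tmpdir(), "cache-demo.tmp"), "w");
const stream = fd.createWriteStream({highWaterMark: 1024 * 1024});
console.log(stream.writableHighWaterMark); // 1048576
stream.end("hello\n");
```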

@Fil (Contributor) commented Nov 3, 2023

Awesome! Here's a friction log of my first tests.

I added a FileAttachment reference in my markdown file

```js
display(await FileAttachment("lol.csv").text());
```

I created _cache/docs/ manually

  • suggestion: create it automatically
  • _cache/ is currently a sibling of docs/, maybe it could be made configurable in the future (docs/ might be on an unwritable mount)
  • suggestion: add _cache to .gitignore

I created a loader script lol.csv.js in the docs/ directory

  • suggestion: document where it must be created (this issue's description only mentions its name).
  • suggestion: document that you can have a data loader in docs/data/happy.csv.ts if you reference FileAttachment("data/happy.csv")

I received a 404 HttpError: Data loader is not executable error; however, the preview page showed only "Not found".

  • suggestion: displaying the actual server error string in the page would be helpful

I fixed the error above by calling chmod +x docs/lol.csv.js, then received a new error: spawn Unknown system error -8. I knew this was because my data loader script did not yet begin with #! /bin/env node (it was a placeholder file), but a nicer error message would help folks here.

All the while, the dev server kept crashing on many of these errors. I think we should be very defensive about any file or process operation (i.e., not assume that a file that was there and executable 1 ms ago is still there now, and still executable); maybe wrap everything in a big try/catch block so we can 1. recover and 2. report any error to the user?
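A hedged sketch of that defensive pattern (function name and error messages are hypothetical, not the PR's actual implementation): funnel every spawn-time and runtime failure into a rejected promise the server can catch and report, instead of letting an uncaught exception crash the process.

```javascript
import {spawn} from "node:child_process";

// Hypothetical sketch: run a data loader defensively, so a missing or
// non-executable script rejects with a useful message rather than
// crashing the dev server.
function runLoader(path) {
  return new Promise((resolve, reject) => {
    const child = spawn(path, [], {stdio: ["ignore", "pipe", "inherit"]});
    const chunks = [];
    // Fires when the process cannot be spawned at all (ENOENT, EACCES, …).
    child.on("error", (error) => reject(new Error(`loader failed to start: ${error.message}`)));
    child.stdout.on("data", (chunk) => chunks.push(chunk));
    // Fires after a successful spawn; a nonzero code is still a failure.
    child.on("exit", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`loader exited with code ${code}`));
    });
  });
}
```

The server's request handler can then await runLoader(...) inside a try/catch and turn any rejection into an error response for the page.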

@Fil (Contributor) commented Nov 3, 2023

In this page I call the same FileAttachment in three places:

```js
display(await FileAttachment("data/insee-communes.json").url());
```

```js
const db = await DuckDBClient.of({communes: FileAttachment("data/insee-communes.json")});
display(Inputs.table(db.query(`FROM communes`)));
```

```js
display(await FileAttachment("data/insee-communes.json").json());
```

The loader itself is:

```js
#! /usr/bin/env node
import {spawn} from "node:child_process";

console.warn("starting data loader…");
spawn("/usr/bin/env", [
  "duckdb",
  "-json",
  ":memory:",
  `
SELECT code_commune_ref
     , SUM(nb_adresses)::INT n
     , COUNT(*)::INT c
  FROM read_parquet('https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet')
 GROUP BY 1
 ORDER BY 2 DESC
 LIMIT 100
  `
])
  .on("error", (error) => console.error(`data loader error: ${error.message}`))
  .on("exit", () => console.warn("ending data loader…"))
  .stdout.on("data", (data) => {
    console.warn("writing some data");
    process.stdout.write(data);
  });
```

This works well once the cache is populated; but when I call it with a cold cache, the server crashes and the client is full of errors, probably because the loader is called three times?

dev server log:

```
Server running at http://127.0.0.1:3000/
...
GET /_file/data/insee-communes.json
starting data loader…
HEAD /_file/data/insee-communes.json
starting data loader…
ending data loader…
ending data loader…
node:fs:1047
  handleErrorFromBinding(ctx);
  ^

Error: ENOENT: no such file or directory, rename '_cache/docs/data/insee-communes.json.tmp' -> '_cache/docs/data/insee-communes.json'
    at renameSync (node:fs:1047:3)
    at <anonymous> (/Users/fil/Source/cli/src/dataloader.ts:29:15) {
  errno: -2,
  syscall: 'rename',
  code: 'ENOENT',
  path: '_cache/docs/data/insee-communes.json.tmp',
  dest: '_cache/docs/data/insee-communes.json'
}
```

On the page I see:

```
Error: Opening file 'data/insee-communes.json' failed with error: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'http://127.0.0.1:3000/_file/data/insee-communes.json'.
```
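The ENOENT on rename is consistent with two concurrent requests running the same loader and racing on the same .tmp file. One possible fix, sketched with hypothetical names (not the PR's actual code): memoize in-flight runs per output path so that concurrent requests share a single promise and the .tmp file is written and renamed exactly once.

```javascript
// Hypothetical sketch: deduplicate concurrent runs of the same loader.
// All requests for the same output path await one in-flight promise.
const inflight = new Map();

function loadOnce(outputPath, run) {
  let promise = inflight.get(outputPath);
  if (!promise) {
    // First caller starts the loader; later callers reuse its promise.
    promise = run(outputPath).finally(() => inflight.delete(outputPath));
    inflight.set(outputPath, promise);
  }
  return promise;
}
```

With this in place, the GET and HEAD requests in the log above would both await the same run instead of each spawning the loader and racing to rename the .tmp file.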

@wiltsecarpenter wiltsecarpenter changed the title First cut and data loader support Data loader support Nov 3, 2023
@Fil (Contributor) left a comment

lgtm! In the future we'll want to add:

  • documentation
  • as much error handling as possible

but I'd rather merge it sooner so we can play with it more widely.

@wiltsecarpenter (Contributor, Author)

I fixed the creation of the cache directory, which is now .observablehq/cache. I added .sh as a valid extension, but did not remove .js and .ts as discussed, since that breaks the #! handling of scripts; that will need more discussion.

@wiltsecarpenter wiltsecarpenter merged commit e715830 into main Nov 3, 2023
@wiltsecarpenter wiltsecarpenter deleted the wiltse/loaders branch November 3, 2023 20:27
@cinxmo cinxmo mentioned this pull request Nov 3, 2023
Fil added a commit that referenced this pull request Nov 6, 2023
This was referenced Nov 6, 2023
mbostock added a commit that referenced this pull request Nov 9, 2023
* build filters files outside the root
reverts part of #89

fixes #99

* fix tests and test imports of non-existing files

* more tests but I'm not sure if I'm using this correctly

* Update test/input/bar.js

Co-authored-by: Mike Bostock <[email protected]>

* fix test

* fix test, align signatures

* don’t canReadSync in isLocalPath

* syntax error on non-local file path

---------

Co-authored-by: Mike Bostock <[email protected]>
Successfully merging this pull request may close these issues.

Data loaders