Data loader support #89
Conversation
src/dataloader.ts
Outdated
```ts
const command = new Promise<void>((resolve, reject) => {
  const cacheTempPath = outputPath + ".tmp";
  open(cacheTempPath, "w").then((cacheFd) => {
    const cacheFileStream = cacheFd.createWriteStream({highWaterMark: 1024 * 1024});
```
I'm curious, what does highWaterMark do?
It's the internal buffer size for the stream. This lets Node automatically regulate backpressure as long as the Streams API is used in a compatible way.
Why do we need to set this to 1M instead of the default 16k? Higher throughput on modern machines?
Yes, 16k seemed too small to me. We are not doing anything with the data, there's no streaming or parallel processing, so having a large buffer makes sense to me.
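To make the discussion above concrete, here's a small standalone demo (not part of the PR) of how `highWaterMark` drives backpressure: `write()` returns `false` as soon as the stream's internal buffer reaches the limit, signaling the producer to pause until `"drain"` fires.

```js
import {Writable} from "node:stream";

// A throwaway sink that flushes asynchronously, just to observe backpressure.
const sink = new Writable({
  highWaterMark: 1024 * 1024, // 1 MiB internal buffer, as in the PR's write stream
  write(chunk, encoding, callback) {
    setImmediate(callback); // pretend the data was flushed
  }
});

const chunk = Buffer.alloc(64 * 1024); // 64 KiB per write
let accepted = 0;
while (sink.write(chunk)) accepted++; // false once ≥ 1 MiB is queued

// 15 writes return true; the 16th fills the buffer to exactly 1 MiB.
console.log(`queued ${accepted + 1} chunks before backpressure`); // queued 16 chunks before backpressure
sink.once("drain", () => console.log("drained, safe to write again"));
```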
Awesome! Here's a friction log of my first tests:

- I added a FileAttachment reference in my markdown file.
- I created a loader script.
- I received a […]
- I fixed the error above by calling […]

All the while, the dev server kept crashing on many of these errors. I think we should be very defensive about any file or process operation (i.e. not assume that because a file was there and executable 1ms ago it is still there now, and still executable); maybe wrap everything in a big try/catch block to make sure we're able (1) to recover and (2) to report any error to the user?
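One way to be defensive along these lines (a sketch with assumed names, not the PR's code): funnel every outcome of a loader run into a resolvable result, so a script that vanished or lost its execute bit between the check and the spawn reports an error instead of crashing the server.

```js
import {spawn} from "node:child_process";

// Run a loader so that any failure (missing file, lost execute bit, spawn
// error) resolves to a reportable result instead of throwing asynchronously.
function runLoader(path, args = []) {
  return new Promise((resolve) => {
    try {
      const child = spawn(path, args);
      child.on("error", (error) => resolve({ok: false, error})); // e.g. ENOENT, EACCES
      child.on("exit", (code) => resolve({ok: code === 0, code}));
    } catch (error) {
      resolve({ok: false, error}); // synchronous spawn failures
    }
  });
}
```

The server can then inspect `ok` and surface `error` to the user rather than letting an unhandled `"error"` event take down the process.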
In this page I call the same FileAttachment in three places:

```js
display(await FileAttachment("data/insee-communes.json").url());
```
```js
const db = await DuckDBClient.of({communes: FileAttachment("data/insee-communes.json")});
display(Inputs.table(db.query(`FROM communes`)));
```
```js
display(await FileAttachment("data/insee-communes.json").json());
```

The loader itself is:

```js
#!/usr/bin/env node
import {spawn} from "node:child_process";

console.warn("starting data loader…");

spawn("/usr/bin/env", [
  "duckdb",
  "-json",
  ":memory:",
  `
SELECT code_commune_ref
     , SUM(nb_adresses)::INT n
     , COUNT(*)::INT c
FROM read_parquet('https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
`
])
  .on("error", (error) => console.error(`data loader error: ${error.message}`))
  .on("exit", () => console.warn("ending data loader…"))
  .stdout.on("data", (data) => {
    console.warn("writing some data");
    process.stdout.write(data);
  });
```

This works well once the cache is set; but when I call it with no cache, the server crashes and the client is full of errors. Probably because the loader is called three times? dev server log:
On the page I see: […]
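One possible fix for the triple invocation (a sketch with assumed names, not the PR's implementation): share a single in-flight run per output path, so that three concurrent FileAttachment requests for the same file trigger the loader once.

```js
// Map from output path to the promise of its in-flight loader run.
const running = new Map();

// The first caller starts the run; concurrent callers await the same promise.
// The entry is cleared afterwards so a later cache miss can rerun the loader.
function loadOnce(outputPath, run) {
  let promise = running.get(outputPath);
  if (!promise) {
    promise = run().finally(() => running.delete(outputPath));
    running.set(outputPath, promise);
  }
  return promise;
}
```

With this in place, the three FileAttachment references above would all resolve from one loader run instead of racing each other on the cache file.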
Force-pushed from c4d336f to 5d4fa2f.
lgtm! In the future we'll want to add:
- documentation
- as much error handling as possible

but I'd rather merge it sooner so we can play with it more widely.
I fixed the creation of the cache directory, which is now .observablehq/cache. I added .sh as a valid extension, but did not remove .js and .ts as discussed, since that breaks scripts' #! lines; that will need more discussion.
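For illustration, the extension lookup described above could be sketched like this (the function name and probe order are assumptions; the CLI's actual precedence may differ):

```js
import {access} from "node:fs/promises";

// Assumed loader extensions, per the discussion above (.sh newly added).
const LOADER_EXTENSIONS = [".js", ".ts", ".sh"];

// Given a requested file path (e.g. "data.csv"), return the first existing
// loader script ("data.csv.js", then "data.csv.ts", then "data.csv.sh"),
// or null when no loader exists.
async function findLoader(filePath) {
  for (const ext of LOADER_EXTENSIONS) {
    try {
      await access(filePath + ext);
      return filePath + ext;
    } catch {
      // not found with this extension; keep probing
    }
  }
  return null;
}
```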
Squashed commits:

* build filters files outside the root; reverts part of #89, fixes #99
* fix tests and test imports of non-existing files
* more tests (but I'm not sure if I'm using this correctly)
* Update test/input/bar.js
* fix test
* fix test, align signatures
* don’t canReadSync in isLocalPath
* syntax error on non-local file path

Co-authored-by: Mike Bostock <[email protected]>
This PR implements support for data loaders in the CLI. A data loader is a server-executed script that generates a file that can be referenced as a FileAttachment or static fetch() in a page. A data loader is set up by creating a script file with the name of its target data file, plus a .js or .ts extension. For example, to generate a file `data.csv`, you would create a data loader script at `data.csv.js`. When the server receives the file request for `data.csv` and the file doesn't exist, it looks for `data.csv.js` or `data.csv.ts`; if one of those is found, it is run with its output saved into a generated-files directory, and from then on that generated file is returned as the value for `data.csv`.
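Following that description, a minimal loader could look like this (a hypothetical example; the file name `data.csv.js` and its contents are invented here to illustrate the contract):

```js
// data.csv.js — a hypothetical data loader: the server runs this script when
// data.csv is requested and captures stdout as the generated file's contents.
const csv = ["id,name", "1,alpha", "2,beta"].join("\n") + "\n";
process.stdout.write(csv);
```

Anything printed to stdout becomes the contents of `data.csv`; stderr (as in the DuckDB example above) is free for logging.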