How to efficiently schedule per-user Dagster jobs? #32843
Unanswered
Riahiamirreza asked this question in Q&A
Replies: 1 comment
Hey @Riahiamirreza,
I’m building a Dagster orchestration pipeline to process heart-rate (time-series) data for my users. Each user’s data is processed via a separate job. However, many users have little or no new data between ticks, so I don’t want to launch unnecessary runs.
Here’s my current challenge:
- I use a `ScheduleDefinition` with a cron schedule (every 10 minutes).
- Its `execution_fn` runs heavy logic: for each user, it checks whether there is new data since the last run. This check is expensive because there are many users and each one requires its own database query; it takes more than 60 seconds, so the schedule evaluation always times out.

My question:
Is this an anti-pattern in Dagster (doing heavy domain logic inside `execution_fn`)? What is the recommended best practice:

- Keep the check inside the schedule's `execution_fn` and increase the timeout (if possible)?
- Launch a run for every user on each tick and do the new-data check inside the job itself?

The second approach is much simpler, and I initially implemented it, but it soon failed. The number of jobs was too large, and after a couple of days it produced a huge backlog of queued runs (about 500k) that stalled the system: every time Dagster wanted to launch a run, it fetched ALL the queued runs from the database and sorted them by priority, and with that many queued runs there was a long delay before new runs could start.
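To make the cost concrete, here is roughly the difference between the per-user check I run today and a batched version (a plain-Python/SQLite stand-in, not my actual code; the `heart_rate` table and its columns are hypothetical):

```python
import sqlite3

def users_with_new_data_naive(conn, user_ids, since):
    # Current approach: one query per user -- O(users) round trips,
    # which is what blows past the 60 s schedule-evaluation timeout.
    fresh = []
    for uid in user_ids:
        row = conn.execute(
            "SELECT 1 FROM heart_rate WHERE user_id = ? AND recorded_at > ? LIMIT 1",
            (uid, since),
        ).fetchone()
        if row:
            fresh.append(uid)
    return fresh

def users_with_new_data_batched(conn, since):
    # Batched alternative: a single query returns exactly the users
    # that have at least one sample newer than `since`.
    rows = conn.execute(
        "SELECT DISTINCT user_id FROM heart_rate WHERE recorded_at > ?",
        (since,),
    ).fetchall()
    return sorted(r[0] for r in rows)
```

With the batched form the tick does one round trip regardless of the number of users, so the evaluation time stops scaling with the user count.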
I found `DAGSTER_SENSOR_GRPC_TIMEOUT_SECONDS`, but I'm not sure whether that applies to schedules as well.
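If the evaluation itself can be made cheap, what I'm imagining is a cursor plus stable run keys so each tick is idempotent and never re-queues work it has already requested. A plain-Python sketch of that tick logic (hypothetical names, not the Dagster API):

```python
def evaluate_tick(cursor_ts, samples):
    """Given the last cursor timestamp and (user_id, timestamp) samples,
    return run requests only for users with data newer than the cursor,
    plus the advanced cursor. A stable run key per (user, cursor window)
    lets the scheduler skip duplicates instead of queueing them again.
    Timestamps are ISO-8601 strings, so string comparison orders them."""
    new = [(u, t) for (u, t) in samples if t > cursor_ts]
    if not new:
        return [], cursor_ts  # nothing to do this tick
    new_cursor = max(t for _, t in new)
    requests = [
        {"run_key": f"{u}@{new_cursor}", "user_id": u}
        for u in sorted({u for u, _ in new})
    ]
    return requests, new_cursor
```

Because a repeated tick over the same data advances nothing and emits no requests, the queue can never accumulate duplicate runs for idle users.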