When profiling large pipelines, it seems gc() (via target_gc()) is a bottleneck, even under the default settings (tar_option_get("garbage_collection") equal to 1000). When I profiled a million-target pipeline for 30 minutes:
# _targets.R
library(targets)
tar_option_set(
  controller = crew::crew_controller_sequential()
)
list(
  tar_target(datasets, seq_len(1e6), memory = "persistent"),
  tar_target(models, datasets, pattern = map(datasets), retrieval = "main")
)

# Profile tar_make() in the current session for up to 30 minutes.
proffer::pprof(targets::tar_make(callr_function = NULL), seconds_timeout = 30 * 60)
I saw gc() take up a large share of the profile.
I strongly suspect gc() slows way down if there are a large number of small objects in memory. This does not bode well for the targets local process.
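
As a rough, self-contained sketch of that suspicion (not taken from the profile above; the object count is arbitrary), one can compare the cost of a collection in a session holding many small objects against the same call in a fresh session:

# Fill the session with about a million small objects, then time a full
# collection pass; compare against system.time(gc()) in a fresh R session.
small_objects <- lapply(seq_len(1e6), function(i) i)
system.time(gc())

If gc() time grows with the number of live objects, then periodic collections in the local process get more expensive exactly when the pipeline is large.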
For parallel workers, gc() may continue to be important, especially on clusters with shared nodes. But it may not be worth the overhead for the local process in targets. I think it would be more efficient, as well as clearer and simpler, to only ever call gc() where the target itself runs.
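
As a minimal sketch of that idea from the user side (not a proposed API change, and not a configuration I would recommend for a million tiny targets, where a per-target gc() adds its own overhead), the collection can be moved into the target's command so it happens on whichever process builds the target:

list(
  tar_target(datasets, seq_len(1e6), memory = "persistent"),
  tar_target(
    models,
    {
      gc()      # collect on the process that runs this target
      datasets  # same command as in the pipeline above
    },
    pattern = map(datasets),
    retrieval = "main"
  )
)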