Rethink garbage collection #1464

@wlandau

Description

When profiling large pipelines, it seems gc() (via target_gc()) is a bottleneck, even under the default settings (tar_option_get("garbage_collection") equal to 1000). When I profiled a million-target pipeline for 30 minutes:

library(targets)
tar_option_set(
  controller = crew::crew_controller_sequential()
)
list(
  tar_target(datasets, seq_len(1e6), memory = "persistent"),
  tar_target(models, datasets, pattern = map(datasets), retrieval = "main")
)
proffer::pprof(targets::tar_make(callr_function = NULL), seconds_timeout = 30 * 60)

I saw:

[profiler screenshot omitted]

I strongly suspect gc() slows way down when there are a large number of small objects in memory. This does not bode well for the local targets process.

For parallel workers, gc() may still be important, especially on clusters with shared nodes. But it may not be worth the overhead in the local targets process. I think it would be faster, as well as clearer and simpler, to only ever call gc() where the target itself runs.
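As a rough illustration of the suspected effect (not taken from the profile above), here is a minimal R sketch that times a full garbage collection before and after allocating a large number of small objects. The object count and absolute timings are arbitrary; the point is only the relative growth in gc() elapsed time.

```r
# Time a full gc() with relatively few objects in memory.
baseline <- system.time(gc(full = TRUE))["elapsed"]

# Allocate a large number of small objects
# (a list of one million length-1 vectors).
small_objects <- lapply(seq_len(1e6), function(i) i)

# Time a full gc() again: the collector now has to scan
# many more individually allocated objects.
loaded <- system.time(gc(full = TRUE))["elapsed"]

cat("gc() elapsed, few objects:", baseline, "\n")
cat("gc() elapsed, many small objects:", loaded, "\n")
```

If the suspicion holds, the second timing should grow with the number of live small objects, which would explain why a million-target pipeline spends so much time in target_gc() even at the default collection interval.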
