Rethink garbage collection #1464

@wlandau

Description

When profiling large pipelines, it seems gc() (via target_gc()) is a bottleneck, even under the default settings (tar_option_get("garbage_collection") equal to 1000). When I profiled a million-target pipeline for 30 minutes:

library(targets)
tar_option_set(
  controller = crew::crew_controller_sequential()
)
list(
  tar_target(datasets, seq_len(1e6), memory = "persistent"),
  tar_target(models, datasets, pattern = map(datasets), retrieval = "main")
)
proffer::pprof(targets::tar_make(callr_function = NULL), seconds_timeout = 30 * 60)

I saw:

[profiler screenshot omitted]

I strongly suspect gc() slows way down when there are a large number of small objects in memory. This does not bode well for the local targets process.

For parallel workers, gc() may still be important, especially on clusters with shared nodes. But it may not be worth the overhead in the local targets process. I think it would be faster, as well as clearer and simpler, to only ever call gc() where the target itself runs.
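As a rough illustration of the suspected effect (not taken from the profile above), here is a minimal R sketch that times a full garbage collection before and after allocating a large number of small objects. The object count and absolute timings are arbitrary; the point is only the relative growth in gc() elapsed time.

```r
# Time a full gc() with relatively few objects in memory.
baseline <- system.time(gc(full = TRUE))["elapsed"]

# Allocate a large number of small objects
# (a list of one million length-1 vectors).
small_objects <- lapply(seq_len(1e6), function(i) i)

# Time a full gc() again: the collector now has to scan
# many more individually allocated objects.
loaded <- system.time(gc(full = TRUE))["elapsed"]

cat("gc() elapsed, few objects:", baseline, "\n")
cat("gc() elapsed, many small objects:", loaded, "\n")
```

If the suspicion holds, the second timing should grow with the number of live small objects, which would explain why a million-target pipeline spends so much time in target_gc() even at the default collection interval.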
