Proposal: Performance optimizations to reduce unnecessary cloning and improve memory efficiency

Hi maintainers,

I've been reviewing the inference code in DeepFilterNet and noticed several opportunities to reduce memory allocations and avoid redundant data copying. Below are three targeted suggestions that could improve runtime performance, especially on low-resource or real-time systems.

1. Optimize rolling buffer updates using std::mem::swap

Before (per frame):
```
self.rolling_spec_buf_y.push_back(self.spec_buf.clone()); // ~4KB copy
self.rolling_spec_buf_x.push_back(self.spec_buf.clone());
```

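Proposed optimization:
Use std::mem::swap, which exchanges the two buffers in O(1) without copying their contents, to move the current frame into a scratch buffer:
```
// temp_spec_buf: a scratch buffer of the same type as spec_buf
std::mem::swap(&mut self.temp_spec_buf, &mut self.spec_buf);
self.rolling_spec_buf_y.push_back(self.temp_spec_buf.clone());  // only one clone if needed
```
After the swap, spec_buf holds the previous scratch contents, so it must be overwritten before it is read again. Alternatively, if the buffers can be moved instead of cloned, that would be even better, since it would eliminate the copies entirely.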
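
A related option is to recycle the frame that falls out of the rolling buffer. The sketch below is only an illustration under assumptions: `RollingBuf` and `push_recycled` are hypothetical names, and `Vec<f32>` stands in for the real tensor type. It still copies the samples once per rolling buffer, but it reuses the oldest frame's allocation instead of allocating a fresh clone.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one rolling buffer; `Vec<f32>` replaces the
/// real tensor type purely for illustration.
struct RollingBuf {
    frames: VecDeque<Vec<f32>>,
}

impl RollingBuf {
    /// Pre-allocates `len` frames of `frame_size` zeros (expects `len >= 1`).
    fn new(len: usize, frame_size: usize) -> Self {
        Self {
            frames: (0..len).map(|_| vec![0.0; frame_size]).collect(),
        }
    }

    /// Drops the oldest frame and appends `frame`, reusing the oldest
    /// frame's allocation instead of cloning into a new one.
    fn push_recycled(&mut self, frame: &[f32]) {
        let mut slot = self.frames.pop_front().expect("buffer is never empty");
        // Copies the samples; reallocates only if `frame` exceeds the capacity.
        slot.clear();
        slot.extend_from_slice(frame);
        self.frames.push_back(slot);
    }
}
```

With two rolling buffers, as in the snippet above, each would call `push_recycled` with a borrow of the current frame, giving two copies but no per-frame allocation.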

2. Avoid cloning in synthesis() by accepting slices instead of owned data
Before:
```
state.synthesis(
    spec_ch.to_owned().as_slice_mut().unwrap(), // clones ~4KB
    enh_out_ch.as_slice_mut().unwrap(),
);
```
Proposed change:
Modify synthesis() to accept immutable input:
```
pub fn synthesis(&mut self, input: &[Complex32], output: &mut [f32])
```
Then call without cloning:
```
state.synthesis(
    spec_ch.as_slice().unwrap(), // zero-copy
    enh_out_ch.as_slice_mut().unwrap(),
);
```
This eliminates an unnecessary ~4KB allocation and copy per channel per frame, provided synthesis() does not need to modify its input in place.
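
The signature change can be sketched as follows. This is a minimal illustration, not the project's implementation: `Synth` and its `scratch` field are hypothetical, the `Complex32` struct is a local stand-in, and the loop is a placeholder for the real inverse transform. It shows one way to keep the input immutable even if the transform needs a mutable working copy: copy into a scratch buffer that is allocated once.

```rust
/// Minimal stand-in for a complex sample type (hypothetical; the real
/// code would use its own `Complex32`).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex32 {
    re: f32,
    im: f32,
}

/// Hypothetical synthesis state that owns a reusable scratch buffer.
struct Synth {
    scratch: Vec<Complex32>,
}

impl Synth {
    /// Allocates the scratch buffer once, up front.
    fn new(n_freqs: usize) -> Self {
        Self {
            scratch: vec![Complex32 { re: 0.0, im: 0.0 }; n_freqs],
        }
    }

    /// Takes the spectrum by shared reference, so callers need no `to_owned()`.
    /// If the transform must mutate its input, the copy goes into the
    /// pre-allocated scratch buffer rather than a fresh allocation.
    fn synthesis(&mut self, input: &[Complex32], output: &mut [f32]) {
        self.scratch.copy_from_slice(input); // panics if the lengths differ
        // Placeholder for the real inverse transform: writes |c|^2 per bin.
        for (out, c) in output.iter_mut().zip(self.scratch.iter()) {
            *out = c.re * c.re + c.im * c.im;
        }
    }
}
```

The caller's spectrum is left untouched, and the only per-call cost is the copy into `scratch`; if the transform can read its input directly, even that copy can be dropped.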

3. Reuse pre-allocated input buffers for Tract model inference

Before:
```
let mut enc_emb = self.enc.run(tvec!(
    self.erb_buf.clone(),      // ~128 bytes
    TValue::from(self.cplx_buf.clone().into_tensor()...)  // ~4KB
))?;
```
Proposed optimization:
Pre-allocate and reuse the input vector to avoid repeated allocations:
```
let mut enc_input = self.enc_input_buffer.take();
enc_input[0] = self.erb_buf.clone();  
enc_input[1] = TValue::from(...);
let mut enc_emb = self.enc.run(enc_input)?;
```
While cloning may still be unavoidable due to Tract’s ownership requirements, reusing the outer input vector could reduce allocator pressure, provided it can be recovered after run() takes it by value.
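
For comparison, here is what full reuse looks like when the inference call can borrow its inputs. This is a sketch under a strong assumption: `MockModel::run` taking `&[Vec<f32>]` is hypothetical and is not Tract's API, and `Runtime` and `process_frame` are illustrative names. With an API that takes ownership of its inputs, the clones discussed above would still be needed.

```rust
/// Hypothetical model whose `run` borrows its inputs. This is an assumption
/// for illustration and is not Tract's API.
struct MockModel;

impl MockModel {
    /// Placeholder inference: returns the sum of each input buffer.
    /// The returned Vec is still allocated on every call.
    fn run(&self, inputs: &[Vec<f32>]) -> Vec<f32> {
        inputs.iter().map(|buf| buf.iter().sum::<f32>()).collect()
    }
}

/// Hypothetical runtime state holding input buffers that are allocated once.
struct Runtime {
    model: MockModel,
    enc_inputs: Vec<Vec<f32>>,
}

impl Runtime {
    /// Allocates both input buffers up front.
    fn new(erb_len: usize, cplx_len: usize) -> Self {
        Self {
            model: MockModel,
            enc_inputs: vec![vec![0.0; erb_len], vec![0.0; cplx_len]],
        }
    }

    /// Overwrites the pre-allocated inputs in place, then runs the model.
    /// Panics if a slice length differs from the buffer it is copied into.
    fn process_frame(&mut self, erb: &[f32], cplx: &[f32]) -> Vec<f32> {
        self.enc_inputs[0].copy_from_slice(erb);
        self.enc_inputs[1].copy_from_slice(cplx);
        self.model.run(&self.enc_inputs)
    }
}
```

Here the input buffers keep the same allocations across frames; only the contents change.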

These changes aim to minimize per-frame heap allocations and memory copies, which should improve latency and reduce allocator pressure and cache misses. I’d be happy to submit a PR if these ideas align with the project’s direction.

Thanks for your great work on DeepFilterNet!

Proposal: Performance optimizations to reduce unnecessary cloning and improve memory efficiency #669
