Proposal: Performance optimizations to reduce unnecessary cloning and improve memory efficiency #669

@pcb111111111

Hi maintainers,

I've been reviewing the inference code in DeepFilterNet and noticed several opportunities to reduce memory allocations and avoid redundant data copying. Below are three targeted suggestions that could improve runtime performance, especially on low-resource or real-time systems.

  1. Optimize rolling buffer updates using std::mem::swap

Before (per frame):

self.rolling_spec_buf_x.push_back(self.spec_buf.clone());

Proposed optimization:
Use std::mem::swap (O(1)) to exchange buffers instead of cloning them, so that at most one clone per frame remains:

self.rolling_spec_buf_y.push_back(self.temp_spec_buf.clone());  // only one clone if needed

Alternatively, if the buffers can be moved instead of cloned, the copies are eliminated entirely.
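
A minimal sketch of the swap idea, assuming the rolling buffer is a VecDeque that always holds a fixed number of equally sized frames. `RollingState`, `rolling`, and `push_frame` are hypothetical stand-ins, and plain `Vec<f32>` replaces the real tensor type; this is not the DeepFilterNet code.

```rust
use std::collections::VecDeque;
use std::mem;

/// Hypothetical stand-in for the model state; names are illustrative only.
struct RollingState {
    rolling: VecDeque<Vec<f32>>,
    spec_buf: Vec<f32>,
}

impl RollingState {
    fn new(frames: usize, frame_len: usize) -> Self {
        Self {
            // All frame buffers are allocated once, up front.
            rolling: (0..frames).map(|_| vec![0.0; frame_len]).collect(),
            spec_buf: vec![0.0; frame_len],
        }
    }

    /// Pushes the current frame without allocating: the oldest buffer is
    /// popped, swapped with `spec_buf` in O(1), and pushed to the back.
    /// Afterwards `spec_buf` holds the stale contents of the oldest frame
    /// and must be fully overwritten before it is read again.
    fn push_frame(&mut self) {
        let mut recycled = self
            .rolling
            .pop_front()
            .expect("rolling buffer is never empty");
        mem::swap(&mut recycled, &mut self.spec_buf);
        self.rolling.push_back(recycled);
    }
}
```

This only works if nothing reads `spec_buf` after the push within the same frame; if it is still needed, one copy remains.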

  2. Avoid cloning in synthesis() by accepting slices instead of owned data

Before:

state.synthesis(
    spec_ch.to_owned().as_slice_mut().unwrap(), // clones ~4KB
    enh_out_ch.as_slice_mut().unwrap(),
);

Proposed change:
Modify synthesis() to accept immutable input:

pub fn synthesis(&mut self, input: &[Complex32], output: &mut [f32])

Then call without cloning:

state.synthesis(
    spec_ch.as_slice().unwrap(), // zero-copy
    enh_out_ch.as_slice_mut().unwrap(),
);

This eliminates an unnecessary 4KB allocation per channel per frame.
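
One caveat worth sketching: if the inverse transform inside synthesis() needs a mutable input buffer as scratch space (an assumption about the implementation, not something confirmed above), the immutable signature can still avoid the per-frame allocation by copying into a buffer that is allocated once. In the sketch below, `Synth`, `scratch`, and `inverse_in_place` are hypothetical, `Complex32` is a local stand-in for the real complex type, and the transform is a placeholder rather than a real FFT.

```rust
/// Local stand-in for the real complex type.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Complex32 {
    re: f32,
    im: f32,
}

/// Hypothetical synthesis state holding a scratch buffer allocated once.
struct Synth {
    scratch: Vec<Complex32>,
}

impl Synth {
    fn new(n_freqs: usize) -> Self {
        Self { scratch: vec![Complex32::default(); n_freqs] }
    }

    /// Takes the spectrum by shared reference. The copy into `scratch`
    /// does not allocate; it panics if the lengths differ.
    fn synthesis(&mut self, input: &[Complex32], output: &mut [f32]) {
        self.scratch.copy_from_slice(input);
        inverse_in_place(&mut self.scratch, output);
    }
}

/// Placeholder for a transform that clobbers its input. NOT a real FFT.
fn inverse_in_place(spec: &mut [Complex32], out: &mut [f32]) {
    for (o, s) in out.iter_mut().zip(spec.iter_mut()) {
        *o = s.re;
        *s = Complex32::default();
    }
}
```

The caller's spectrum stays untouched, and the only remaining cost is a copy of the frame rather than an allocation plus a copy.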

  3. Reuse pre-allocated input buffers for Tract model inference

Before:

let mut enc_emb = self.enc.run(tvec!(
    self.erb_buf.clone(),      // ~128 bytes
    TValue::from(self.cplx_buf.clone().into_tensor()...)  // ~4KB
))?;

Proposed optimization:
Pre-allocate and reuse the input vector to avoid repeated allocations:

let mut enc_input = self.enc_input_buffer.take();
enc_input[0] = self.erb_buf.clone();  
enc_input[1] = TValue::from(...);
let mut enc_emb = self.enc.run(enc_input)?;

While cloning may still be unavoidable due to Tract’s ownership requirements, reusing the outer Vec reduces allocator pressure.
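
A rough sketch of the take-and-restore pattern this relies on. Taking the container out of `self` leaves an empty one behind, so it only counts as reuse if the container is put back after the call. `Encoder`, `input_buf`, and `run_model` are hypothetical stand-ins: `run_model` hands its inputs back, which is an assumption made for illustration and is not Tract's API.

```rust
/// Hypothetical state; `input_buf` plays the role of `enc_input_buffer`.
struct Encoder {
    input_buf: Vec<Vec<f32>>,
}

/// Stand-in for a model call that takes ownership of its inputs.
/// Unlike a real inference call, it returns the container for reuse.
fn run_model(inputs: Vec<Vec<f32>>) -> (f32, Vec<Vec<f32>>) {
    let sum: f32 = inputs.iter().flatten().sum();
    (sum, inputs)
}

impl Encoder {
    fn new() -> Self {
        // Input buffers are allocated once, with fixed lengths 4 and 8.
        Self { input_buf: vec![vec![0.0; 4], vec![0.0; 8]] }
    }

    fn step(&mut self, erb: &[f32], cplx: &[f32]) -> f32 {
        // mem::take leaves an empty Vec in the field until it is restored.
        let mut inputs = std::mem::take(&mut self.input_buf);
        inputs[0].copy_from_slice(erb); // overwrite in place, no allocation
        inputs[1].copy_from_slice(cplx);
        let (out, inputs) = run_model(inputs);
        // Restore the container; otherwise the next frame would index
        // into an empty Vec and panic.
        self.input_buf = inputs;
        out
    }
}
```

If the real call consumes its inputs without returning them, the container cannot be recovered this way and the saving disappears.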

These changes aim to minimize per-frame heap allocations and memory copies, which should reduce latency, allocator pressure, and cache misses. I’d be happy to submit a PR if these ideas align with the project’s direction.

Thanks for your great work on DeepFilterNet!
