
Replace pydub as dependency and use multithreaded resampling (exports 20x faster) #8


Draft · wants to merge 13 commits into main
Conversation

@Bentroen (Member) commented on Dec 21, 2024

This pull request aims to tackle the multiple issues described in #6.

Changes

In its current state, the branch is capable of exporting files up to 20x faster than v0.4.0 by using multithreaded resampling. It replaces pydub (unmaintained and backed by the deprecated audioop module) with python-samplerate, a wrapper for libsamplerate. The methods we used from pydub have been ported into our library's code and adapted to work with numpy arrays, which by itself already makes exports a whole lot faster. But that's not all!

By virtue of a yet unmerged pull request (tuxu/python-samplerate#14), samplerate releases the GIL while resampling, letting us leverage the full power of the CPU:

[Screenshots: CPU usage while exporting]

It does so by precalculating all the resampling operations that have to be made, for the entire song, and batch-submitting them to concurrent.futures so they can be spread across different CPU threads. As each operation is completed, the resampled sounds are then overlaid into the final song file with the proper volume and panning.
Each operation also stores a context containing every position this sound has to be 'stamped' at in the final audio file, as well as the panning and volume each of those instances has to be played at. Segments with the volume and panning applied are also reused in all places where they appear, avoiding many multiplication operations. This was already done in v0.4.0, but since the pitch and panning operations applied via audioop were really slow, that optimization didn't really shine.
Together, these techniques enable a clever optimization of the export method that lets us avoid a lot of work entirely. Knowing details of the structure of a typical note block song, such as its limited pool of pitch variations, proved really useful in designing an efficient system.
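
For illustration, here's a minimal sketch of that batching scheme. The names (`ResampleJob`, `run_jobs`, `placements`) are hypothetical, not the actual implementation, and the mixing step is heavily simplified (no panning, no bounds checking):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field

import numpy as np
import samplerate


@dataclass
class ResampleJob:
    """One unique (sound, pitch) combination and every place it appears."""
    samples: np.ndarray  # float32, shape (frames, channels)
    ratio: float         # combined resampling factor for this pitch
    placements: list = field(default_factory=list)  # (position, volume) pairs


def run_jobs(jobs: list[ResampleJob], track: np.ndarray) -> None:
    # samplerate releases the GIL while resampling, so a plain thread pool
    # is enough to spread the heavy work across CPU cores.
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(samplerate.resample, j.samples, j.ratio, "sinc_best"): j
                   for j in jobs}
        for future in as_completed(futures):
            job = futures[future]
            resampled = future.result().astype(np.float32)
            # Stamp the same resampled sound at every position it occurs.
            for position, volume in job.placements:
                end = position + len(resampled)
                track[position:end] += resampled * volume
```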

It also cleverly avoids unnecessary conversions to sync the sound files to the sample rate, sample width and channel count of the mixer. For instance, when mixing at 44.1 kHz and loading an audio file at 48 kHz (e.g. all the instruments added after 1.12), nbswave v0.4.0 would first convert the segment to 44.1 kHz (one resampling operation), then resample it again to get it at the proper pitch. This was wasteful and unnecessary, as it can be done in a single operation: when applying the actual pitch conversion, we just have to multiply the resampling factor by 44100 / 48000.
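
As a quick sketch of that combined factor (names here are illustrative, not nbswave's actual code): libsamplerate's ratio is output length over input length, so the rate conversion and the pitch shift fold into a single multiplication.

```python
def combined_ratio(source_rate: int, target_rate: int, pitch: float) -> float:
    # Converting 48 kHz -> 44.1 kHz alone would use 44100 / 48000;
    # raising the pitch by `pitch` shortens the output by a further 1 / pitch.
    return (target_rate / source_rate) / pitch

# One resampling pass instead of two:
ratio = combined_ratio(48000, 44100, pitch=1.5)  # ≈ 0.6125
```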

Additionally, the math used internally in all audio operations now uses float32 arrays, as a consequence of this being the format returned by libsndfile and the format libsamplerate works with. This means it's no longer necessary to use an oversized array to avoid clipping, as was previously done (int16 segments mixed into an int32 array): the float format has enough headroom to avoid clipping entirely during mixing.
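
A small illustration of the headroom argument (not library code): intermediate sums in a float32 buffer may go past ±1.0 without wrapping around, so clipping only has to be handled once, when the final file is written.

```python
import numpy as np

mix = np.zeros((44100, 2), dtype=np.float32)
segment = np.full((1000, 2), 0.9, dtype=np.float32)

# Overlap several loud segments: the running sum exceeds 1.0, which an
# int16 accumulator could not represent, but float32 handles it fine.
for _ in range(5):
    mix[:1000] += segment

mix = np.clip(mix, -1.0, 1.0)  # applied once, at export time
```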

Finally, the track is now mixed at the target sample rate and channel count, rather than being converted/resampled only at the end of the process, which should make the output more accurate and the processing faster.

Results

Here's a comparison of the elapsed export time when exporting the demo file included in the repository:

[Charts: elapsed export time for the demo file, v0.4.0 vs. this branch]

For a typical 3-minute song, exporting shouldn't take longer than 15 seconds with the best resampling method.

With the outlined changes, the export time for the Megacollab file (250k+ notes) has decreased from 8 minutes to under a minute with the best resampling method available (sinc_best), and to just under 40 seconds with the cheapest (linear).

To-do

There are still a few things to polish and explore before merging this pull request:

  • Optimize panning calculation (pydub expected gain to be provided in dB, but the methods are our own now, so we may freely change the implementation)
  • Check if libsndfile supports all previously supported formats, and whether it can auto-detect the target format based on the filename (a direct call to ffmpeg may be necessary for some formats?)
    https://libsndfile.github.io/libsndfile/formats.html
  • Check compatibility of the existing public interface with the new export method
  • Make mono tracks be mixed in mono, and stereo tracks be mixed in stereo (it currently works only with stereo signals internally)
  • Explore further work avoidance by resampling audio segments in mono, and only splitting the channels when panning must be applied (could lead to further optimization, but could also make it slower due to the need to duplicate the signal later in the chain, leading to more operations)
  • Refactor and clean up multithreaded resampling logic
  • Possibly allow the resampling method to be picked at export time (a ResamplingMethod enum? see the sketch after this list)
  • Check memory usage of submitting all resampling operations at once
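
Regarding the ResamplingMethod idea above, one possible shape for such an enum (purely speculative, not part of this PR) would simply map to libsamplerate's converter names:

```python
from enum import Enum

class ResamplingMethod(Enum):
    # Values map directly to libsamplerate converter names.
    SINC_BEST = "sinc_best"
    SINC_MEDIUM = "sinc_medium"
    SINC_FASTEST = "sinc_fastest"
    ZERO_ORDER_HOLD = "zero_order_hold"
    LINEAR = "linear"

# e.g. samplerate.resample(samples, ratio, ResamplingMethod.SINC_BEST.value)
```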

- The `Mixer` class now uses a 2D numpy array for all operations (I don't know why it didn't use one before) instead of a 1D array with one entry per sample *per channel* (see the sketch after this list).
- Sounds are now processed internally as `float32` instead of `int16` (with an oversized `int32` array for mixing), mostly as a consequence of it being the output format for `samplerate`. Fortunately, it is a good outcome - float arrays can just handle the overflow we throw at them.
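
A rough sketch of the layout change described in the first bullet (shapes only; the real `Mixer` does far more than this):

```python
import numpy as np

seconds, rate = 3, 44100

old_layout = np.zeros(seconds * rate * 2, dtype=np.int16)     # interleaved: one entry per sample per channel
new_layout = np.zeros((seconds * rate, 2), dtype=np.float32)  # 2D: (samples, channels)

# With the 2D layout, per-channel gains broadcast in one operation:
new_layout[:1000] += np.full((1000, 2), 0.5, dtype=np.float32) * np.array([0.8, 0.6], dtype=np.float32)
```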

This commit strives to change as little as possible of the internal workings, as it is a refactor. But this replacement opens up many opportunities for refactoring further aspects of the code, and even making it more efficient - since we now don't rely on handling audio data the way `pydub` expects us to.

Fixes #6
`nbswave` was 'syncing' all segments to the same sample rate before setting the speed and overlaying them. This is mostly because `pydub` (the library we used before) did it that way, so we actually resampled most of the notes twice. This, however, is not needed: we can do a single resampling operation by storing the original sound's sample rate and factoring it into the speed multiplier.

For instance, to make a 48kHz file 1.5x faster with target sample rate 44.1kHz, we do:
`1.5 * 48000 / 44100`
Avoids having to deal with channel conversion/'stacking' at the end, when every individual resampled segment would otherwise have to be duplicated.

Attempted to use `np.column_stack`, `np.repeat` and `np.tile` to enforce stereo only when we know it's needed (at the `apply_gain_stereo` method), and they all caused the bulk of the performance bottleneck when run over all resampled segments.
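
For reference, a simplified sketch of the kind of late stereo promotion that was benchmarked (not the final code); the `column_stack` copy is the part that dominated when applied to every resampled segment:

```python
import numpy as np

def apply_gain_stereo(samples: np.ndarray, left_gain: float, right_gain: float) -> np.ndarray:
    if samples.ndim == 1:
        # Mono segment: duplicate it into two channels right before panning.
        # This copy is cheap for one segment but adds up across thousands.
        samples = np.column_stack((samples, samples))
    return samples * np.array([left_gain, right_gain], dtype=np.float32)
```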