Add Descript-Audio-Codec model #31494
Merged
49 commits
fe0e49b
dac model
kamilakesbi b985da9
original dac works
kamilakesbi 47afe66
add dac model
kamilakesbi f42af54
dac can be instatiated
kamilakesbi 255e479
add forward pass
kamilakesbi 44aa197
load weights
kamilakesbi a1b6b2b
all weights are used
kamilakesbi 31b1e6f
convert checkpoint script ready
kamilakesbi 375f826
test
kamilakesbi 21a7146
add feature extractor
kamilakesbi c4dce70
up
kamilakesbi 59166dd
make style
kamilakesbi 27811d7
apply cookicutter
kamilakesbi 1408e3a
fix tests
kamilakesbi 4366412
iterate on FeatureExtractor
kamilakesbi ace2197
nit
kamilakesbi a563b4f
update dac doc
kamilakesbi 3c21b38
replace nn.Sequential with nn.ModuleList
kamilakesbi 21072a9
nit
kamilakesbi 95d0d18
apply review suggestions 1/2
kamilakesbi cae002f
Update src/transformers/models/dac/modeling_dac.py
kamilakesbi 6b52abe
up
kamilakesbi af9cd69
apply review suggestions 2/2
kamilakesbi bf09ca8
update padding in FeatureExtractor
kamilakesbi 54a1ec6
apply review suggestions
kamilakesbi 3bc40c6
iterate on design and tests
kamilakesbi 01511b7
add integration tests
kamilakesbi 5cdf0ae
feature extractor tests
kamilakesbi 167cb8f
make style
kamilakesbi a4d1261
all tests pass
kamilakesbi 1fd2496
make style
kamilakesbi 09ec8b5
fixup
kamilakesbi a5ac7c6
apply review suggestions
kamilakesbi 284c75b
fix-copies
kamilakesbi 7512886
apply review suggestions
kamilakesbi dc2e85c
apply review suggestions
kamilakesbi c7318d5
Update docs/source/en/model_doc/dac.md
kamilakesbi fdb8ced
Update docs/source/en/model_doc/dac.md
kamilakesbi 5388663
anticipate transfer weights to descript
kamilakesbi fac14fd
up
kamilakesbi e088e0d
make style
kamilakesbi bfaef5e
apply review suggestions
kamilakesbi a473975
update slow test values
kamilakesbi 2be0f36
update slow tests
kamilakesbi c13180e
update test values
kamilakesbi 8c72cda
update with CI values
kamilakesbi 89b7143
update with vorace values
kamilakesbi 5b02249
update test with slice
kamilakesbi 1671917
make style
kamilakesbi
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
|
||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be | ||
rendered properly in your Markdown viewer. | ||
|
||
--> | ||
|
||
# DAC | ||
|
||
## Overview | ||
|
||
|
||
The DAC model was proposed in [Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN](https://arxiv.org/abs/2306.06546) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar. | ||
|
||
|
||
The Descript Audio Codec (DAC) model compresses audio data for efficient storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets. | ||
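
The 8 kbps figure can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the 44.1 kHz model uses a total stride of 512 samples, 9 codebooks, and 1024 entries per codebook (the structural defaults of `DacConfig` in this PR, not confirmed here for the 44.1 kHz checkpoint), and it assumes 16-bit mono PCM as the uncompressed baseline.

```python
import math

# Assumed values (see the note above); none of them are read from a checkpoint.
sampling_rate = 44_100      # Hz
hop_length = 512            # samples per latent frame: 2 * 4 * 8 * 8
n_codebooks = 9
codebook_size = 1024        # 2**10 entries, i.e. 10 bits per code
pcm_bits_per_sample = 16    # assumed uncompressed baseline: 16-bit mono PCM

frames_per_second = sampling_rate / hop_length
bits_per_frame = n_codebooks * math.log2(codebook_size)
codec_bitrate = frames_per_second * bits_per_frame   # bits per second
pcm_bitrate = sampling_rate * pcm_bits_per_sample    # bits per second
compression_ratio = pcm_bitrate / codec_bitrate

print(f"codec bitrate: {codec_bitrate / 1000:.2f} kbps")
print(f"compression ratio: {compression_ratio:.1f}x")
```

Under these assumptions the codec bitrate comes out a little under 8 kbps and the ratio close to 90x, which is consistent with the numbers quoted from the paper.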
|
||
The abstract from the paper is the following: | ||
|
||
*Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.* | ||
|
||
This model was contributed by [Kamil Akesbi](https://huggingface.co/kamilakesbi). | ||
The original code can be found [here](https://github.com/descriptinc/descript-audio-codec/tree/main?tab=readme-ov-file). | ||
|
||
|
||
## Model structure | ||
|
||
The Descript Audio Codec (DAC) model is structured into three distinct stages: | ||
|
||
1. Encoder Model: This stage compresses the input audio, reducing its size while retaining essential information. | ||
2. Residual Vector Quantizer (RVQ) Model: Working in tandem with the encoder, this model quantizes the latent codes of the audio, refining the compression and ensuring high-quality reconstruction. | ||
3. Decoder Model: This final stage reconstructs the audio from its compressed form, restoring it to a state that closely resembles the original input. | ||
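
The quantization stage can be illustrated with a toy residual vector quantizer. This is a minimal NumPy sketch under stated assumptions, not the `DacModel` implementation: the codebooks are random rather than learned, and `rvq_encode` and `rvq_decode` are hypothetical helper names. The sizes mirror the `DacConfig` defaults (9 codebooks, 1024 entries each, dimension 8).

```python
import numpy as np


def rvq_encode(latent, codebooks):
    """Pick one code per codebook, each quantizing the residual left by the previous stage."""
    residual = latent.copy()
    codes = []
    for codebook in codebooks:
        # Nearest codebook entry to the current residual (Euclidean distance).
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the latent as the sum of the selected codebook entries."""
    return np.sum([codebook[index] for index, codebook in zip(codes, codebooks)], axis=0)


rng = np.random.default_rng(0)
n_codebooks, codebook_size, codebook_dim = 9, 1024, 8

# Random codebooks with shrinking scale per stage (an arbitrary choice for this toy example).
codebooks = [rng.normal(size=(codebook_size, codebook_dim)) * 0.5**stage for stage in range(n_codebooks)]

latent = rng.normal(size=codebook_dim)
codes = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(codes, codebooks)

print("codes:", codes)
print("reconstruction error:", float(np.linalg.norm(latent - reconstruction)))
```

Each frame of audio is therefore represented by `n_codebooks` integer indices, which is what `audio_codes` carries in the real model.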
|
||
## Usage example | ||
|
||
Here is a quick example of how to encode and decode an audio sample using this model: | ||
|
||
```python | ||
|
||
>>> from datasets import load_dataset, Audio | ||
>>> from transformers import DacModel, AutoProcessor | ||
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | ||
|
||
>>> model = DacModel.from_pretrained("descript/dac_16khz") | ||
>>> processor = AutoProcessor.from_pretrained("descript/dac_16khz") | ||
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate)) | ||
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"] | ||
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt") | ||
|
||
>>> encoder_outputs = model.encode(inputs["input_values"]) | ||
>>> # Get the intermediate audio codes | ||
>>> audio_codes = encoder_outputs.audio_codes | ||
>>> # Reconstruct the audio from its quantized representation | ||
>>> audio_values = model.decode(encoder_outputs.quantized_representation) | ||
>>> # or the equivalent with a forward pass | ||
>>> audio_values = model(inputs["input_values"]).audio_values | ||
``` | ||
|
||
## DacConfig | ||
|
||
[[autodoc]] DacConfig | ||
|
||
## DacFeatureExtractor | ||
|
||
[[autodoc]] DacFeatureExtractor | ||
- __call__ | ||
|
||
## DacModel | ||
|
||
[[autodoc]] DacModel | ||
- decode | ||
- encode | ||
- forward |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -59,6 +59,7 @@ | |
cpmant, | ||
ctrl, | ||
cvt, | ||
dac, | ||
data2vec, | ||
dbrx, | ||
deberta, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# coding=utf-8 | ||
# Copyright 2024 Descript and The HuggingFace Inc. team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
from typing import TYPE_CHECKING | ||
|
||
from ...utils import ( | ||
OptionalDependencyNotAvailable, | ||
_LazyModule, | ||
is_torch_available, | ||
) | ||
|
||
|
||
_import_structure = { | ||
"configuration_dac": ["DacConfig"], | ||
"feature_extraction_dac": ["DacFeatureExtractor"], | ||
} | ||
|
||
try: | ||
if not is_torch_available(): | ||
raise OptionalDependencyNotAvailable() | ||
except OptionalDependencyNotAvailable: | ||
pass | ||
else: | ||
_import_structure["modeling_dac"] = [ | ||
"DacModel", | ||
"DacPreTrainedModel", | ||
] | ||
|
||
if TYPE_CHECKING: | ||
from .configuration_dac import ( | ||
DacConfig, | ||
) | ||
from .feature_extraction_dac import DacFeatureExtractor | ||
|
||
try: | ||
if not is_torch_available(): | ||
raise OptionalDependencyNotAvailable() | ||
except OptionalDependencyNotAvailable: | ||
pass | ||
else: | ||
from .modeling_dac import ( | ||
DacModel, | ||
DacPreTrainedModel, | ||
) | ||
|
||
else: | ||
import sys | ||
|
||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
# coding=utf-8 | ||
# Copyright 2024 Descript and The HuggingFace Inc. team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
"""Dac model configuration""" | ||
|
||
import math | ||
|
||
import numpy as np | ||
|
||
from ...configuration_utils import PretrainedConfig | ||
from ...utils import logging | ||
|
||
|
||
logger = logging.get_logger(__name__) | ||
|
||
|
||
class DacConfig(PretrainedConfig): | ||
|
||
r""" | ||
This is the configuration class to store the configuration of a [`DacModel`]. It is used to instantiate a | ||
Dac model according to the specified arguments, defining the model architecture. Instantiating a configuration | ||
with the defaults will yield a similar configuration to that of the | ||
[descript/dac_16khz](https://huggingface.co/descript/dac_16khz) architecture. | ||
|
||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the | ||
documentation from [`PretrainedConfig`] for more information. | ||
|
||
Args: | ||
encoder_hidden_size (`int`, *optional*, defaults to 64): | ||
Intermediate representation dimension for the encoder. | ||
downsampling_ratios (`List[int]`, *optional*, defaults to `[2, 4, 8, 8]`): | ||
Ratios for downsampling in the encoder. These are used in reverse order for upsampling in the decoder. | ||
decoder_hidden_size (`int`, *optional*, defaults to 1536): | ||
Intermediate representation dimension for the decoder. | ||
n_codebooks (`int`, *optional*, defaults to 9): | ||
Number of codebooks in the VQVAE. | ||
codebook_size (`int`, *optional*, defaults to 1024): | ||
Number of discrete codes in each codebook. | ||
codebook_dim (`int`, *optional*, defaults to 8): | ||
Dimension of the codebook vectors. | ||
quantizer_dropout (`bool`, *optional*, defaults to 0): | ||
Whether to apply dropout to the quantizer. | ||
commitment_loss_weight (`float`, *optional*, defaults to 0.25): | ||
Weight of the commitment loss term in the VQVAE loss function. | ||
codebook_loss_weight (`float`, *optional*, defaults to 1.0): | ||
Weight of the codebook loss term in the VQVAE loss function. | ||
sampling_rate (`int`, *optional*, defaults to 16000): | ||
The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz). | ||
Example: | ||
|
||
```python | ||
>>> from transformers import DacModel, DacConfig | ||
|
||
>>> # Initializing a "descript/dac_16khz" style configuration | ||
>>> configuration = DacConfig() | ||
|
||
>>> # Initializing a model (with random weights) from the "descript/dac_16khz" style configuration | ||
>>> model = DacModel(configuration) | ||
|
||
>>> # Accessing the model configuration | ||
>>> configuration = model.config | ||
```""" | ||
|
||
model_type = "dac" | ||
|
||
def __init__( | ||
self, | ||
encoder_hidden_size=64, | ||
downsampling_ratios=[2, 4, 8, 8], | ||
decoder_hidden_size=1536, | ||
n_codebooks=9, | ||
codebook_size=1024, | ||
codebook_dim=8, | ||
quantizer_dropout=0, | ||
commitment_loss_weight=0.25, | ||
codebook_loss_weight=1.0, | ||
sampling_rate=16000, | ||
**kwargs, | ||
): | ||
self.encoder_hidden_size = encoder_hidden_size | ||
self.downsampling_ratios = downsampling_ratios | ||
self.decoder_hidden_size = decoder_hidden_size | ||
self.upsampling_ratios = downsampling_ratios[::-1] | ||
self.n_codebooks = n_codebooks | ||
self.codebook_size = codebook_size | ||
self.codebook_dim = codebook_dim | ||
self.quantizer_dropout = quantizer_dropout | ||
self.sampling_rate = sampling_rate | ||
|
||
self.hidden_size = encoder_hidden_size * (2 ** len(downsampling_ratios)) | ||
|
||
self.hop_length = int(np.prod(downsampling_ratios)) | ||
self.commitment_loss_weight = commitment_loss_weight | ||
self.codebook_loss_weight = codebook_loss_weight | ||
|
||
super().__init__(**kwargs) | ||
|
||
@property | ||
def frame_rate(self) -> int: | ||
hop_length = np.prod(self.upsampling_ratios) | ||
return math.ceil(self.sampling_rate / hop_length) |
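
To make the derived attributes concrete, the arithmetic in `__init__` and `frame_rate` can be reproduced in plain Python. This is a minimal sketch that copies the default values from the signature above and uses `math.prod` in place of `np.prod`; it does not import or instantiate `DacConfig`.

```python
import math

# Defaults copied from the DacConfig signature in this diff.
encoder_hidden_size = 64
downsampling_ratios = [2, 4, 8, 8]
sampling_rate = 16000

# Decoder upsampling reverses the encoder's downsampling order.
upsampling_ratios = downsampling_ratios[::-1]

# Number of input samples covered by one latent frame.
hop_length = math.prod(downsampling_ratios)

# The channel count doubles once per downsampling stage.
hidden_size = encoder_hidden_size * (2 ** len(downsampling_ratios))

# Latent frames per second of audio, rounded up as in the frame_rate property.
frame_rate = math.ceil(sampling_rate / hop_length)

print(upsampling_ratios, hop_length, hidden_size, frame_rate)
```

With these defaults one latent frame spans 512 samples, so one second of 16 kHz audio maps to 32 frames after rounding up.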