Add xcodec2 model #37868
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Feel free to ping me once this is ready!
```python
if padding_mask is not None:
    # Expected token length, as in: https://github.com/zhenye234/X-Codec-2.0/blob/ccbbf340ff143dfa6a0ea7cd61ec34a8ba2f1c3d/inference_save_code.py#L89
    audio_length = padding_mask.sum(dim=-1, keepdim=True).cpu()
    token_length = audio_length // self.hop_length
    codes_padding_mask = torch.zeros(audio_codes.shape, dtype=padding_mask.dtype)
    idx = torch.arange(audio_codes.shape[-1]).view(1, -1)
    codes_padding_mask = (idx < token_length).to(padding_mask.dtype).to(padding_mask.device)
```
Using the provided padding mask to compute the corresponding padding mask in token "space".
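A minimal sketch of what this mapping does; the hop length, shapes, and values below are invented for illustration and are not taken from the PR:

```python
import torch

hop_length = 320                       # assumed codec hop length
padding_mask = torch.ones(2, 1600)     # batch of 2 clips, 1600 samples each
padding_mask[1, 960:] = 0              # second clip is padded after 960 samples

audio_length = padding_mask.sum(dim=-1, keepdim=True)    # (2, 1): 1600, 960
token_length = audio_length // hop_length                # (2, 1): 5, 3

num_codes = 5                                            # stands in for audio_codes.shape[-1]
idx = torch.arange(num_codes).view(1, -1)                # (1, 5)
codes_padding_mask = (idx < token_length).long()         # (2, 5)
print(codes_padding_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0]])
```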
```python
    audio: torch.Tensor,
    audio_spectrogram: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
```
Removed feature extraction, so that the spectrogram is computed in the feature extractor and passed to the model. Also using the new names 😉
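A hedged sketch of the resulting call pattern. Only the argument names `audio`, `audio_spectrogram`, and `padding_mask` come from the diff above; the stand-in feature extractor, output keys, and shapes are assumptions for illustration, not the PR's actual API:

```python
import torch


def fake_feature_extractor(raw_audio: torch.Tensor) -> dict:
    """Stand-in for the real feature extractor: batches the audio, builds a
    placeholder spectrogram, and returns a padding mask."""
    num_frames = raw_audio.shape[-1] // 320                       # assumed hop length
    return {
        "audio": raw_audio.unsqueeze(0),                          # (1, samples)
        "audio_spectrogram": torch.randn(1, 80, num_frames),      # (1, mel bins, frames), placeholder
        "padding_mask": torch.ones(1, raw_audio.shape[-1], dtype=torch.long),
    }


inputs = fake_feature_extractor(torch.randn(16000))
# The model then consumes the precomputed spectrogram directly, e.g.:
# audio_codes = model.encode(**inputs)
```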
```python
    return spectrogram_list


def spectrogram_torch(
```
Added a Torch equivalent to spectrogram_batched (namely mel feature extraction with Kaldi-style pre-processing, which I didn't see supported in other torch implementations), so that torch/GPU is supported by the feature extractor. Could also update SeamlessM4T?
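For reference, torchaudio already exposes a Kaldi-compatible mel filter bank; the snippet below is not the code added in this PR, just an illustration of the Kaldi-style pre-processing being discussed (mel bin count and scaling are assumptions):

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)   # (channels, samples), fake 1 s clip at 16 kHz
mel = torchaudio.compliance.kaldi.fbank(
    waveform * 32768.0,            # Kaldi convention assumes int16-scale samples, not [-1, 1]
    num_mel_bins=80,
    sample_frequency=16000.0,
    frame_length=25.0,             # ms
    frame_shift=10.0,              # ms
)
print(mel.shape)                   # (num_frames, 80)
```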
""" | ||
Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs | ||
and hence the waveform should not be normalized before feature extraction. | ||
Get mel-filter bank features using Numpy method to mimic Kaldi. |
Update docstring since it wasn't using TorchAudio!
```python
    return y


class ISTFTHead(nn.Module):
```
Note to self: could be imported from Vocos when merged: #39403
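For context, a Vocos-style ISTFT head looks roughly like the sketch below (names and hyper-parameters are illustrative, not the PR's implementation): project hidden states to log-magnitude and phase, assemble a complex spectrogram, and invert it with torch.istft.

```python
import torch
from torch import nn


class TinyISTFTHead(nn.Module):
    """Toy ISTFT head: linear projection -> (log-magnitude, phase) -> torch.istft."""

    def __init__(self, hidden_size: int, n_fft: int, hop_length: int):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # n_fft + 2 channels = 2 * (n_fft // 2 + 1): half magnitude, half phase
        self.out = nn.Linear(hidden_size, n_fft + 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size)
        x = self.out(hidden_states).transpose(1, 2)         # (batch, n_fft + 2, frames)
        log_magnitude, phase = x.chunk(2, dim=1)            # each (batch, n_fft // 2 + 1, frames)
        magnitude = torch.exp(log_magnitude).clamp(max=1e2)
        spec = magnitude * (torch.cos(phase) + 1j * torch.sin(phase))  # complex spectrogram
        window = torch.hann_window(self.n_fft, device=spec.device)
        return torch.istft(
            spec, self.n_fft, hop_length=self.hop_length,
            win_length=self.n_fft, window=window, center=True,
        )


head = TinyISTFTHead(hidden_size=512, n_fft=1280, hop_length=320)
waveform = head(torch.randn(1, 50, 512))   # -> (1, (50 - 1) * 320) samples
```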
run-slow: xcodec2
This comment contains run-slow, running the specified jobs: models: ['models/xcodec2']
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, dac, seamless_m4t, xcodec, xcodec2
```python
        raise ValueError("Padding must be 'center' or 'same'.")


class Xcodec2ISTFTHead(nn.Module):
```
Note to self: this and Xcodec2ISTFT can be imported from Vocos when merged: #39403
run-slow: xcodec2
This comment contains run-slow, running the specified jobs: models: ['models/xcodec2']
Could you please provide an estimated timeline for when this PR is expected to be merged into the main branch of transformers?
What does this PR do?
This PR adds support for XCodec2, a high-fidelity general neural audio codec used in Llasa, a text-to-speech model, to the Transformers library.
This model is composed of 5 components:
This is still a draft PR. Work done so far:
`modeling_xcodec2.py` and `modular_xcodec2.py`.
Todo
Who can review?
cc: @ArthurZucker
cc: @eustlb @Vaibhavs10 for visibility