Conversation

@Deep-unlearning Deep-unlearning commented Apr 29, 2025

What does this PR do?

This PR adds support for XCodec2, a high-fidelity general neural audio codec used in the Llasa text-to-speech model, to the Transformers library.

This model is composed of 5 components (see the sketch after this list for how they fit together):

  • A Semantic Encoder
  • An Acoustic Encoder
  • A VectorQuantizer
  • A Semantic Decoder
  • An Acoustic Decoder
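
At a high level, these components compose roughly as in the pseudocode sketch below. This is based on the original X-Codec-2.0 design; all attribute and method names here are illustrative assumptions, not the final Transformers API.

import torch

def xcodec2_encode(model, audio, audio_spectrogram):
    # Semantic features (e.g. from a speech SSL backbone) capture content,
    # acoustic features capture fine-grained signal detail.
    semantic = model.semantic_encoder(audio)
    acoustic = model.acoustic_encoder(audio_spectrogram)
    # Both streams are fused and quantized into a single stream of discrete codes.
    fused = torch.cat([semantic, acoustic], dim=-1)
    audio_codes = model.quantizer(fused)
    return audio_codes

def xcodec2_decode(model, audio_codes):
    quantized = model.quantizer.dequantize(audio_codes)
    # The semantic decoder reconstructs semantic features (used as a reconstruction
    # objective); the acoustic decoder turns the quantized representation back into a waveform.
    audio = model.acoustic_decoder(quantized)
    return audio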

This is still a draft PR. Work done so far:

  • Adapted the model to Transformers format in modeling_xcodec2.py and modular_xcodec2.py.

Todo

  • Add the checkpoint conversion scripts and push to the hub
  • Support batch inference
  • Write Tests
  • Add documentation

Who can review?

cc: @ArthurZucker
cc: @eustlb @Vaibhavs10 for visibility

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

Feel free to ping me once this is ready!

@Deep-unlearning Deep-unlearning changed the title [WiP] Add xcodec model [WiP] Add xcodec2 model Jun 3, 2025
@Deep-unlearning Deep-unlearning marked this pull request as ready for review June 5, 2025 16:03
@ArthurZucker ArthurZucker removed the request for review from Rocketknight1 July 7, 2025 12:05
Comment on lines +1542 to +1548
if padding_mask is not None:
    # Expected token length, as in: https://github.com/zhenye234/X-Codec-2.0/blob/ccbbf340ff143dfa6a0ea7cd61ec34a8ba2f1c3d/inference_save_code.py#L89
    audio_length = padding_mask.sum(dim=-1, keepdim=True).cpu()
    token_length = audio_length // self.hop_length
    codes_padding_mask = torch.zeros(audio_codes.shape, dtype=padding_mask.dtype)
    idx = torch.arange(audio_codes.shape[-1]).view(1, -1)
    codes_padding_mask = (idx < token_length).to(padding_mask.dtype).to(padding_mask.device)
Contributor

Using the provided padding mask to compute the corresponding padding mask in token "space".
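
A minimal, self-contained illustration of this mapping (the hop length and shapes are made up for the example):

import torch

hop_length = 320
# One batch item: 400 valid audio samples followed by 240 padded samples.
padding_mask = torch.cat([torch.ones(1, 400), torch.zeros(1, 240)], dim=-1)
audio_codes = torch.zeros(1, 2, dtype=torch.long)  # pretend the encoder produced 2 tokens

audio_length = padding_mask.sum(dim=-1, keepdim=True)   # 400 valid samples
token_length = audio_length // hop_length               # 400 // 320 -> 1 valid token
idx = torch.arange(audio_codes.shape[-1]).view(1, -1)   # [[0, 1]]
codes_padding_mask = (idx < token_length).to(padding_mask.dtype)
# tensor([[1., 0.]]) -> only the first code position corresponds to real audio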

Comment on lines +1492 to +1494
audio: torch.Tensor,
audio_spectrogram: torch.Tensor,
padding_mask: Optional[torch.Tensor] = None,
Contributor

@ebezzam ebezzam Oct 2, 2025

Removed feature extraction from the model, so that the spectrogram is computed in the feature extractor and passed to the model. Also using the new names 😉
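
For context, a hedged sketch of what this split looks like from the user side. The checkpoint name is taken from the discussion further down; that the Auto classes resolve to the XCodec2 feature extractor and model, and the exact output keys, are assumptions.

import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

feature_extractor = AutoFeatureExtractor.from_pretrained("hf-audio/xcodec2")
model = AutoModel.from_pretrained("hf-audio/xcodec2")

waveform = np.random.randn(16000).astype(np.float32)  # 1 s of dummy 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

# The feature extractor now returns the spectrogram alongside the raw audio and
# padding mask, so the model no longer computes Mel features internally.
with torch.no_grad():
    outputs = model(**inputs)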

return spectrogram_list


def spectrogram_torch(
Contributor

@ebezzam ebezzam Oct 3, 2025

Added a Torch equivalent of spectrogram_batched (namely Mel feature extraction with Kaldi-style pre-processing, which I didn't see supported in other torch implementations), so that torch/GPU is supported by the feature extractor. Could also update SeamlessM4T?
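
For reference, Kaldi-style log-Mel features can also be computed in plain PyTorch through torchaudio's Kaldi-compliance module; this is shown only as a point of comparison for the new spectrogram_torch function, not as the implementation added in this PR, and the parameter values are illustrative.

import torch
import torchaudio.compliance.kaldi as ta_kaldi

waveform = torch.randn(1, 16000)  # (channels, samples), dummy 16 kHz audio

# Framing, pre-emphasis, windowing and Mel binning all follow Kaldi conventions.
features = ta_kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=16000,
    frame_length=25.0,  # ms
    frame_shift=10.0,   # ms
)
print(features.shape)  # (num_frames, num_mel_bins)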

"""
Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
and hence the waveform should not be normalized before feature extraction.
Get mel-filter bank features using Numpy method to mimic Kaldi.
Contributor

Update docstring since it wasn't using TorchAudio!

return y


class ISTFTHead(nn.Module):
Contributor

Note to self: could be imported from Vocos when merged: #39403
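
For readers who haven't seen Vocos, a minimal sketch of what such an ISTFT head does: project hidden states to a complex spectrogram and invert it to a waveform. The projection size, clamping value and window choice are illustrative, not the exact code in this PR.

import torch
import torch.nn as nn

class MiniISTFTHead(nn.Module):
    """Project hidden states to a complex spectrogram and invert it to a waveform."""

    def __init__(self, hidden_dim: int, n_fft: int, hop_length: int):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # n_fft + 2 = (n_fft // 2 + 1) magnitudes + (n_fft // 2 + 1) phases
        self.proj = nn.Linear(hidden_dim, n_fft + 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_dim)
        x = self.proj(hidden_states).transpose(1, 2)     # (batch, n_fft + 2, frames)
        magnitude, phase = x.chunk(2, dim=1)
        magnitude = torch.exp(magnitude).clamp(max=1e2)  # keep magnitudes bounded
        spec = torch.polar(magnitude, phase)             # complex spectrogram
        return torch.istft(
            spec,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=spec.device),
            center=True,
        )

head = MiniISTFTHead(hidden_dim=512, n_fft=1024, hop_length=256)
waveform = head(torch.randn(2, 100, 512))  # (batch, samples)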

@ebezzam
Contributor

ebezzam commented Oct 3, 2025

run-slow: xcodec2

Contributor

github-actions bot commented Oct 3, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/xcodec2']
quantizations: [] ...

Contributor

github-actions bot commented Oct 3, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, dac, seamless_m4t, xcodec, xcodec2

raise ValueError("Padding must be 'center' or 'same'.")


class Xcodec2ISTFTHead(nn.Module):
Contributor

Note to self: this and Xcodec2ISTFT can be imported from Vocos when merged: #39403

@ebezzam ebezzam mentioned this pull request Oct 3, 2025
@ebezzam
Contributor

ebezzam commented Oct 7, 2025

run-slow: xcodec2

Contributor

github-actions bot commented Oct 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/xcodec2']
quantizations: [] ...

@ryan-minato

I see that the Xcodec2Model weights have already been uploaded to the Hugging Face Hub under hf-audio/xcodec2.
I am currently developing a training framework for XCodec2 and would like to confirm if the API of Xcodec2Model in this Pull Request is now largely finalized/stable.

Could you please provide an estimated timeline for when this PR is expected to be merged into the main branch of transformers?
