Conversation

@Deep-unlearning Deep-unlearning commented Apr 29, 2025

What does this PR do?

This PR adds support for XCodec2, a high-fidelity general neural audio codec used in the Llasa text-to-speech model, to the Transformers library.

This model is composed of 5 components (see the sketch after this list for how they fit together):

  • A Semantic Encoder
  • An Acoustic Encoder
  • A VectorQuantizer
  • A Semantic Decoder
  • An Acoustic Decoder
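
At a high level, these components compose roughly as in the pseudocode sketch below. This is based on the original X-Codec-2.0 design; all attribute and method names here are illustrative assumptions, not the final Transformers API.

import torch

def xcodec2_encode(model, audio, audio_spectrogram):
    # Semantic features (e.g. from a speech SSL backbone) capture content,
    # acoustic features capture fine-grained signal detail.
    semantic = model.semantic_encoder(audio)
    acoustic = model.acoustic_encoder(audio_spectrogram)
    # Both streams are fused and quantized into a single stream of discrete codes.
    fused = torch.cat([semantic, acoustic], dim=-1)
    audio_codes = model.quantizer(fused)
    return audio_codes

def xcodec2_decode(model, audio_codes):
    quantized = model.quantizer.dequantize(audio_codes)
    # The semantic decoder reconstructs semantic features (used as a reconstruction
    # objective); the acoustic decoder turns the quantized representation back into a waveform.
    audio = model.acoustic_decoder(quantized)
    return audio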

This is still a draft PR. Work done so far:

  • Adapted the model to Transformers format in modeling_xcodec2.py and modular_xcodec2.py.

Todo

  • Add the checkpoint conversion scripts and push to the hub
  • Support batch inference
  • Write Tests
  • Add documentation

Who can review?

cc: @ArthurZucker
cc: @eustlb @Vaibhavs10 for visibility

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

Feel free to ping me once this is ready!

@Deep-unlearning Deep-unlearning changed the title [WiP] Add xcodec model [WiP] Add xcodec2 model Jun 3, 2025
@Deep-unlearning Deep-unlearning marked this pull request as ready for review June 5, 2025 16:03
@ArthurZucker ArthurZucker removed the request for review from Rocketknight1 July 7, 2025 12:05
Comment on lines +1542 to +1548
if padding_mask is not None:
    # Expected token length, as in: https://github.com/zhenye234/X-Codec-2.0/blob/ccbbf340ff143dfa6a0ea7cd61ec34a8ba2f1c3d/inference_save_code.py#L89
    audio_length = padding_mask.sum(dim=-1, keepdim=True).cpu()
    token_length = audio_length // self.hop_length
    codes_padding_mask = torch.zeros(audio_codes.shape, dtype=padding_mask.dtype)
    idx = torch.arange(audio_codes.shape[-1]).view(1, -1)
    codes_padding_mask = (idx < token_length).to(padding_mask.dtype).to(padding_mask.device)
Contributor

Using the provided padding mask to compute the corresponding padding mask in token "space".
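
A minimal, self-contained illustration of this mapping (the hop length and shapes are made up for the example):

import torch

hop_length = 320
# One batch item: 400 valid audio samples followed by 240 padded samples.
padding_mask = torch.cat([torch.ones(1, 400), torch.zeros(1, 240)], dim=-1)
audio_codes = torch.zeros(1, 2, dtype=torch.long)  # pretend the encoder produced 2 tokens

audio_length = padding_mask.sum(dim=-1, keepdim=True)   # 400 valid samples
token_length = audio_length // hop_length               # 400 // 320 -> 1 valid token
idx = torch.arange(audio_codes.shape[-1]).view(1, -1)   # [[0, 1]]
codes_padding_mask = (idx < token_length).to(padding_mask.dtype)
# tensor([[1., 0.]]) -> only the first code position corresponds to real audio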

Comment on lines +1492 to +1494
audio: torch.Tensor,
audio_spectrogram: torch.Tensor,
padding_mask: Optional[torch.Tensor] = None,
Contributor

@ebezzam ebezzam Oct 2, 2025

Removed feature extraction from the model, so that the spectrogram is computed in the feature extractor and passed to the model. Also using the new names 😉
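
For context, a hedged sketch of what this split looks like from the user side. The checkpoint name is taken from the discussion further down; that the Auto classes resolve to the XCodec2 feature extractor and model, and the exact output keys, are assumptions.

import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

feature_extractor = AutoFeatureExtractor.from_pretrained("hf-audio/xcodec2")
model = AutoModel.from_pretrained("hf-audio/xcodec2")

waveform = np.random.randn(16000).astype(np.float32)  # 1 s of dummy 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

# The feature extractor now returns the spectrogram alongside the raw audio and
# padding mask, so the model no longer computes Mel features internally.
with torch.no_grad():
    outputs = model(**inputs)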

return spectrogram_list


def spectrogram_torch(
Contributor

@ebezzam ebezzam Oct 3, 2025

Added a Torch equivalent of spectrogram_batched (namely Mel feature extraction with Kaldi-style pre-processing, which I didn't see supported in other torch implementations), so that torch/GPU is supported by the feature extractor. Could also update SeamlessM4T?
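
For reference, Kaldi-style log-Mel features can also be computed in plain PyTorch through torchaudio's Kaldi-compliance module; this is shown only as a point of comparison for the new spectrogram_torch function, not as the implementation added in this PR, and the parameter values are illustrative.

import torch
import torchaudio.compliance.kaldi as ta_kaldi

waveform = torch.randn(1, 16000)  # (channels, samples), dummy 16 kHz audio

# Framing, pre-emphasis, windowing and Mel binning all follow Kaldi conventions.
features = ta_kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=16000,
    frame_length=25.0,  # ms
    frame_shift=10.0,   # ms
)
print(features.shape)  # (num_frames, num_mel_bins)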

"""
Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
and hence the waveform should not be normalized before feature extraction.
Get mel-filter bank features using Numpy method to mimic Kaldi.
Contributor

Update docstring since it wasn't using TorchAudio!

return y


class ISTFTHead(nn.Module):
Contributor

Note to self: could be imported from Vocos when merged: #39403
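
For readers who haven't seen Vocos, a minimal sketch of what such an ISTFT head does: project hidden states to a complex spectrogram and invert it to a waveform. The projection size, clamping value and window choice are illustrative, not the exact code in this PR.

import torch
import torch.nn as nn

class MiniISTFTHead(nn.Module):
    """Project hidden states to a complex spectrogram and invert it to a waveform."""

    def __init__(self, hidden_dim: int, n_fft: int, hop_length: int):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # n_fft + 2 = (n_fft // 2 + 1) magnitudes + (n_fft // 2 + 1) phases
        self.proj = nn.Linear(hidden_dim, n_fft + 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_dim)
        x = self.proj(hidden_states).transpose(1, 2)     # (batch, n_fft + 2, frames)
        magnitude, phase = x.chunk(2, dim=1)
        magnitude = torch.exp(magnitude).clamp(max=1e2)  # keep magnitudes bounded
        spec = torch.polar(magnitude, phase)             # complex spectrogram
        return torch.istft(
            spec,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=spec.device),
            center=True,
        )

head = MiniISTFTHead(hidden_dim=512, n_fft=1024, hop_length=256)
waveform = head(torch.randn(2, 100, 512))  # (batch, samples)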

@ebezzam
Contributor

ebezzam commented Oct 3, 2025

run-slow: xcodec2

Contributor

github-actions bot commented Oct 3, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/xcodec2']
quantizations: [] ...

Contributor

github-actions bot commented Oct 3, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, dac, seamless_m4t, xcodec, xcodec2

raise ValueError("Padding must be 'center' or 'same'.")


class Xcodec2ISTFTHead(nn.Module):
Contributor

Note to self: this and Xcodec2ISTFT can be imported from Vocos when merged: #39403

@ebezzam ebezzam mentioned this pull request Oct 3, 2025
@ebezzam
Contributor

ebezzam commented Oct 7, 2025

run-slow: xcodec2

Contributor

github-actions bot commented Oct 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/xcodec2']
quantizations: [] ...

@ryan-minato

I see that the Xcodec2Model weights have already been uploaded to the Hugging Face Hub under hf-audio/xcodec2.
I am currently developing a training framework for XCodec2 and would like to confirm if the API of Xcodec2Model in this Pull Request is now largely finalized/stable.

Could you please provide an estimated timeline for when this PR is expected to be merged into the main branch of transformers?
