Fix DAC integration tests and checkpoint conversion. #39313

ebezzam · 2025-07-09T15:55:04Z

What does this PR do?

Multiple things were wrong with the tests:

The expected outputs. I created this gist to reproduce new expected outputs (as not possible to reproduce previous ones): https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa

Hop length was incorrectly set on the Hub for 16kHz and 24kHz (UPDATE: corrected from 512 to 320 thanks to merged PR by Descript team). I’ve corrected in the conversion script for future use. Below are the test outputs when the Hop length is incorrect (3/6 tests fail):

# RUN_SLOW=1 pytest tests/models/dac/test_modeling_dac.py::DacIntegrationTest
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_16khz FAILED                                                          [ 16%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_24khz PASSED                                                          [ 33%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_44khz PASSED                                                          [ 50%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_16khz FAILED                                                    [ 66%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_24khz FAILED                                                    [ 83%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_44khz PASSED                                                    [100%]

Also I’ve standardized the tests (24kHz was testing something else) and added tests on quantizer and decoder outputs.

Note on high tolerances for encoder and decoder

Previous (and still now) the tests for the encoder outputs have a high tolerance (1e-3). With this script, I've verified that the weights have been mapped correctly (output snippet below).

Conv1 weight max diff: 5.96e-08

Block 0 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 2.24e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 1.49e-08
  res_unit2.conv2 diff: 2.98e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 1 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 1.49e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.24e-08
  res_unit2.conv2 diff: 5.96e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 2 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 2.98e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.98e-08
  res_unit2.conv2 diff: 5.96e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 3 weight differences:
  Block conv weight diff: 2.24e-08
  res_unit1.conv1 diff: 2.24e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.24e-08
  res_unit2.conv2 diff: 2.98e-08
  res_unit3.conv1 diff: 4.47e-08
  res_unit3.conv2 diff: 2.98e-08

Snake1 alpha diff: 0.00e+00
Conv2 weight diff: 1.49e-08

However, error exponentially increases through encoder and decoder layers.

From my understanding, it is because the Transformers version of DAC does NOT have weight normalization in its architecture, while the Original version does (see model addition PR for discussion as to why there is no weight normalization in the Transformers version). This causes small differences between expected outputs at each layer, which get larger and larger tensors go deeper in the network.

Below is output snippet of error propagation through the encoder for the 44.1kHz model, calculated with the same script.

=== ENCODER ERROR PROPAGATION ANALYSIS ===
Layer                Max Error       Mean Error      Error Growth   
----------------------------------------------------------------------
Input                0.00e+00        0.00e+00        1.0x           
Conv1                1.19e-07        1.32e-09        infx           
Block0               2.26e-04        3.54e-06        1897.0x        

    --- Block 0 Internal Analysis ---
      res_unit1: 1.45e-05 (121.9x)
      res_unit2: 5.70e-05 (478.0x)
      res_unit3: 2.00e-04 (1680.0x)
      Final layers: 5.57e-04 (4670.2x)
Block1               5.99e-03        1.17e-04        26.5x          

    --- Block 1 Internal Analysis ---
      res_unit1: 1.43e-03 (6.3x)
      res_unit2: 1.65e-03 (7.3x)
      res_unit3: 6.91e-03 (30.6x)
      Final layers: 1.17e-02 (51.6x)
Block2               1.61e-02        1.82e-04        2.7x           

    --- Block 2 Internal Analysis ---
      res_unit1: 1.99e-02 (3.3x)
      res_unit2: 3.75e-02 (6.3x)
      res_unit3: 5.91e-02 (9.9x)
      Final layers: 3.76e-02 (6.3x)
Block3               3.64e-02        8.27e-04        2.3x           

    --- Block 3 Internal Analysis ---
      res_unit1: 1.61e-02 (1.0x)
      res_unit2: 3.24e-02 (2.0x)
      res_unit3: 8.89e-02 (5.5x)
      Final layers: 2.32e-01 (14.4x)
Snake1               7.28e-02        7.92e-04        2.0x           
Conv2                2.75e-02        9.82e-04        0.4x           

=== ERROR PROPAGATION SUMMARY ===
Initial weight error: 1.19e-07
Final encoder error: 2.75e-02
Total error amplification: 230790x

Top 3 error amplifiers:
  1. Block0: 1897.0x amplification
  2. Block1: 26.5x amplification
  3. Block2: 2.7x amplification

Conv1 already has weight normalization in original model, and we see a minimal error (precision-limited).
Block0 has 7x layers with weight norm (see original model), and that's where we get the big jump in deviation with the Transformer model -- 1897x.

We also have to keep decoder test tolerances quite high for the same reasons, propagation of weight normalization error.

Fortunately we can still use Transformers version as a valid approximation

Because:

quantizer is not affected by this precision error thanks to discretization, allowing us to keep tolerances in the tests at 1e-6
the error due the codec itself, is very similar between both approaches (see gist), allowing us keep tolerances at 1e-6 for the codec error

src/transformers/models/dac/modeling_dac.py

tests/models/dac/test_modeling_dac.py

HuggingFaceDocBuilderDev · 2025-07-09T16:08:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu · 2025-07-09T16:12:19Z

Wow, very nice and much needed :D

Just a general question regarding batching:

Did you only cross-check against the original implementation?
During Dia integration, I noticed that batching is indeed not producing equivalent results when going single sample vs batch (by checking if the padded sample up to valid values produces equivalent ish values) - this might be a bug in the original code base if they now match

Regarding the code_quality CI, make style should fix it.

vasqu

Just some initial thoughts on my side

src/transformers/models/dac/modeling_dac.py

tests/models/dac/test_modeling_dac.py

ebezzam · 2025-07-09T17:49:28Z

@vasqu thanks for the quick comments!

here's the error I get with checkpoint conversion, when uncommenting weight_norm = nn.utils.parametrizations.weight_norm

# COMMANDpython src/transformers/models/dac/convert_dac_checkpoint.py \
#> --model dac_16khz --sample_rate 16000 \
#> --pytorch_dump_folder_path dac_16k_local \
#> --checkpoint_path ~/.cache/descript/dac/weights_16khz_8kbps_0.0.5.pth

encoder.conv1.bias was initialized from encoder.block.0.bias.
Traceback (most recent call last):
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 249, in <module>
    convert_checkpoint(
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 220, in convert_checkpoint
    recursively_load_weights(original_checkpoint, model, model_name)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 179, in recursively_load_weights
    set_recursively(hf_model, mapped_key, value, name, weight_type)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 105, in set_recursively
    hf_shape = getattr(hf_pointer, weight_type).shape
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'ParametrizedConv1d' object has no attribute 'weight_g'. Did you mean: 'weight'?

vasqu · 2025-07-09T17:58:18Z

Seems to be related to #33275 and is possibly tied to the deprecation cycle(?) 👀

Edit: #36393 is probably a proper fix

ebezzam · 2025-07-10T08:31:43Z

Seems to be related to #33275 and is possibly tied to the deprecation cycle(?) 👀

Edit: #36393 is probably a proper fix

@vasqu thanks for linking the latter PR, they seem to have had the same issue with DAC conversion.

And their fix is similar to mine (not using nn.utils.parametrizations.weight_norm). They rather:

copied DacPreTrainedModel.apply_weight_norm to the conversion script
removed nn.utils.parametrizations.weight_norm
call their new apply_weight_norm when converting (see their changed file).

Which approach is better?

New apply_weight_norm within the conversion script (their PR)
Removing nn.utils.parametrizations.weight_norm from DacPreTrainedModel.apply_weight_norm (this PR)

eustlb

Thanks a lot @ebezzam, as @vasqu mentioned, this was much needed.
Left comments, but mainly I feel like these integration tests were poorly designed in the beginning. It might be naïve, but I just don't get why doing means on the outputs!
Can we look in redesigning them to that we do

EXPECTED_OUTPUTS = torch.tensor([[...]]))

and torch.allclose directly on the outputted tensors (so that we also catch shape mismatches!)
(this will very likely give different outputs on diff GPUs, hence the importance of the reproducers)

Moreover, (as pointed out by @vasqu), #36393 looks like a correct fix to the depecrated weight_norm issue. Therefore I'll leave you review 😉

tests/models/dac/test_modeling_dac.py

eustlb · 2025-07-10T09:34:42Z

Another comment about batch integration:

During Dia integration, I noticed that batching is indeed not producing equivalent results when going single sample vs batch (by checking if the padded sample up to valid values produces equivalent ish values) - this might be a bug in the original code base if they now match

@vasqu I think this is due to the fact that (as discussed offline) conv networks with bias add bias to padding values and the errors propagate. I originally wanted to mask convs outs on padding value to reduce the impact as much as possible (and ensure equivalence between batched and single sample) but it looks like it is something that is not really impacting outputs, so we can likely skip doing this, especially if it is not done in the original codebase for their batch inference.

vasqu · 2025-07-10T10:33:27Z

My problem is that batched inference might be really broken, not in a way that the errors on the padded values accumulate (I agree we can ignore this) but that also the original values are affected by the padding. I had a script where I tested two sample batch vs single each and compared the values (non-padding values) and it failed to match on the shorter sequence.

I'm unsure if the original codebase is correct here tbh and during discussions with people from Dia they also said that batching with Dac didn't work as well. My concern here is that there might be a pretty big bug somewhere which we would ignore. However, this should not be part of this PR - I commented before looking into the code at that time :D

ebezzam · 2025-07-11T15:26:05Z

@eustlb thanks for the comments!

I've updated the tests (from means) to actual outputs, i.e. comparing with EXPECTED_X.
I compared to up to 15 codes per codebook (otherwise EXPECTED_X are very big), but compare to 50 outputs from the decoded output.
I looked into large encoder errors, and quite sure it is compute-related. I did a deep dive layer-by-layer checking outputs and weight values with this script. Weights are numerically exact at all layers, but errors accumulates over the encoder layers. I think (+Claude insight) because of precision error from weight normalization + accumulation over layers due downsampling operations. What's reassuring is that the quantizer discrete codes for the same input are very close (discretization adds robustness to the precision errors, which then don't accumulate as much). Moreover, the codec errors are very similar.

@vasqu re your original question (cross-check against the original implementation?) yes all checks were against original DAC model and codebase

ebezzam · 2025-07-11T15:26:43Z

I had a script where I tested two sample batch vs single each and compared the values (non-padding values) and it failed to match on the shorter sequence.

@vasqu is this a test we want to add? here or in another PR?

vasqu · 2025-07-14T10:29:10Z

@ebezzam I think this should be another PR and would need to take another look. I suspect that the original codebase (and subsequently ours) doesn't handle batching as it should and it leads to differences in inference when single sample vs batch, i.e. pseudo code

audio_1
audio_2
audio_batch = [audio_1, audio_2]

single_sample_1 = dac(audio_1)
single_sample_2 = dac(audio_2)
batch = dac(audio_batch)

assertClose(single_sample_1, audio_batch[0, : len(single_sample_1)])
assertClose(single_sample_2, audio_batch[1, : len(single_sample_2)])

Hence, batching is probably still broken.

eustlb · 2025-07-23T09:30:54Z

tests/models/dac/test_modeling_dac.py

+    Code for reproducing expected outputs can be found here:
+    - Single file: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration_single-py
+    - Batched: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration-py

-        librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+    See https://github.com/huggingface/transformers/pull/39313 for reason behind large tolerance between for encoder
+    and decoder outputs (1e-3). In summary, original model uses weight normalization, while Transformers does not. This
+    leads to accumulating error. However, this does not affect the quantizer codes, thanks to discretization being
+    robust to precision errors. Moreover, codec error is similar between Transformers and original.
+
+    Moreover, here is a script to debug outputs and weights layer-by-layer:
+    https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_layer_by_layer_debugging-py
+    """


ebezzam · 2025-07-23T09:44:37Z

run-slow: dac

eustlb · 2025-07-23T11:24:22Z

run-slow: dac

github-actions · 2025-07-23T11:25:53Z

This comment contains run-slow, running the specified jobs:

models: ['models/dac']
quantizations: [] ...

eustlb · 2025-07-23T16:39:11Z

run-slow: dac

github-actions · 2025-07-23T16:40:38Z

This comment contains run-slow, running the specified jobs:

models: ['models/dac']
quantizations: [] ...

github-actions · 2025-07-23T17:08:32Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: dac

* Fix DAC (slow) integration tests. * Fix DAC conversion. * Address comments * Sync with main, uncomment nn.utils.parametrizations.weight_norm. * Update DAC integration tests with expected outputs. * Added info about encoder/decoder error and longer decoder outputs. * Parameterize tests. * Set expected values to GitHub runners.

ebezzam added 3 commits July 9, 2025 17:50

Fix DAC (slow) integration tests.

4cce31f

Fix DAC conversion.

716baa6

Merge branch 'main' into dac_fix

086f6b0

ebezzam commented Jul 9, 2025

View reviewed changes

src/transformers/models/dac/modeling_dac.py Outdated Show resolved Hide resolved

ebezzam commented Jul 9, 2025

View reviewed changes

tests/models/dac/test_modeling_dac.py Outdated Show resolved Hide resolved

eustlb self-requested a review July 9, 2025 15:59

ebezzam commented Jul 9, 2025

View reviewed changes

tests/models/dac/test_modeling_dac.py Outdated Show resolved Hide resolved

vasqu reviewed Jul 9, 2025

View reviewed changes

Address comments

9e51f6f

eustlb reviewed Jul 10, 2025

View reviewed changes

vasqu mentioned this pull request Jul 10, 2025

Handle DAC conversion when using weight_norm with newer PyTorch versions #36393

Merged

ebezzam added 4 commits July 10, 2025 15:19

Merge branch 'main' into dac_fix

0addb3e

Sync with main, uncomment nn.utils.parametrizations.weight_norm.

e5f02a2

Update DAC integration tests with expected outputs.

178c4d8

Merge branch 'main' into dac_fix

60004f5

ebezzam added the Audio label Jul 22, 2025

Added info about encoder/decoder error and longer decoder outputs.

da8243b

ebezzam requested a review from eustlb July 22, 2025 13:47

eustlb approved these changes Jul 23, 2025

View reviewed changes

ebezzam added 2 commits July 23, 2025 18:00

Parameterize tests.

36a24cb

Set expected values to GitHub runners.

7d27ea1

Merge branch 'main' into dac_fix

8eb8d8e

ebezzam merged commit 7a4e2e7 into huggingface:main Jul 23, 2025
19 checks passed

ebezzam deleted the dac_fix branch July 23, 2025 17:21

This was referenced Jul 25, 2025

Add xcodec2 model #37868

Open

Fix DAC conversion script #39793

Open

Fix DAC integration tests and checkpoint conversion. #39313

Fix DAC integration tests and checkpoint conversion. #39313

Uh oh!

Conversation

ebezzam commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Note on high tolerances for encoder and decoder

Fortunately we can still use Transformers version as a valid approximation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Jul 9, 2025

Uh oh!

vasqu commented Jul 9, 2025

Uh oh!

vasqu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ebezzam commented Jul 9, 2025

Uh oh!

vasqu commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ebezzam commented Jul 10, 2025

Uh oh!

eustlb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eustlb commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vasqu commented Jul 10, 2025

Uh oh!

ebezzam commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ebezzam commented Jul 11, 2025

Uh oh!

vasqu commented Jul 14, 2025

Uh oh!

eustlb Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

ebezzam commented Jul 23, 2025

Uh oh!

eustlb commented Jul 23, 2025

Uh oh!

github-actions bot commented Jul 23, 2025

Uh oh!

eustlb commented Jul 23, 2025

Uh oh!

github-actions bot commented Jul 23, 2025

Uh oh!

github-actions bot commented Jul 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ebezzam commented Jul 9, 2025 •

edited

Loading

vasqu commented Jul 9, 2025 •

edited

Loading

eustlb commented Jul 10, 2025 •

edited

Loading

ebezzam commented Jul 11, 2025 •

edited

Loading