@ebezzam ebezzam commented Jul 9, 2025

What does this PR do?

Multiple things were wrong with the tests:

  • The expected outputs. I created this gist to reproduce new expected outputs (as it was not possible to reproduce the previous ones): https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa

  • Hop length was incorrectly set on the Hub for 16kHz and 24kHz (UPDATE: corrected from 512 to 320 thanks to a merged PR by the Descript team). I’ve corrected it in the conversion script for future use. Below are the test outputs when the hop length is incorrect (3/6 tests fail):

    # RUN_SLOW=1 pytest tests/models/dac/test_modeling_dac.py::DacIntegrationTest
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_16khz FAILED                                                          [ 16%]
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_24khz PASSED                                                          [ 33%]
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_44khz PASSED                                                          [ 50%]
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_16khz FAILED                                                    [ 66%]
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_24khz FAILED                                                    [ 83%]
    tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_44khz PASSED                                                    [100%]

I’ve also standardized the tests (the 24kHz test was checking something different) and added tests on the quantizer and decoder outputs.

Note on high tolerances for encoder and decoder

Previously (and still now), the tests for the encoder outputs have a high tolerance (1e-3). With this script, I've verified that the weights have been mapped correctly (output snippet below).

Conv1 weight max diff: 5.96e-08

Block 0 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 2.24e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 1.49e-08
  res_unit2.conv2 diff: 2.98e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 1 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 1.49e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.24e-08
  res_unit2.conv2 diff: 5.96e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 2 weight differences:
  Block conv weight diff: 1.49e-08
  res_unit1.conv1 diff: 2.98e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.98e-08
  res_unit2.conv2 diff: 5.96e-08
  res_unit3.conv1 diff: 1.49e-08
  res_unit3.conv2 diff: 2.98e-08

Block 3 weight differences:
  Block conv weight diff: 2.24e-08
  res_unit1.conv1 diff: 2.24e-08
  res_unit1.conv2 diff: 2.98e-08
  res_unit2.conv1 diff: 2.24e-08
  res_unit2.conv2 diff: 2.98e-08
  res_unit3.conv1 diff: 4.47e-08
  res_unit3.conv2 diff: 2.98e-08

Snake1 alpha diff: 0.00e+00
Conv2 weight diff: 1.49e-08

However, the error increases rapidly through the encoder and decoder layers.

From my understanding, this is because the Transformers version of DAC does NOT have weight normalization in its architecture, while the original version does (see the model addition PR for discussion as to why there is no weight normalization in the Transformers version). This causes small differences in the outputs at each layer, which grow larger and larger as tensors go deeper into the network.
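
Not the gist script itself, but a minimal sketch of how such a layer-by-layer comparison can be done with forward hooks (the module names in the commented usage are hypothetical and depend on each implementation):

import torch

def make_hook(name, store):
    # store the output of a submodule under its name
    def hook(module, inputs, output):
        store[name] = output.detach()
    return hook

def collect_outputs(model, names, x):
    store, handles = {}, []
    for name, module in model.named_modules():
        if name in names:
            handles.append(module.register_forward_hook(make_hook(name, store)))
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return store

# Hypothetical usage -- the real keys depend on the two implementations:
# names = ["conv1", "block.0", "block.1", "block.2", "block.3", "snake1", "conv2"]
# outs_hf = collect_outputs(hf_encoder, names, x)
# outs_orig = collect_outputs(orig_encoder, names, x)
# for n in names:
#     print(n, (outs_hf[n] - outs_orig[n]).abs().max().item())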

Below is an output snippet of error propagation through the encoder for the 44.1kHz model, calculated with the same script.

=== ENCODER ERROR PROPAGATION ANALYSIS ===
Layer                Max Error       Mean Error      Error Growth   
----------------------------------------------------------------------
Input                0.00e+00        0.00e+00        1.0x           
Conv1                1.19e-07        1.32e-09        infx           
Block0               2.26e-04        3.54e-06        1897.0x        

    --- Block 0 Internal Analysis ---
      res_unit1: 1.45e-05 (121.9x)
      res_unit2: 5.70e-05 (478.0x)
      res_unit3: 2.00e-04 (1680.0x)
      Final layers: 5.57e-04 (4670.2x)
Block1               5.99e-03        1.17e-04        26.5x          

    --- Block 1 Internal Analysis ---
      res_unit1: 1.43e-03 (6.3x)
      res_unit2: 1.65e-03 (7.3x)
      res_unit3: 6.91e-03 (30.6x)
      Final layers: 1.17e-02 (51.6x)
Block2               1.61e-02        1.82e-04        2.7x           

    --- Block 2 Internal Analysis ---
      res_unit1: 1.99e-02 (3.3x)
      res_unit2: 3.75e-02 (6.3x)
      res_unit3: 5.91e-02 (9.9x)
      Final layers: 3.76e-02 (6.3x)
Block3               3.64e-02        8.27e-04        2.3x           

    --- Block 3 Internal Analysis ---
      res_unit1: 1.61e-02 (1.0x)
      res_unit2: 3.24e-02 (2.0x)
      res_unit3: 8.89e-02 (5.5x)
      Final layers: 2.32e-01 (14.4x)
Snake1               7.28e-02        7.92e-04        2.0x           
Conv2                2.75e-02        9.82e-04        0.4x           

=== ERROR PROPAGATION SUMMARY ===
Initial weight error: 1.19e-07
Final encoder error: 2.75e-02
Total error amplification: 230790x

Top 3 error amplifiers:
  1. Block0: 1897.0x amplification
  2. Block1: 26.5x amplification
  3. Block2: 2.7x amplification
  • Conv1 already has weight normalization applied in the original model, and we see a minimal, precision-limited error.
  • Block0 has 7 layers with weight norm (see the original model), and that's where we get the big jump in deviation from the Transformers model: 1897x.

We also have to keep the decoder test tolerances quite high for the same reason: propagation of the weight-normalization error.

Fortunately, we can still use the Transformers version as a valid approximation, because:

  • the quantizer is not affected by this precision error, thanks to discretization, allowing us to keep test tolerances at 1e-6 (see the sketch after this list)
  • the error due to the codec itself is very similar between both approaches (see gist), allowing us to keep tolerances at 1e-6 for the codec error
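
As a rough sketch of what those tolerances look like in practice (checkpoint name, processor call, and encode/decode attribute names follow the current DAC docs as I understand them, not this PR's test code verbatim; EXPECTED_* are placeholders):

import torch
from datasets import load_dataset
from transformers import AutoProcessor, DacModel

model = DacModel.from_pretrained("descript/dac_16khz").eval()
processor = AutoProcessor.from_pretrained("descript/dac_16khz")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(raw_audio=ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    enc = model.encode(inputs["input_values"])
    dec = model.decode(quantized_representation=enc.quantized_representation)

# Discrete codes are robust to the weight-norm precision error, so a tight (1e-6)
# tolerance -- or even an exact match -- is reasonable:
# torch.testing.assert_close(enc.audio_codes, EXPECTED_CODES)
# Continuous encoder/decoder outputs need the looser 1e-3 tolerance:
# torch.testing.assert_close(dec.audio_values, EXPECTED_AUDIO, atol=1e-3, rtol=1e-3)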

@eustlb eustlb self-requested a review July 9, 2025 15:59

vasqu commented Jul 9, 2025

Wow, very nice and much needed :D

Just a general question regarding batching:

  • Did you only cross-check against the original implementation?
  • During Dia integration, I noticed that batching does not produce equivalent results when going single sample vs batch (checking whether the padded sample, up to the valid values, produces equivalent-ish values) - this might be a bug in the original codebase if they now match

Regarding the code_quality CI, make style should fix it.

@vasqu vasqu left a comment

Just some initial thoughts on my side

ebezzam commented Jul 9, 2025

@vasqu thanks for the quick comments!

Here's the error I get with checkpoint conversion when uncommenting weight_norm = nn.utils.parametrizations.weight_norm:

python src/transformers/models/dac/convert_dac_checkpoint.py \
    --model dac_16khz --sample_rate 16000 \
    --pytorch_dump_folder_path dac_16k_local \
    --checkpoint_path ~/.cache/descript/dac/weights_16khz_8kbps_0.0.5.pth

encoder.conv1.bias was initialized from encoder.block.0.bias.
Traceback (most recent call last):
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 249, in <module>
    convert_checkpoint(
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 220, in convert_checkpoint
    recursively_load_weights(original_checkpoint, model, model_name)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 179, in recursively_load_weights
    set_recursively(hf_model, mapped_key, value, name, weight_type)
  File "/home/eric_bezzam/transformers/src/transformers/models/dac/convert_dac_checkpoint.py", line 105, in set_recursively
    hf_shape = getattr(hf_pointer, weight_type).shape
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
    raise AttributeError(
AttributeError: 'ParametrizedConv1d' object has no attribute 'weight_g'. Did you mean: 'weight'?
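
A minimal sketch of what the traceback reflects: the non-deprecated nn.utils.parametrizations.weight_norm no longer exposes weight_g / weight_v; the two factors live under parametrizations.weight.original0 / original1 instead.

import torch.nn as nn

conv_old = nn.utils.weight_norm(nn.Conv1d(8, 8, 3))                   # deprecated API
conv_new = nn.utils.parametrizations.weight_norm(nn.Conv1d(8, 8, 3))  # parametrized API

print(hasattr(conv_old, "weight_g"), hasattr(conv_old, "weight_v"))   # True True
print(hasattr(conv_new, "weight_g"))                                  # False -> the AttributeError above
print([name for name, _ in conv_new.named_parameters()])
# ['bias', 'parametrizations.weight.original0', 'parametrizations.weight.original1']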

vasqu commented Jul 9, 2025

Seems to be related to #33275 and is possibly tied to the deprecation cycle(?) 👀

Edit: #36393 is probably a proper fix

ebezzam commented Jul 10, 2025

Seems to be related to #33275 and is possibly tied to the deprecation cycle(?) 👀

Edit: #36393 is probably a proper fix

@vasqu thanks for linking the latter PR, they seem to have had the same issue with DAC conversion.

And their fix is similar to mine (not using nn.utils.parametrizations.weight_norm); they instead add a new apply_weight_norm within the conversion script.

Which approach is better?

  1. New apply_weight_norm within the conversion script (their PR)
  2. Removing nn.utils.parametrizations.weight_norm from DacPreTrainedModel.apply_weight_norm (this PR)

@eustlb eustlb left a comment

Thanks a lot @ebezzam, as @vasqu mentioned, this was much needed.
Left comments, but mainly I feel like these integration tests were poorly designed from the beginning. It might be naïve, but I just don't get why we take means of the outputs!
Can we look into redesigning them so that we do

EXPECTED_OUTPUTS = torch.tensor([[...]])

and torch.allclose directly on the output tensors (so that we also catch shape mismatches!)
(this will very likely give different outputs on different GPUs, hence the importance of the reproducers)
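
A tiny illustration of why comparing means is a weak check (dummy tensors, not DAC outputs):

import torch

out = torch.tensor([0.0, 1.0, 2.0, 3.0])
expected = torch.tensor([1.5, 1.5, 1.5, 1.5])

print(torch.allclose(out.mean(), expected.mean()))  # True  -> a mean-based test passes
print(torch.allclose(out, expected))                # False -> the elementwise check catches it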

Moreover (as pointed out by @vasqu), #36393 looks like a correct fix for the deprecated weight_norm issue. Therefore I'll leave the review to you 😉

eustlb commented Jul 10, 2025

Another comment about batch integration:

During Dia integration, I noticed that batching is indeed not producing equivalent results when going single sample vs batch (by checking if the padded sample up to valid values produces equivalent ish values) - this might be a bug in the original code base if they now match

@vasqu I think this is due to the fact that (as discussed offline) conv networks with bias add the bias to padding values, and the errors propagate. I originally wanted to mask the conv outputs at padding positions to reduce the impact as much as possible (and ensure equivalence between batched and single-sample inference), but it doesn't seem to really impact the outputs, so we can likely skip doing this, especially since it is not done in the original codebase for their batch inference.
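
A minimal illustration of that point (a toy Conv1d, not the DAC layers): a convolution with a bias writes that bias into positions that correspond to padding, so zero-padded frames stop being zero after the first layer and the discrepancy propagates through the stack.

import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=True)

x = torch.zeros(1, 1, 10)
x[..., :6] = torch.randn(6)   # 6 valid samples followed by 4 zero-padded samples

with torch.no_grad():
    y = conv(x)

print(y[..., 7:])   # positions whose receptive field is all padding: equal to the bias, not zero
print(conv.bias)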

vasqu commented Jul 10, 2025

My problem is that batched inference might be really broken: not in the sense that errors on the padded values accumulate (I agree we can ignore this), but that the valid (non-padding) values are also affected by the padding. I had a script where I tested a two-sample batch vs each sample individually and compared the non-padding values, and it failed to match on the shorter sequence.

I'm unsure if the original codebase is correct here tbh, and during discussions with people from Dia they also said that batching with DAC didn't work well. My concern is that there might be a pretty big bug somewhere which we would ignore. However, this should not be part of this PR - I commented before looking into the code at that time :D

ebezzam commented Jul 11, 2025

@eustlb thanks for the comments!

  • I've updated the tests (from means) to actual outputs, i.e. comparing with EXPECTED_X.
  • I compare up to 15 codes per codebook (otherwise the EXPECTED_X tensors get very big), but up to 50 samples of the decoded output.
  • I looked into the large encoder errors, and I'm quite sure they are compute-related. I did a deep dive checking outputs and weight values layer by layer with this script. Weights are numerically exact at all layers, but errors accumulate over the encoder layers. I think (+ Claude insight) this is because of precision error from weight normalization plus accumulation over layers due to the downsampling operations. What's reassuring is that the quantizer's discrete codes for the same input are very close (discretization adds robustness to the precision errors, which then don't accumulate as much). Moreover, the codec errors are very similar.

@vasqu re: your original question (cross-check against the original implementation?): yes, all checks were against the original DAC model and codebase.

ebezzam commented Jul 11, 2025

I had a script where I tested two sample batch vs single each and compared the values (non-padding values) and it failed to match on the shorter sequence.

@vasqu is this a test we want to add, here or in another PR?

vasqu commented Jul 14, 2025

@ebezzam I think this should be another PR, and I would need to take another look. I suspect that the original codebase (and subsequently ours) doesn't handle batching as it should, leading to differences between single-sample and batched inference, i.e. in pseudo code:

audio_1
audio_2
audio_batch = [audio_1, audio_2]

single_sample_1 = dac(audio_1)
single_sample_2 = dac(audio_2)
batch = dac(audio_batch)

assertClose(single_sample_1, batch[0, : len(single_sample_1)])
assertClose(single_sample_2, batch[1, : len(single_sample_2)])

Hence, batching is probably still broken.
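
A more concrete (hedged) version of that check; the checkpoint name, the (batch, 1, time) input layout, and the audio_values attribute follow the current DAC docs as I understand them, not this PR:

import torch
from transformers import DacModel

model = DacModel.from_pretrained("descript/dac_16khz").eval()

audio_1 = torch.randn(1, 1, 16000)                                   # 1 s
audio_2 = torch.randn(1, 1, 8000)                                    # 0.5 s -> padded in the batch
audio_batch = torch.cat([audio_1, torch.nn.functional.pad(audio_2, (0, 8000))], dim=0)

with torch.no_grad():
    out_1 = model(audio_1).audio_values
    out_2 = model(audio_2).audio_values
    out_b = model(audio_batch).audio_values

# The concern: the second comparison (the shorter, padded sample) does not match,
# i.e. padding bleeds into the valid region rather than only into the padded tail.
print(torch.allclose(out_1, out_b[:1, ..., : out_1.shape[-1]], atol=1e-4))
print(torch.allclose(out_2, out_b[1:, ..., : out_2.shape[-1]], atol=1e-4))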

@ebezzam ebezzam added the Audio label Jul 22, 2025
@ebezzam ebezzam requested a review from eustlb July 22, 2025 13:47
Comment on lines 401 to 412
Code for reproducing expected outputs can be found here:
- Single file: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration_single-py
- Batched: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration-py
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
See https://github.com/huggingface/transformers/pull/39313 for the reason behind the large tolerance (1e-3) for encoder
and decoder outputs. In summary, the original model uses weight normalization, while the Transformers one does not. This
leads to accumulating error. However, this does not affect the quantizer codes, thanks to discretization being
robust to precision errors. The codec error is also similar between Transformers and the original.
Moreover, here is a script to debug outputs and weights layer-by-layer:
https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_layer_by_layer_debugging-py
"""
🙏

ebezzam commented Jul 23, 2025

run-slow: dac

eustlb commented Jul 23, 2025

run-slow: dac

This comment contains run-slow, running the specified jobs:

models: ['models/dac']
quantizations: [] ...

[For maintainers] Suggested jobs to run (before merge)

run-slow: dac

@ebezzam ebezzam merged commit 7a4e2e7 into huggingface:main Jul 23, 2025
19 checks passed
@ebezzam ebezzam deleted the dac_fix branch July 23, 2025 17:21
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* Fix DAC (slow) integration tests.

* Fix DAC conversion.

* Address comments

* Sync with main, uncomment nn.utils.parametrizations.weight_norm.

* Update DAC integration tests with expected outputs.

* Added info about encoder/decoder error and longer decoder outputs.

* Parameterize tests.

* Set expected values to GitHub runners.