Fix DAC integration tests and checkpoint conversion. #39313
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Wow, very nice and much needed :D Just a general question regarding batching:
Regarding the …
Just some initial thoughts on my side
@vasqu thanks for the quick comments! Here's the error I get with checkpoint conversion when uncommenting `nn.utils.parametrizations.weight_norm`.
@vasqu thanks for linking the latter PR, they seem to have had the same issue with DAC conversion. And their fix is similar to mine (not using …).

Which approach is better?
Thanks a lot @ebezzam, as @vasqu mentioned, this was much needed.
Left comments, but mainly I feel like these integration tests were poorly designed in the beginning. It might be naïve, but I just don't get why we take means of the outputs!
Can we look into redesigning them so that we do `EXPECTED_OUTPUTS = torch.tensor([[...]])` and `torch.allclose` directly on the output tensors (so that we also catch shape mismatches!)?
(This will very likely give different outputs on different GPUs, hence the importance of the reproducers.)
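For illustration, a minimal sketch of what such a test could look like (hypothetical: the checkpoint name, output attribute, slice, and expected values are placeholders, not the final test):

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, DacModel


def test_integration_16khz():
    model = DacModel.from_pretrained("descript/dac_16khz").eval()
    processor = AutoProcessor.from_pretrained("descript/dac_16khz")

    sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]
    inputs = processor(
        raw_audio=sample["audio"]["array"],
        sampling_rate=processor.sampling_rate,
        return_tensors="pt",
    )

    with torch.no_grad():
        decoded = model(inputs["input_values"]).audio_values

    # Placeholder values: they would be regenerated from the reproducer gists on the
    # reference hardware. Comparing directly against (a slice of) the raw output
    # tensor, rather than its mean, also catches shape mismatches.
    EXPECTED_OUTPUTS = torch.tensor([0.0011, -0.0003, 0.0009, 0.0021])
    assert torch.allclose(decoded.reshape(-1)[:4], EXPECTED_OUTPUTS, atol=1e-3)
```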
Moreover (as pointed out by @vasqu), #36393 looks like a correct fix to the deprecated weight_norm issue. Therefore I'll leave you to review 😉
Another comment about batch integration:
@vasqu I think this is due to the fact that (as discussed offline) conv networks with bias add bias to padding values, and the errors propagate. I originally wanted to mask conv outputs at padded positions to reduce the impact as much as possible (and ensure equivalence between batched and single-sample inference), but it looks like it is not really impacting outputs, so we can likely skip doing this, especially if it is not done in the original codebase for their batch inference.
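For reference, a minimal sketch of what masking conv outputs at padded positions could look like (hypothetical, with a toy Conv1d and a lengths tensor; not the DAC code):

```python
import torch
import torch.nn as nn

# Toy batch: two sequences of different lengths, zero-padded to a common length.
conv = nn.Conv1d(1, 8, kernel_size=7, padding=3, bias=True)
x = torch.zeros(2, 1, 100)
lengths = torch.tensor([100, 60])
x[0, 0] = torch.randn(100)
x[1, 0, :60] = torch.randn(60)

out = conv(x)

# Because the conv has a bias (and a receptive field that crosses the padding
# boundary), positions past each sequence's true length are no longer zero.
# Zeroing them between layers limits how far that contamination propagates.
positions = torch.arange(out.shape[-1])
mask = (positions[None, :] < lengths[:, None]).unsqueeze(1).to(out.dtype)  # (batch, 1, time)
out = out * mask
```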
My problem is that batched inference might be really broken, not in the sense that errors on the padded values accumulate (I agree we can ignore this), but that the original values are also affected by the padding. I had a script where I tested a two-sample batch against each sample individually and compared the (non-padding) values, and it failed to match on the shorter sequence. I'm unsure if the original codebase is correct here tbh, and during discussions with people from Dia they also said that batching with Dac didn't work well. My concern is that there might be a pretty big bug somewhere which we would be ignoring. However, this should not be part of this PR -- I commented before looking into the code at that time :D
@eustlb thanks for the comments!
@vasqu re your original question (cross-check against the original implementation?): yes, all checks were against the original DAC model and codebase.
@vasqu is this a test we want to add? Here or in another PR?
@ebezzam I think this should be another PR and I would need to take another look. I suspect that the original codebase (and subsequently ours) doesn't handle batching as it should, and that it leads to differences between single-sample and batched inference, i.e. pseudo-code along the lines of the sketch below.
Hence, batching is probably still broken.
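For example, a comparison along these lines (a hedged sketch: the checkpoint name, `audio_values` attribute, and tolerance are assumptions, not the script referenced above):

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, DacModel

model = DacModel.from_pretrained("descript/dac_16khz").eval()
processor = AutoProcessor.from_pretrained("descript/dac_16khz")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audios = [ds[0]["audio"]["array"], ds[1]["audio"]["array"]]  # two samples of different lengths

# Batched pass: the shorter sample gets padded by the feature extractor.
batch = processor(raw_audio=audios, sampling_rate=processor.sampling_rate, padding=True, return_tensors="pt")
with torch.no_grad():
    batched = model(batch["input_values"]).audio_values

# Single-sample passes, compared against the non-padded region of the batched output.
for i, audio in enumerate(audios):
    inputs = processor(raw_audio=audio, sampling_rate=processor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        single = model(inputs["input_values"]).audio_values
    n = single.shape[-1]
    # If padding leaks into the real (non-padded) positions, this check fails.
    print(i, torch.allclose(batched[i, ..., :n], single[0], atol=1e-4))
```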
Code for reproducing expected outputs can be found here:
- Single file: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration_single-py
- Batched: https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_integration-py

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

See https://github.com/huggingface/transformers/pull/39313 for the reason behind the large tolerance (1e-3) for encoder
and decoder outputs. In summary, the original model uses weight normalization, while Transformers does not. This
leads to accumulating error. However, this does not affect the quantizer codes, thanks to discretization being
robust to precision errors. Moreover, the codec error is similar between Transformers and the original.

Moreover, here is a script to debug outputs and weights layer-by-layer:
https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa#file-dac_layer_by_layer_debugging-py
"""
🙏
run-slow: dac
This comment contains run-slow, running the specified jobs: models: ['models/dac']
run-slow: dac
This comment contains run-slow, running the specified jobs: models: ['models/dac']
[For maintainers] Suggested jobs to run (before merge): run-slow: dac
* Fix DAC (slow) integration tests.
* Fix DAC conversion.
* Address comments
* Sync with main, uncomment nn.utils.parametrizations.weight_norm.
* Update DAC integration tests with expected outputs.
* Added info about encoder/decoder error and longer decoder outputs.
* Parameterize tests.
* Set expected values to GitHub runners.
What does this PR do?
Multiple things were wrong with the tests:
The expected outputs. I created this gist to reproduce new expected outputs (as it was not possible to reproduce the previous ones): https://gist.github.com/ebezzam/bb315efa7a416db6336a6b2a2d424ffa
Hop length was incorrectly set on the Hub for 16kHz and 24kHz (UPDATE: corrected from 512 to 320 thanks to a merged PR by the Descript team). I've corrected it in the conversion script for future use. Below are the test outputs when the hop length is incorrect (3/6 tests fail):
# RUN_SLOW=1 pytest tests/models/dac/test_modeling_dac.py::DacIntegrationTest
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_16khz FAILED [ 16%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_24khz PASSED [ 33%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_44khz PASSED [ 50%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_16khz FAILED [ 66%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_24khz FAILED [ 83%]
tests/models/dac/test_modeling_dac.py::DacIntegrationTest::test_integration_batch_44khz PASSED [100%]
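As an illustration of the Hub-side fix, a quick way to check the value a checkpoint ships with (a sketch; it assumes the hop length lives on the feature extractor of the `descript/dac_16khz` repo):

```python
from transformers import DacFeatureExtractor

# hop_length is stored in preprocessor_config.json on the Hub.
feature_extractor = DacFeatureExtractor.from_pretrained("descript/dac_16khz")
print(feature_extractor.hop_length)  # should now read 320 for the 16 kHz / 24 kHz checkpoints (was 512)
```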
Also, I've standardized the tests (24kHz was testing something else) and added tests on quantizer and decoder outputs.
Note on high tolerances for encoder and decoder
Previously (and still now) the tests for the encoder outputs have a high tolerance (1e-3). With this script, I've verified that the weights have been mapped correctly (output snippet below).
However, the error increases exponentially through the encoder and decoder layers.
From my understanding, this is because the Transformers version of DAC does NOT have weight normalization in its architecture, while the original version does (see the model addition PR for discussion as to why there is no weight normalization in the Transformers version). This causes small differences in the outputs at each layer, which get larger and larger as tensors go deeper in the network.
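To make the weight-norm point concrete, here is a minimal sketch (a toy Conv1d with the `torch.nn.utils.parametrizations.weight_norm` API, not the actual DAC conversion code) of how a weight-normalized layer gets folded into a plain one during conversion:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Original-style layer: conv with weight normalization; its state dict stores the
# magnitude (g) and direction (v) separately instead of a single weight tensor.
conv_wn = nn.utils.parametrizations.weight_norm(nn.Conv1d(16, 16, kernel_size=3, padding=1))
g = conv_wn.parametrizations.weight.original0  # magnitude, shape (out_channels, 1, 1)
v = conv_wn.parametrizations.weight.original1  # direction, shape (out_channels, in_channels, kernel_size)

# Transformers-style layer: a plain conv, so conversion has to fold w = g * v / ||v||.
conv_plain = nn.Conv1d(16, 16, kernel_size=3, padding=1)
with torch.no_grad():
    conv_plain.weight.copy_(g * v / v.norm(dim=(1, 2), keepdim=True))
    conv_plain.bias.copy_(conv_wn.bias)

# The fold is exact in math; any residual is floating-point rounding, and such
# tiny per-layer differences compound through a deep encoder/decoder.
x = torch.randn(1, 16, 100)
print((conv_wn(x) - conv_plain(x)).abs().max())
```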
Below is an output snippet of the error propagation through the encoder for the 44.1kHz model, calculated with the same script.
`Conv1` already has weight normalization in the original model, and we see a minimal error (precision-limited). `Block0` has 7 layers with weight norm (see the original model), and that's where we get the big jump in deviation from the Transformers model -- 1897x. We also have to keep the decoder test tolerances quite high for the same reason: propagation of the weight normalization error.
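For reference, a generic way to get this kind of layer-by-layer comparison (a sketch using forward hooks on leaf modules; the actual script used here is the layer-by-layer debugging gist linked above):

```python
import torch


def capture_activations(model, inputs):
    """Run `model` on `inputs` and record the (first) output of every leaf submodule."""
    activations, handles = {}, []

    def make_hook(name):
        def hook(module, args, output):
            if torch.is_tensor(output):
                activations.setdefault(name, output.detach())
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(inputs)
    for handle in handles:
        handle.remove()
    return activations


# Usage sketch: `ours` / `theirs` would be the Transformers and original DAC models
# (with a mapping between layer names), fed the same input tensor `x`:
# ours_acts, theirs_acts = capture_activations(ours, x), capture_activations(theirs, x)
# for name, act in ours_acts.items():
#     if name in theirs_acts:
#         print(name, (act - theirs_acts[name]).abs().max().item())
```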
Fortunately, we can still use the Transformers version as a valid approximation
Because:
- differences with the original implementation are within 1e-6, and
- within 1e-6 for the codec error.