
Why doesn't CTCDecodingConfig work? #13155


Closed

rastinrastinii opened this issue Apr 22, 2025 · 7 comments

@rastinrastinii

Hi,
I don't understand why applying the following CTC decoding config, or not applying it and only setting model.cfg.decoding.preserve_alignments = True and model.cfg.decoding.compute_timestamps = True, returns the same output. I expected better output when applying a language model.

Transcribe code:

result = model.transcribe([audio], batch_size=1)[0]

CTCDecoding code:

from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig()

decoding_cfg.strategy = "flashlight"
decoding_cfg.beam.search_type = "flashlight"
decoding_cfg.beam.kenlm_path = 'kenlm_3.arpa'
decoding_cfg.beam.flashlight_cfg.lexicon_path = '/mnt/cephfs/mshahsavari/data/voice_assistant_data/exp_dir/FastConformer-Transducer-BPE/2025-03-25_19-18-28/checkpoints/kenlm_3.lexicon'
decoding_cfg.beam.beam_size = 32
decoding_cfg.beam.beam_alpha = 0.2
decoding_cfg.beam.beam_beta = 0.2
decoding_cfg.beam.flashlight_cfg.beam_size_token = 32
decoding_cfg.beam.flashlight_cfg.beam_threshold = 25.0
decoding_cfg.beam.preserve_alignments = True
decoding_cfg.beam.compute_timestamps = True
model._timestamp_enabled = True

model.change_decoding_strategy(decoding_cfg)

Model config:

cfg:
  sample_rate: 16000
  log_prediction: true
  ctc_reduction: mean_volume
  skip_nan_grad: false
  model_defaults:
    enc_hidden: 1024
    pred_hidden: 640
    joint_hidden: 640
  train_ds:
    manifest_filepath: /home/mshahsavari/projects/recitation_verse_alignment/data/processed/train/metadata_train.json
    sample_rate: 16000
    batch_size: 8
    shuffle: true
    num_workers: 8
    pin_memory: true
    max_duration: 120
    min_duration: 0.1
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
  validation_ds:
    manifest_filepath: /home/mshahsavari/projects/recitation_verse_alignment/data/processed/validation/metadata_validation.json
    sample_rate: 16000
    batch_size: 8
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
  test_ds:
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 16
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
  tokenizer:
    dir: /home/mshahsavari/projects/recitation_verse_alignment/models/tokenizers/sentencepiece/tokenizer_spe_bpe_v128
    type: bpe
    model_path: /home/mshahsavari/projects/recitation_verse_alignment/models/tokenizers/sentencepiece/tokenizer_spe_bpe_v128/tokenizer.model
    vocab_path: /home/mshahsavari/projects/recitation_verse_alignment/models/tokenizers/sentencepiece/tokenizer_spe_bpe_v128/vocab.txt
    spe_tokenizer_vocab: /home/mshahsavari/projects/recitation_verse_alignment/models/tokenizers/sentencepiece/tokenizer_spe_bpe_v128/tokenizer.vocab
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: 16000
    normalize: per_feature
    window_size: 0.025
    window_stride: 0.01
    window: hann
    features: 80
    n_fft: 512
    log: true
    frame_splicing: 1
    dither: 1.0e-05
    pad_to: 0
    pad_value: 0.0
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2
    time_masks: 10
    freq_width: 27
    time_width: 0.05
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: 80
    feat_out: -1
    n_layers: 24
    d_model: 1024
    subsampling: dw_striding
    subsampling_factor: 8
    subsampling_conv_channels: 256
    causal_downsampling: false
    ff_expansion_factor: 4
    self_attention_model: rel_pos
    n_heads: 8
    att_context_size:
    - -1
    - -1
    att_context_style: regular
    xscaling: true
    untie_biases: true
    pos_emb_max_len: 5000
    conv_kernel_size: 9
    conv_norm_type: batch_norm
    conv_context_size: null
    dropout: 0.1
    dropout_pre_encoder: 0.1
    dropout_emb: 0.0
    dropout_att: 0.1
    stochastic_depth_drop_prob: 0.0
    stochastic_depth_mode: linear
    stochastic_depth_start_layer: 1
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: 1024
    num_classes: 128
    vocabulary:
    - <unk>
    - ▁ا
    - ▁ال
    - ▁و
    - ▁م
    ...
  interctc:
    loss_weights: []
    apply_at_layers: []
  optim:
    name: adamw
    lr: 0.001
    betas:
    - 0.9
    - 0.98
    weight_decay: 0.001
    sched:
      name: CosineAnnealing
      warmup_steps: 15000
      warmup_ratio: null
      min_lr: 0.0001
  target: nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE
  nemo_version: 1.19.0rc0
  decoding:
    strategy: greedy
    preserve_alignments: null
    compute_timestamps: null
    word_seperator: ' '
    segment_seperators:
    - .
    - '!'
    - '?'
    segment_gap_threshold: null
    ctc_timestamp_type: all
    batch_dim_index: 0
    greedy:
      preserve_alignments: false
      compute_timestamps: false
      preserve_frame_confidence: false
      confidence_method_cfg: null
    beam:
      beam_size: 4
      search_type: default
      preserve_alignments: false
      compute_timestamps: false
      return_best_hypothesis: true
      beam_alpha: 1.0
      beam_beta: 0.0
      kenlm_path: null
      flashlight_cfg:
        lexicon_path: null
        boost_path: null
        beam_size_token: 16
        beam_threshold: 20.0
        unk_weight: -.inf
        sil_weight: 0.0
      pyctcdecode_cfg:
        beam_prune_logp: -10.0
        token_min_logp: -5.0
        prune_history: false
        hotwords: null
        hotword_weight: 10.0
    wfst:
      beam_size: 4
      search_type: riva
      return_best_hypothesis: true
      preserve_alignments: false
      compute_timestamps: false
      decoding_mode: nbest
      open_vocabulary_decoding: false
      beam_width: 10.0
      lm_weight: 1.0
      device: cuda
      arpa_lm_path: null
      wfst_lm_path: null
      riva_decoding_cfg: {}
      k2_decoding_cfg:
        search_beam: 20.0
        output_beam: 10.0
        min_active_states: 30
        max_active_states: 10000
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: 0.33
    temperature: 1.0

After changing the decoding strategy:

[NeMo I 2025-04-22 10:59:53 ctc_bpe_models:357] Changed decoding strategy to 
    strategy: flashlight
    preserve_alignments: null
    compute_timestamps: null
    word_seperator: ' '
    segment_seperators:
    - .
    - '!'
    - '?'
    segment_gap_threshold: null
    ctc_timestamp_type: all
    batch_dim_index: 0
    greedy:
      preserve_alignments: false
      compute_timestamps: false
      preserve_frame_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    beam:
      beam_size: 32
      search_type: flashlight
      preserve_alignments: true
      compute_timestamps: true
      return_best_hypothesis: true
      beam_alpha: 0.2
      beam_beta: 0.2
      kenlm_path: /home/mshahsavari/projects/recitation_verse_alignment/models/language_models/our_kenlm_generated/quran_kenlm_3.arpa
      flashlight_cfg:
        lexicon_path: /mnt/cephfs/mshahsavari/data/voice_assistant_data/exp_dir/FastConformer-Transducer-BPE/2025-03-25_19-18-28/checkpoints/quran_kenlm_3.lexicon
        boost_path: null
        beam_size_token: 32
        beam_threshold: 25.0
        unk_weight: -.inf
        sil_weight: 0.0
      pyctcdecode_cfg:
        beam_prune_logp: -10.0
        token_min_logp: -5.0
        prune_history: false
        hotwords: null
        hotword_weight: 10.0
    wfst:
      beam_size: 4
      search_type: riva
      return_best_hypothesis: true
      preserve_alignments: false
      compute_timestamps: false
      decoding_mode: nbest
      open_vocabulary_decoding: false
      beam_width: 10.0
      lm_weight: 1.0
      device: cuda
      arpa_lm_path: null
      wfst_lm_path: null
      riva_decoding_cfg: {}
      k2_decoding_cfg:
        search_beam: 20.0
        output_beam: 10.0
        min_active_states: 30
        max_active_states: 10000
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    temperature: 1.0
@artbataev
Collaborator

@rastinrastinii Do you observe improvements without setting model.cfg.decoding.preserve_alignments = True and model.cfg.decoding.compute_timestamps = True?

For now, it's unclear whether this is a bug or whether you simply don't observe an improvement with the LM.
In my experience, I usually don't observe an improvement from an LM trained on similar in-domain data.

@rastinrastinii
Author

I don't see an improvement with it, but when I use pyctcdecode separately, I do see an improvement.

@lilithgrigoryan
Collaborator

Hi @rastinrastinii.

  1. Can you please clarify which two setups you are comparing? Are you comparing pyctcdecode from another library with Flashlight in NeMo? Have you tried running pyctcdecode decoding in NeMo? If not, you can run it by setting decoding_cfg.strategy = "pyctcdecode". Also, since the two decoding implementations differ slightly, it's possible to see an improvement with pyctcdecode while observing no change with Flashlight on a single audio sample. Is your test set big enough?

  2. I haven’t encountered any issues when setting model.cfg.decoding.preserve_alignments = True and model.cfg.decoding.compute_timestamps = True. What lexicon are you using? Since your model uses subword tokenization, decoding with Flashlight requires a custom lexicon that maps subword-based KenLM characters to subword tokens. You can use the following code snippet to generate this custom lexicon:

from nemo.collections.asr.parts.submodules.ngram_lm import DEFAULT_TOKEN_OFFSET
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained("nvidia/stt_en_fastconformer_ctc_large")
lexicon_filepath = "custom.lexicon"  # set path to lexicon file here

def create_lexicon(tokenizer, lexicon_filepath):
    """Creates a lexicon file mapping offset-encoded KenLM words to subword tokens"""
    with open(lexicon_filepath, 'w') as f:
        for token_id in range(tokenizer.vocab_size):
            word = chr(token_id + DEFAULT_TOKEN_OFFSET)
            tokens = tokenizer.ids_to_tokens([token_id])
            f.write(f"{word} " + " ".join(tokens) + '\n')

create_lexicon(model.tokenizer, lexicon_filepath)
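
To tie this back to your earlier setup, here is a sketch (the ARPA filename is hypothetical) of pointing the Flashlight decoder at the generated lexicon; the KenLM model must be built over the same offset-encoded subword "words":

from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig()
decoding_cfg.strategy = "flashlight"
decoding_cfg.beam.search_type = "flashlight"
decoding_cfg.beam.kenlm_path = "subword_kenlm_3.arpa"  # hypothetical subword-level ARPA LM
decoding_cfg.beam.flashlight_cfg.lexicon_path = lexicon_filepath  # the file generated above
model.change_decoding_strategy(decoding_cfg)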

Please try decoding with this lexicon. If you still don’t see any improvements, we’ll investigate further.

@rastinrastinii
Author

Thanks for your reply, but when I set decoding_cfg.strategy = "pyctcdecode" I get:

Currently this flag is not supported for beam search algorithms.

@lilithgrigoryan
Collaborator

lilithgrigoryan commented May 11, 2025

Could you please try setting decoding_cfg.compute_timestamps = False and decoding_cfg.strategy = "pyctcdecode"?

This is to prevent the error you ran into. Even with compute_timestamps=False you will still get timestamps (for the pyctcdecode strategy). Consider this a temporary workaround; I'll share a proper fix soon.

For now, let's verify that decoding with the language model is working and actually influencing the output. Refer to the snippet below:

from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig
from nemo.collections.asr.models import ASRModel

# Load pre-trained ASR model
model = ASRModel.from_pretrained("nvidia/stt_en_fastconformer_ctc_large")

# Initialize a new decoding configuration
decoding_cfg = CTCDecodingConfig()

# Set decoding strategy to pyctcdecode
decoding_cfg.strategy = "pyctcdecode"

# Configure beam search parameters
decoding_cfg.beam.kenlm_path = "word_kenlm.arpa"  # hypothetical path; must be a word-level LM built with original KenLM tools
decoding_cfg.beam.beam_size = 32
decoding_cfg.beam.beam_alpha = 0.2
decoding_cfg.beam.beam_beta = 0.2

# Disable timestamps and alignments
decoding_cfg.preserve_alignments = False
decoding_cfg.compute_timestamps = False

# Apply the new decoding strategy to the model
model.change_decoding_strategy(decoding_cfg)

# Transcribe audio (audios is a list of audio file paths)
audios = ["audio.wav"]  # hypothetical example path
hyps = model.transcribe(audios, batch_size=32, return_hypotheses=True)

Note: The language model must be a word-level KenLM built using the original KenLM binary tools.
You can build it with the following command:

<kenlm_bin_path>/lmplz -o <ngram_order> --arpa <output.arpa> --prune <prune_levels> < <text_corpus>
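
For example (hypothetical paths), a pruned 3-gram word-level model built from a plain-text corpus, with lmplz reading the corpus from stdin:

kenlm/build/bin/lmplz -o 3 --arpa word_kenlm_3.arpa --prune 0 1 1 < corpus.txt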

Please let me know once you've tested it, or if you need help building the KenLM model.

@rastinrastinii
Author

Thanks, that worked. But how can I get timestamps while preserve_alignments = False and compute_timestamps = False? The output is only a simple tuple of time steps for each word. How do I get timestamps in seconds or milliseconds?

@lilithgrigoryan
Collaborator

lilithgrigoryan commented May 12, 2025

Happy to help!

Currently, pyctcdecode supports only word-level timestamps. However, we're working on adding CTC beam search with more detailed timing support and will reply to this thread once we're done. In the meantime, greedy decoding can be used to obtain full timestamp information, as in the sketch below.
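
A minimal sketch of greedy decoding with word timestamps converted to seconds, assuming a recent NeMo version where hypotheses expose a timestamp dict (older releases used hyp.timestep); the model name and audio path are illustrative, and the frame-to-seconds conversion uses the 0.01 s feature stride and 8x encoder subsampling from the config above:

from omegaconf import open_dict
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained("nvidia/stt_en_fastconformer_ctc_large")

# Switch to greedy decoding with timestamps enabled
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy"
    decoding_cfg.preserve_alignments = True
    decoding_cfg.compute_timestamps = True
model.change_decoding_strategy(decoding_cfg)

hyps = model.transcribe(["audio.wav"], return_hypotheses=True)

# Offsets are in encoder frames: seconds = offset * window_stride * subsampling_factor
time_stride = model.cfg.preprocessor.window_stride * 8  # 0.01 s * 8 = 0.08 s per frame
for word in hyps[0].timestamp["word"]:
    start = word["start_offset"] * time_stride
    end = word["end_offset"] * time_stride
    print(f"{word['word']}: {start:.2f}s - {end:.2f}s")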

Please feel free to reopen the issue if you have any further questions or need additional help.
