kokoro TTS Supported axcl backend #3270

Open
Abandon-ht wants to merge 2 commits into k2-fsa:master from Abandon-ht:kokoro_axcl

Conversation

@Abandon-ht (Contributor) commented Mar 9, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added AXCL hardware accelerator support for offline text-to-speech with Kokoro model, enabling voice/speaker selection and speed control for accelerated audio generation.

@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Mar 9, 2026
coderabbitai Bot commented Mar 9, 2026

📝 Walkthrough

The changes add AXCL accelerator support for offline text-to-speech generation with the Kokoro model. This includes new model implementations using AXCL components, an additional AxclModel API overload for uint8_t input data, conditional CMake configuration, and integration with the offline TTS factory system.

Changes

  • AXCL Model API Enhancement (sherpa-onnx/csrc/axcl/axcl-model.h, sherpa-onnx/csrc/axcl/axcl-model.cc)
    Added a new SetInputTensorData overload accepting a uint8_t* input pointer alongside the existing float and int32_t overloads.
  • AXCL Kokoro TTS Model Implementation (sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h, sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc)
    Introduced the OfflineTtsKokoroModelAxcl class, which implements a multi-model AXCL-based Kokoro TTS pipeline. It orchestrates voice embedding loading and multi-stage inference across AXCL models and ONNX Runtime sessions, including duration prediction, alignment processing, and audio synthesis with post-processing.
  • AXCL TTS Backend Integration (sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h, sherpa-onnx/csrc/offline-tts-impl.cc)
    Added the template class OfflineTtsKokoroAxclImpl for the AXCL-enabled Kokoro backend, with frontend initialization, text normalization, token generation, and batch processing. Integrated an AXCL provider pathway into the OfflineTtsImpl::Create factory method behind conditional compilation.
  • Build Configuration (sherpa-onnx/csrc/CMakeLists.txt)
    Added a conditional compilation rule that includes the AXCL TTS source file when both SHERPA_ONNX_ENABLE_AXCL and SHERPA_ONNX_ENABLE_TTS are enabled.

Sequence Diagram

sequenceDiagram
    participant Client
    participant OfflineTtsImpl as OfflineTtsImpl<br/>(Factory)
    participant OfflineTtsKokoroAxclImpl as OfflineTtsKokoroAxclImpl<br/>(Backend)
    participant TextNormalizer
    participant OfflineTtsFrontend as OfflineTtsFrontend<br/>(Tokenizer)
    participant OfflineTtsKokoroModelAxcl as OfflineTtsKokoroModelAxcl<br/>(Model)
    participant AxclModel as AxclModel<br/>(AXCL Inference)
    participant OnnxSession as OnnxSession<br/>(ONNX Runtime)

    Client->>OfflineTtsImpl: Create(config with provider="axcl")
    OfflineTtsImpl->>OfflineTtsKokoroAxclImpl: new OfflineTtsKokoroAxclImpl(config)
    OfflineTtsKokoroAxclImpl->>OfflineTtsKokoroAxclImpl: InitFrontend()

    Client->>OfflineTtsKokoroAxclImpl: Generate(text, sid, speed)
    OfflineTtsKokoroAxclImpl->>TextNormalizer: Normalize text
    TextNormalizer-->>OfflineTtsKokoroAxclImpl: normalized_text
    OfflineTtsKokoroAxclImpl->>OfflineTtsFrontend: Convert to tokens
    OfflineTtsFrontend-->>OfflineTtsKokoroAxclImpl: token_ids
    OfflineTtsKokoroAxclImpl->>OfflineTtsKokoroAxclImpl: Batch tokens
    OfflineTtsKokoroAxclImpl->>OfflineTtsKokoroModelAxcl: Run(input_ids, sid, speed)
    
    OfflineTtsKokoroModelAxcl->>AxclModel: SetInputTensorData(uint8_t*)
    AxclModel->>AxclModel: Run() [Model1: Duration/Alignment]
    AxclModel->>AxclModel: Run() [Model2: Refinement]
    AxclModel->>OnnxSession: Run() [Model4: F0/Embedding]
    AxclModel->>AxclModel: Run() [Model3: Final Processing]
    AxclModel-->>OfflineTtsKokoroModelAxcl: raw_audio
    
    OfflineTtsKokoroModelAxcl->>OfflineTtsKokoroModelAxcl: Post-process (iSTFT, trim, adjust)
    OfflineTtsKokoroModelAxcl-->>OfflineTtsKokoroAxclImpl: audio_waveform
    OfflineTtsKokoroAxclImpl-->>Client: GeneratedAudio

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • csukuangfj

Poem

🐰 AXCL accelerators spin,
Kokoro's voice now rushing in,
Models dance in quartet sound—
uint8_t paths are newly found,
TTS magic comes alive! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately describes the primary change, adding AXCL backend support for Kokoro TTS, which is demonstrated across all modified files.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Text-to-Speech capabilities by integrating support for the AXCL backend with Kokoro models. This change allows Kokoro TTS models to leverage AXCL hardware for potentially faster and more efficient inference, broadening the range of supported deployment environments. The underlying implementation has been made more flexible through templating, ensuring maintainability and extensibility for future backend integrations.

Highlights

  • AXCL Backend Support: Added support for the AXCL backend for Kokoro Text-to-Speech (TTS) models, enabling specialized hardware acceleration.
  • New Model Implementation: Introduced a new OfflineTtsKokoroModelAxcl class to encapsulate the inference logic for Kokoro TTS models running on the AXCL backend.
  • Templated Refactoring: Refactored the OfflineTtsKokoroImpl class using C++ templates to support both the original ONNX-based Kokoro models and the new AXCL-based Kokoro models within a unified interface.
  • AXCL Model Input Extension: Extended the AxclModel with a new SetInputTensorData overload to accommodate uint8_t tensor inputs, specifically for handling text masks in the Kokoro model.
  • Integration into Factory: Integrated the AXCL Kokoro model into the OfflineTtsImpl factory, allowing dynamic selection based on the config.model.provider setting.
Changelog
  • sherpa-onnx/csrc/CMakeLists.txt
    • Added offline-tts-kokoro-model-axcl.cc to the source list, conditional on SHERPA_ONNX_ENABLE_AXCL and SHERPA_ONNX_ENABLE_TTS.
  • sherpa-onnx/csrc/axcl/axcl-model.cc
    • Implemented a new SetInputTensorData method that accepts uint8_t pointers.
  • sherpa-onnx/csrc/axcl/axcl-model.h
    • Declared a new SetInputTensorData method that accepts uint8_t pointers.
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
    • Implemented the OfflineTtsKokoroModelAxcl class, including model initialization, voice embedding loading, and the multi-stage inference process for Kokoro TTS on AXCL.
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h
    • Declared the OfflineTtsKokoroModelAxcl class.
  • sherpa-onnx/csrc/offline-tts-impl.cc
    • Included offline-tts-kokoro-model-axcl.h conditionally.
    • Modified the Create method to instantiate OfflineTtsKokoroImpl with OfflineTtsKokoroModelAxcl when config.model.provider is "axcl".
    • Updated the default OfflineTtsKokoroImpl instantiation to use OfflineTtsKokoroModel via a template.
  • sherpa-onnx/csrc/offline-tts-kokoro-impl.h
    • Included <type_traits>.
    • Templated OfflineTtsKokoroImpl to accept a KokoroModel type.
    • Updated constructors to use the templated KokoroModel.
    • Modified the Generate method to conditionally handle different model_->Run return types based on the KokoroModel type using if constexpr.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist Bot left a comment

Code Review

This pull request adds support for the Kokoro TTS model on the AXCL backend, introducing a new OfflineTtsKokoroModelAxcl implementation and modifying OfflineTtsImpl. A critical security concern is the lack of sufficient validation for user-supplied parameters such as sid (speaker ID) and speed, which can lead to out-of-bounds memory access and Denial of Service (DoS) via Out-Of-Memory (OOM) crashes. Furthermore, the AXCL model implementation appears to be a direct port with several magic numbers and areas for performance optimization, alongside some minor style issues.

Comment on lines +219 to +226
const float *p = styles_.data() + sid * style_dim0 * style_dim1 +
phoneme_len * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
} else {
int32_t idx = style_dim0 / 2;
const float *p =
styles_.data() + sid * style_dim0 * style_dim1 + idx * style_dim1;
std::copy(p, p + style_dim1, ref_s.begin());
Severity: security-medium

The sid (speaker ID) parameter is used to calculate a pointer offset into the styles_ vector without any bounds checking. If an invalid sid is provided (e.g., a value greater than or equal to num_speakers), it will result in an out-of-bounds read. This can lead to a process crash (Denial of Service) or potentially leak sensitive information from the process memory into the generated audio output. Additionally, if sid is large enough, the calculation sid * style_dim0 * style_dim1 could overflow the int32_t type, leading to unpredictable memory access.

Comment on lines +381 to +382
sum /= speed;
pred_dur_original[i] = static_cast<int32_t>(std::max(1.f, std::round(sum)));
Severity: security-medium

The speed parameter is used as a divisor in ProcessDuration without validation. If speed is zero, a division by zero occurs, resulting in an infinite value. If speed is an extremely small positive value, the resulting duration values (pred_dur_original) will be very large. This can lead to an excessively large total_frames value, which is subsequently used to allocate the en vector on line 298. An attacker providing a maliciously small speed value could trigger an Out-Of-Memory (OOM) condition, leading to a Denial of Service (DoS) crash.


bool SetInputTensorData(const std::string &name, const int32_t *p,
int32_t n) const;

Severity: medium

This blank line is unnecessary and can be removed for better code formatting.

std::string model1_path = model_path + "/kokoro_part1.axmodel";
std::string model2_path = model_path + "/kokoro_part2.axmodel";
std::string model3_path = model_path + "/kokoro_part3.axmodel";
std::string model4_path = model_path + "/kokoro_part4.onnx";
Severity: medium

There is a trailing space on this line. Please remove it to maintain code style consistency. There are several other lines with trailing spaces in this file (e.g., 172, 282, 296, 375).

    std::string model4_path = model_path + "/kokoro_part4.onnx";

Comment on lines +213 to +214
int32_t style_dim0 = 510; // max_token_len
int32_t style_dim1 = 256;
Severity: medium

There are several magic numbers used in this file (e.g., 510, 256 here, and others like 640, 50, 20, 5, 32 elsewhere). Using named constants would improve readability and maintainability. For instance, style_dim0 is commented as max_token_len, so you could use meta_data_.max_token_len which is initialized to 510 in LoadVoices. The value 256 should also be a named constant.

Comment on lines +300 to +308
for (int32_t j = 0; j < d_shape_2; ++j) {
for (int32_t k = 0; k < total_frames; ++k) {
float sum = 0.0f;
for (int32_t i = 0; i < max_seq_len_; ++i) {
sum += d[i * d_shape_2 + j] * pred_aln_trg[i * total_frames + k];
}
en[j * total_frames + k] = sum;
}
}
Severity: medium

The current loop order for this matrix multiplication (en = d.T @ pred_aln_trg) might not be optimal for cache performance. Access to en is not contiguous within the inner loops. Consider reordering the loops to improve data locality, for instance by iterating over i, then k, then j.

Comment on lines +145 to +146
} // for (const auto &f : files)
} // if (!config.rule_fars.empty())
Severity: medium

These comments are redundant as the closing braces are close to their opening statements and properly indented. Consider removing them for cleaner code.

Suggested change
} // for (const auto &f : files)
} // if (!config.rule_fars.empty())
}
}

std::unique_ptr<OfflineTtsImpl> OfflineTtsImpl::Create(
const OfflineTtsConfig &config) {
if (config.model.provider == "axcl") {
#if SHERPA_ONNX_ENABLE_AXCL
Collaborator

Please also update line 94.

https://github.com/Abandon-ht/sherpa-onnx/blob/0aa35ef05261367979f43ca5f82470dfec701b9a/sherpa-onnx/csrc/offline-tts-impl.cc#L94

If axcl does not support Android, please just print an error message and return nullptr.

@@ -0,0 +1,37 @@
// sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h
//
// Copyright (c) 2025 Xiaomi Corporation
Collaborator

Please update it with your info.

void Init(const OfflineTtsModelConfig &config) {
std::string model_path = config.kokoro.model;

// kokoro axcl mode we assume config.kokoro.model is a directory containing models
Collaborator

Please update the kokoro model config validation function to validate that the required files exist when axcl is used.

@coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sherpa-onnx/csrc/offline-tts-impl.cc (1)

93-105: ⚠️ Potential issue | 🟠 Major

Mirror the AXCL provider branch in the manager overload.

Create(mgr, config) never checks config.model.provider, so on Line 104 it always builds OfflineTtsKokoroModel even though the non-manager overload routes provider == "axcl" to OfflineTtsKokoroModelAxcl. The AXCL model already has a manager constructor, so the Android/OHOS resource-manager path is currently ignoring the AXCL backend entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 93 - 105, The
manager-overload OfflineTtsImpl::Create (template<Manager>) ignores
config.model.provider when selecting the kokoro branch and always constructs
OfflineTtsKokoroImpl<OfflineTtsKokoroModel>; update that branch to mirror the
non-manager overload by checking config.model.provider (e.g., provider ==
"axcl") and construct OfflineTtsKokoroImpl<OfflineTtsKokoroModelAxcl> when
provider is "axcl", otherwise construct
OfflineTtsKokoroImpl<OfflineTtsKokoroModel>; ensure you use the existing
manager-capable constructors for OfflineTtsKokoroImpl so the AXCL backend is
reachable via the manager path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ddfb20e-5c39-4d30-b615-ca31ce07720d

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2fc65 and 0aa35ef.

📒 Files selected for processing (7)
  • sherpa-onnx/csrc/CMakeLists.txt
  • sherpa-onnx/csrc/axcl/axcl-model.cc
  • sherpa-onnx/csrc/axcl/axcl-model.h
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.h
  • sherpa-onnx/csrc/offline-tts-impl.cc
  • sherpa-onnx/csrc/offline-tts-kokoro-impl.h

Comment on lines +194 to +210
void LoadVoices(const char *voices_data, size_t voices_data_length) {
// get metadata
meta_data_.sample_rate = 24000;
meta_data_.num_speakers = voices_data_length / (sizeof(float) * 256 * 510);
meta_data_.version = 1;
meta_data_.has_espeak = 1;
meta_data_.voice = "en-us";
meta_data_.max_token_len = 510;

int32_t actual_num_floats = voices_data_length / sizeof(float);
styles_ = std::vector<float>(
reinterpret_cast<const float *>(voices_data),
reinterpret_cast<const float *>(voices_data) + actual_num_floats);

auto shape1 = model1_->TensorShape("input_ids");
max_seq_len_ = shape1[1];
}
⚠️ Potential issue | 🟠 Major

Derive Kokoro metadata from the bundle instead of hardcoding it.

Lines 196-201 pin sample_rate, version, voice, and max_token_len, while Line 197 floor-divides the voices blob to infer num_speakers. OfflineTtsKokoroImpl::InitFrontend() branches on version and voice, so this AXCL path will mis-handle multilingual/v2 exports and also won't detect a truncated voices file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 194 -
210, LoadVoices currently hardcodes meta_data_ fields and infers num_speakers by
a brittle division; instead parse the kokoro bundle header from the start of
voices_data to populate meta_data_. Specifically, in LoadVoices read the
bundle's metadata/header (sample_rate, version, voice string, max_token_len and
the per-speaker embedding length) and validate voices_data_length against the
expected float count to detect truncation, then compute num_speakers from the
validated float count and set
meta_data_.num_speakers/meta_data_.sample_rate/meta_data_.version/meta_data_.voice/meta_data_.max_token_len
accordingly before constructing styles_, and retain the existing use of
model1_->TensorShape("input_ids") to set max_seq_len_; update error handling to
bail or log if the header is malformed or the blob is too short.

Comment on lines +235 to +239
if (actual_len > max_seq_len_) {
SHERPA_ONNX_LOGE("Input ids length %d exceeds max_seq_len %d, truncating.",
actual_len, max_seq_len_);
input_ids.resize(max_seq_len_);
actual_len = max_seq_len_;
⚠️ Potential issue | 🟠 Major

Don’t silently drop over-length text.

Lines 236-239 resize input_ids to max_seq_len_, so the returned waveform omits the tail of the utterance. The existing ORT Kokoro path fails fast for this case; AXCL should either chunk the sentence or return an error instead of generating partial speech.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 235 -
239, The code currently truncates input_ids when actual_len > max_seq_len_,
producing partial speech; instead, do not resize input_ids—either (A) return a
clear error/status to the caller when actual_len > max_seq_len_ (log via
SHERPA_ONNX_LOGE including actual_len and max_seq_len_ and return an appropriate
failure code/object from the surrounding function), or (B) implement chunking:
split input_ids into contiguous chunks of size max_seq_len_, run the existing
TTS generation routine for each chunk and concatenate the resulting waveforms
before returning; update callers to handle the error/concatenated output.
Reference the symbols in the diff: input_ids, max_seq_len_, actual_len (and the
enclosing function in offline-tts-kokoro-model-axcl.cc) when making the change.

Comment on lines +279 to +282
model1_->SetInputTensorData("input_ids", input_ids_i32.data(),
input_ids_i32.size());
model1_->SetInputTensorData("ref_s", ref_s.data(), ref_s.size());
model1_->SetInputTensorData("text_mask", text_mask_u8.data(), text_mask_u8.size());
⚠️ Potential issue | 🟠 Major

Abort when an AXCL input upload fails.

Each SetInputTensorData() call here can return false on a name or size mismatch, but the result is ignored and inference still continues. That can leave device buffers partially updated from a previous utterance and produce corrupted audio.

Minimal fix sketch
-    model1_->SetInputTensorData("input_ids", input_ids_i32.data(),
-                                input_ids_i32.size());
-    model1_->SetInputTensorData("ref_s", ref_s.data(), ref_s.size());
-    model1_->SetInputTensorData("text_mask", text_mask_u8.data(), text_mask_u8.size()); 
+    if (!model1_->SetInputTensorData("input_ids", input_ids_i32.data(),
+                                     input_ids_i32.size()) ||
+        !model1_->SetInputTensorData("ref_s", ref_s.data(), ref_s.size()) ||
+        !model1_->SetInputTensorData("text_mask", text_mask_u8.data(),
+                                     text_mask_u8.size())) {
+      return false;
+    }

Please apply the same short-circuit to the model2_ and model3_ uploads on Lines 312-316 and Lines 344-348.

Also applies to: 312-316, 344-348

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-model-axcl.cc` around lines 279 -
282, The SetInputTensorData calls (e.g.,
model1_->SetInputTensorData("input_ids", ...),
model1_->SetInputTensorData("ref_s", ...),
model1_->SetInputTensorData("text_mask", ...)) can return false on name/size
mismatch and must be checked; modify these blocks to short-circuit and abort
inference (return false / propagate error) if any call returns false, and apply
the identical check-and-abort pattern to the corresponding upload calls on
model2_ and model3_ (the groups at lines handling
model2_->SetInputTensorData(...) and model3_->SetInputTensorData(...)) so a
failed upload does not leave device buffers partially updated. Ensure you return
or propagate the error from the current function immediately when any
SetInputTensorData returns false.

@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Mar 9, 2026
@coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
sherpa-onnx/csrc/offline-tts-impl.cc (1)

92-110: ⚠️ Potential issue | 🟠 Major

Mirror the AXCL provider dispatch in the manager overload.

Create(const OfflineTtsConfig&) now special-cases provider == "axcl", but Create(Manager *, const OfflineTtsConfig &) still ignores config.model.provider. On AAssetManager/NativeResourceManager builds, an AXCL Kokoro config will fall through to OfflineTtsKokoroImpl instead of the new AXCL path, or continue on platforms where AXCL should be rejected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/offline-tts-impl.cc` around lines 92 - 110, The Manager
overload of OfflineTtsImpl::Create currently ignores config.model.provider so
AXCL configs misroute; update OfflineTtsImpl::Create(Manager*, const
OfflineTtsConfig&) to mirror the provider == "axcl" branch from the other
overload: check config.model.provider == "axcl" early and return the
AXCL-specific implementation (e.g., std::make_unique<OfflineTtsAxclImpl>(mgr,
config)) or otherwise reproduce the same accept/reject logic used by
Create(const OfflineTtsConfig&), ensuring the new AXCL path is taken
consistently for AAssetManager/NativeResourceManager builds.
🧹 Nitpick comments (1)
sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h (1)

253-272: Collapse the fake batching path or implement real batching.

This method logs that max_num_sentences is ignored, hard-codes batch_size = 1, and Process() still flattens every token list into a single sequence. The remainder block at Lines 313-331 is dead today, and the current API shape makes it look like multi-sentence batches are supported when they are not.

Also applies to: 294-331, 389-418

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 253 - 272,
The batching code is fake: config_.max_num_sentences is logged as ignored but
batch_x, batch_size and Process() still assume/flatten a single sequence; either
remove the dormant batching path or implement real batching. Fix by removing the
misleading log and all batch_x/batch_size dead logic if you do not support
multi-sentence batching, or implement real batching: honor
config_.max_num_sentences when computing batch_size, split the input token
vectors into multiple batches (populate batch_x accordingly), update Process()
to accept and iterate over these batches instead of flattening every token list,
and ensure any downstream code expecting single-sequence input is adjusted;
touch symbols config_.max_num_sentences, batch_x, batch_size and Process() (and
the surrounding blocks noted) so behavior and API match (either single-sentence
only or true multi-sentence batching).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29f8db44-6f5c-4191-b9aa-8624a07c4575

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa35ef and 0463121.

📒 Files selected for processing (2)
  • sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h
  • sherpa-onnx/csrc/offline-tts-impl.cc

Comment on lines +75 to +83
  std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
      fst::FarReader<fst::StdArc>::Open(f));
  for (; !reader->Done(); reader->Next()) {
    std::unique_ptr<fst::StdConstFst> r(
        fst::CastOrConvertToConstFst(reader->GetFst()->Copy()));

    tn_list_.push_back(
        std::make_unique<kaldifst::TextNormalizer>(std::move(r)));
  }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
find . -name "offline-tts-kokoro-axcl-impl.h" -type f



🏁 Script executed:

# Once we find it, let's look at the specific lines (75-83)
cat -n sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h | sed -n '70,90p'



🌐 Web query:

OpenFst FarReader::Open() return value NULL error handling

💡 Result:

fst::FarReader<Arc>::Open(...) is specified to return NULL on error. In practice (per far.h), it returns 0 when the input can’t be recognized/opened as an FAR in any supported format (STTable, STList, or single Fst).[1] The intended pattern is:

std::unique_ptr<fst::FarReader<fst::StdArc>> reader(
    fst::FarReader<fst::StdArc>::Open("in.far"));

if (!reader) {
  // Open failed: file missing/unreadable, wrong/unknown format, etc.
  // (Diagnostics may have been logged to stderr via OpenFst logging.)
  return 1;
}

// After Open succeeds, still check for read/iteration errors:
if (reader->Error()) return 1;  // or check during/after iteration

Key points for error handling:

  • Always null-check the return; OpenFst explicitly treats “functions/methods that return NULL” as non-fatal error signaling even when FLAGS_fst_error_fatal is true.[2]
  • If Open() succeeds, use the reader’s Error() method to detect failures during reading/iteration.[3]
  • Make sure FAR support is actually built/linked (--enable-far, libfstfar), otherwise your program may not be able to read FARs as expected.[4]

Sources:
[1] fst/extensions/far/far.h (FarReader::Open returns NULL when no format matches) (android.googlesource.com)
[2] OpenFst “Advanced Usage” – Error handling (NULL return as error signal; FLAGS_fst_error_fatal) (openfst.org)
[3] OpenFst doxygen – FarReader-derived classes expose Error() (openfst.org)
[4] OpenFst “Extensions” – FAR extension build/link notes (openfst.org)



Guard FarReader::Open() before the loop.

reader is dereferenced immediately without null checking. OpenFst's FarReader<Arc>::Open() returns NULL when the archive cannot be opened (missing file, unrecognized format, etc.). Calling reader->Done() on line 77 will crash with a null pointer dereference instead of surfacing a config error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 75 - 83,
The code calls fst::FarReader<fst::StdArc>::Open(f) and then dereferences reader
(reader->Done()) without checking for null; replace this by guarding the result
of FarReader::Open(f) (the local variable reader) before the for-loop: if Open
returned nullptr, emit a clear error (e.g., process logger/error stream or throw
std::runtime_error) indicating the archive could not be opened and abort
initialization, otherwise iterate with reader->Done(), reader->Next(), use
reader->GetFst() and construct kaldifst::TextNormalizer instances to push into
tn_list_ as before.

Comment on lines +163 to +175
  if (num_speakers == 0 && sid != 0) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%{public}d. sid is ignored",
        static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%d. sid is ignored",
        static_cast<int32_t>(sid));
#endif
  }

⚠️ Potential issue | 🟠 Major

Reset sid in the single-speaker branch too.

The log says non-zero sid is ignored, but this branch never normalizes it. Since the base API allows single-speaker models to report num_speakers == 0, you can still pass an out-of-contract speaker id into Process().

Suggested fix
     if (num_speakers == 0 && sid != 0) {
 #if __OHOS__
       SHERPA_ONNX_LOGE(
           "This is a single-speaker model and supports only sid 0. Given sid: "
           "%{public}d. sid is ignored",
           static_cast<int32_t>(sid));
 #else
       SHERPA_ONNX_LOGE(
           "This is a single-speaker model and supports only sid 0. Given sid: "
           "%d. sid is ignored",
           static_cast<int32_t>(sid));
 #endif
+      sid = 0;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

  if (num_speakers == 0 && sid != 0) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%{public}d. sid is ignored",
        static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%d. sid is ignored",
        static_cast<int32_t>(sid));
#endif
  }

After:

  if (num_speakers == 0 && sid != 0) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%{public}d. sid is ignored",
        static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%d. sid is ignored",
        static_cast<int32_t>(sid));
#endif
    sid = 0;
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h` around lines 163 - 175,
The branch that checks if (num_speakers == 0 && sid != 0) logs that a non-zero
sid is ignored but never actually normalizes it; update that branch in
offline-tts-kokoro-axcl-impl.h to set sid = 0 after logging so callers (and
subsequent code like Process()) can't use an out-of-contract speaker id; locate
the check using the symbols num_speakers and sid and add the assignment to reset
sid to zero inside that conditional (preserve the existing platform-specific
logging).

@@ -0,0 +1,450 @@
// sherpa-onnx/csrc/axcl/offline-tts-kokoro-axcl-impl.h
//
// Copyright (c) 2025 Xiaomi Corporation
Collaborator

Please update it with your info.

Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeDefault);

std::array<int64_t, 2> x_shape = {1, static_cast<int32_t>(x.size())};
Ort::Value x_tensor = Ort::Value::CreateTensor(
Collaborator

Please remove everything related to onnxruntime.
