
Add Zipvoice #2487

Merged
pkufool merged 11 commits into k2-fsa:master from pkufool:zipvoice
Aug 27, 2025

Conversation

Contributor

@pkufool pkufool commented Aug 13, 2025

Summary by CodeRabbit

  • New Features

    • Added offline zero-shot TTS via ZipVoice with prompt audio/text support and multilingual tokenization (English/Chinese with pinyin/espeak).
    • New CLI tool for zero-shot TTS.
  • Python API

    • Added OfflineTtsZipvoiceModelConfig and generate() overload accepting prompt audio.
    • New example: offline-zeroshot-tts.py.
  • Improvements

    • Windows UTF‑8 handling for generation.
    • Vocoder selection supports ZipVoice.
    • Added UTF/CJK text utilities.
  • Chores

    • Integrated cppinyin into the build.
  • Tests

    • Added ZipVoice frontend and text-utils tests.
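The "UTF/CJK text utilities" above can be pictured with a short self-contained sketch. The helper names (DecodeUtf8, ContainsCjk) and the exact code-point ranges below are illustrative assumptions, not the PR's actual text-utils API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal UTF-8 decoder; assumes well-formed input (illustrative only).
static std::vector<uint32_t> DecodeUtf8(const std::string &s) {
  std::vector<uint32_t> out;
  for (size_t i = 0; i < s.size();) {
    uint8_t b = static_cast<uint8_t>(s[i]);
    uint32_t cp = 0;
    int n = 1;
    if (b < 0x80) {                 // 0xxxxxxx: ASCII
      cp = b;
    } else if ((b >> 5) == 0x6) {   // 110xxxxx: 2-byte sequence
      cp = b & 0x1F; n = 2;
    } else if ((b >> 4) == 0xE) {   // 1110xxxx: 3-byte sequence
      cp = b & 0x0F; n = 3;
    } else {                        // 11110xxx: 4-byte sequence
      cp = b & 0x07; n = 4;
    }
    for (int k = 1; k < n; ++k) {
      cp = (cp << 6) | (static_cast<uint8_t>(s[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += n;
  }
  return out;
}

// Hypothetical CJK check over the Unified Ideographs blocks.
static bool ContainsCjk(const std::string &s) {
  for (uint32_t cp : DecodeUtf8(s)) {
    if ((cp >= 0x4E00 && cp <= 0x9FFF) ||  // CJK Unified Ideographs
        (cp >= 0x3400 && cp <= 0x4DBF)) {  // Extension A
      return true;
    }
  }
  return false;
}
```

A check of this shape is what lets a multilingual frontend route Chinese runs to pinyin and other text to espeak.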

@pkufool pkufool marked this pull request as draft August 13, 2025 01:29

coderabbitai Bot commented Aug 15, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds zero-shot ZipVoice TTS: new ZipVoice frontend, dual-ONNX ZipVoice model and impl, vocoder selection, ZipVoice config/metadata and bindings, CLI and Python examples, text utilities, tests, and a CMake module to fetch cppinyin.

Changes

Cohort / File(s) Summary
Build & deps
CMakeLists.txt, cmake/cppinyin.cmake, sherpa-onnx/csrc/CMakeLists.txt, sherpa-onnx/python/csrc/CMakeLists.txt
Adds new CMake module cmake/cppinyin.cmake; includes it from top-level CMake; links cppinyin_core into sherpa-onnx-core; adds ZipVoice sources/tests and adds zipvoice model-config to Python build.
Core TTS API & factory
sherpa-onnx/csrc/offline-tts.h, .../offline-tts.cc, .../offline-tts-impl.h, .../offline-tts-impl.cc
Adds zero-shot Generate overload (prompt_text/prompt_samples/sample_rate/num_steps); implements Windows UTF‑8 handling; changes base OfflineTtsImpl methods from pure virtual to runtime-throwing stubs; factory creates ZipVoice impl when configured.
ZipVoice frontend
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h, .../offline-tts-zipvoice-frontend.cc, .../offline-tts-zipvoice-frontend-test.cc
New ZipVoice frontend: token/pinyin/espeak tokenization, token map loading, ConvertTextToTokenIds API, platform manager constructors, and a unit test (skips if data missing).
ZipVoice model & impl
sherpa-onnx/csrc/offline-tts-zipvoice-model.h, .../offline-tts-zipvoice-model.cc, .../offline-tts-zipvoice-impl.h, .../offline-tts-zipvoice-model-meta-data.h
New OfflineTtsZipvoiceModel (dual ONNX sessions: text-encoder + flow-matching), metadata struct, Run pipeline; new OfflineTtsZipvoiceImpl wiring frontend, mel computation, model run, and vocoder synthesis.
Configs & bindings
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h, .../offline-tts-zipvoice-model-config.cc, .../offline-tts-model-config.h, .../offline-tts-model-config.cc, python/csrc/offline-tts-zipvoice-model-config.h, python/csrc/offline-tts-zipvoice-model-config.cc, python/csrc/offline-tts-model-config.cc, python/sherpa_onnx/__init__.py
Adds OfflineTtsZipvoiceModelConfig + Register/Validate/ToString; integrates zipvoice into aggregate OfflineTtsModelConfig; exposes ZipVoice config to Python and adds zipvoice field to Python OfflineTtsModelConfig; exports symbol in package.
Vocoder selection
sherpa-onnx/csrc/vocoder.cc, sherpa-onnx/csrc/vocos-vocoder.cc
Changes vocoder loading to prefer matcha.vocoder then zipvoice.vocoder; adds logging and error path (exit or nullptr for templated manager variant).
CLI & examples
sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc, python-api-examples/offline-zeroshot-tts.py, python-api-examples/offline-tts.py
Adds new zero-shot C++ CLI binary; adds Python zero-shot example; minor formatting changes in existing example.
Text utils & tests
sherpa-onnx/csrc/text-utils.h, sherpa-onnx/csrc/text-utils.cc, sherpa-onnx/csrc/text-utils-test.cc
Adds Utf8<->Utf32 conversions, ASCII case helpers, CJK detection, StringToBool declaration/impl; adds a unit test for invalid-UTF8 handling.
Minor formatting
sherpa-onnx/csrc/offline-tts-matcha-impl.h
Whitespace/formatting normalization only.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User (CLI/Python)
  participant API as OfflineTts
  participant Impl as OfflineTtsZipvoiceImpl
  participant FE as ZipVoice Frontend
  participant ZM as ZipVoice Model (text + flow)
  participant VOC as Vocoder
  participant IO as File I/O

  User->>API: Generate(text, prompt_text, prompt_samples, sr, speed, num_steps, cb)
  API->>Impl: forward call
  Impl->>FE: ConvertTextToTokenIds(text, voice)
  FE->>IO: load tokens/pinyin/espeak data (once)
  FE-->>Impl: tokens, prompt_tokens
  Impl->>Impl: ComputeMelSpectrogram(prompt_samples, sr)
  Impl->>ZM: Run(tokens, prompt_tokens, prompt_features, speed, num_steps)
  ZM-->>Impl: mel frames
  Impl->>VOC: Synthesize(mel)
  VOC-->>Impl: waveform
  Impl-->>API: GeneratedAudio
  API-->>User: WAV + metadata
  rect rgba(200,255,200,0.12)
    note over FE,ZM: New ZipVoice zero-shot flow (pinyin/espeak + dual-model loop)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • csukuangfj

Poem

A rabbit hums in code and cheer,
Pinyin hops and espeak near.
ZipVoice spins its melty song,
Vocoder rounds the waveform strong.
Off it hops — a WAV appears! 🥕🎶


@pkufool pkufool marked this pull request as ready for review August 21, 2025 12:05
@pkufool pkufool requested a review from Copilot August 21, 2025 12:06

Copilot AI left a comment


Pull Request Overview

This PR adds support for ZipVoice zero-shot text-to-speech (TTS) models to sherpa-onnx. ZipVoice is a flow-matching-based TTS model that can generate speech with voice cloning capabilities using a prompt audio sample and its transcription.

Key changes include:

  • Implementation of ZipVoice model architecture with text encoder and flow-matching decoder
  • New ZipVoice frontend for text processing with support for Chinese pinyin and multilingual text
  • Addition of zero-shot TTS interface that accepts prompt audio and text for voice cloning
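For orientation: "flow matching" here means the decoder produces mel frames by numerically integrating a learned velocity field from noise (t = 0) toward data (t = 1), and the num_steps option controls how many integration steps are taken. The toy Euler loop below is my own simplification, not the PR's code; in the real model the velocity would be the ONNX flow-matching network conditioned on text tokens and prompt mel features:

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Toy Euler sampler for the ODE dx/dt = v(x, t) over t in [0, 1].
static std::vector<float> EulerSample(
    std::vector<float> x,  // initial state, e.g. Gaussian noise
    const std::function<std::vector<float>(const std::vector<float> &, float)>
        &velocity,
    int num_steps) {
  const float dt = 1.0f / static_cast<float>(num_steps);
  for (int i = 0; i < num_steps; ++i) {
    const float t = static_cast<float>(i) * dt;
    std::vector<float> v = velocity(x, t);
    for (size_t j = 0; j < x.size(); ++j) {
      x[j] += dt * v[j];  // one Euler step toward the data distribution
    }
  }
  return x;
}
```

More steps trade latency for fidelity, which is why such samplers expose the step count as a user-facing flag.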

Reviewed Changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sherpa-onnx/python/sherpa_onnx/__init__.py Add ZipVoice model config import
sherpa-onnx/python/csrc/offline-tts.cc Add Python bindings for zero-shot TTS generation
sherpa-onnx/csrc/offline-tts-zipvoice-* Core ZipVoice model implementation and configuration
sherpa-onnx/csrc/text-utils.* Add UTF-8/UTF-32 conversion and CJK text detection utilities
sherpa-onnx/csrc/offline-tts.* Add zero-shot TTS interface to main TTS class
sherpa-onnx/csrc/vocoder.cc Update vocoder to support ZipVoice models
python-api-examples/offline-zeroshot-tts.py Python example for zero-shot TTS usage
cmake/cppinyin.cmake Add cppinyin dependency for Chinese text processing


Comment thread sherpa-onnx/csrc/text-utils.cc Outdated

bool StringToBool(const std::string &s) {
  std::string lower = s;
  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);

Copilot AI Aug 21, 2025


Creating a copy of the string for case conversion is inefficient. Consider using std::transform directly on a pre-allocated string or using the existing ToLowerAscii function.

Suggested change
std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
std::string lower = ToLowerAscii(s);

Copilot uses AI. Check for mistakes.
}
if (debug_) {
  debug_oss << "Tokens and IDs: \n";
  for (int32_t i; i < tokens.size(); i++) {

Copilot AI Aug 21, 2025


Variable 'i' is declared but not initialized in the for loop. This should be 'int32_t i = 0'.

Suggested change
for (int32_t i; i < tokens.size(); i++) {
for (int32_t i = 0; i < tokens.size(); i++) {

SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",

Copilot AI Aug 21, 2025


Using SHERPA_ONNX_LOGE for informational messages about vocoder selection is misleading. These should use SHERPA_ONNX_LOGI or similar for info-level logging.

Suggested change
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",

SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(mgr, config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",

Copilot AI Aug 21, 2025


Using SHERPA_ONNX_LOGE for informational messages about vocoder selection is misleading. These should use SHERPA_ONNX_LOGI or similar for info-level logging.

Suggested change
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(mgr, config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",

"Path to zipvoice flow-matching model");
po->Register("zipvoice-vocoder", &vocoder, "Path to zipvoice vocoder");
po->Register("zipvoice-num-steps", &num_steps,
"Number of inference steps for ZipVoice (default: 16)");

Copilot AI Aug 21, 2025


The default value in the constructor is 16, but in the header file the default is also 16, yet in some other files it's referenced as 4. This inconsistency could cause confusion.

Suggested change
"Number of inference steps for ZipVoice (default: 16)");
"Number of inference steps for ZipVoice (default: 4)");

oss << parts[i];
t_lang = types[i];
} else {
if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {

Copilot AI Aug 21, 2025


The logical condition (types[i] != "tag" || types[i] != "pinyin") will always be true since a type cannot be both "tag" and "pinyin" simultaneously. This should likely be && instead of ||.

Suggested change
if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) {

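To make the tautology concrete, here are the flagged predicate and the intended one as standalone functions (a sketch, not the PR's code). For every possible t, the ||-form is true, because t can equal at most one of the two literals:

```cpp
#include <cassert>
#include <string>

// The flagged condition: true for ALL inputs, including "tag" and "pinyin".
static bool OrForm(const std::string &t) {
  return t != "tag" || t != "pinyin";
}

// The intended condition: true only when t is neither "tag" nor "pinyin".
static bool AndForm(const std::string &t) {
  return t != "tag" && t != "pinyin";
}
```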

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cmake/cppinyin.cmake (1)

74-75: Avoid auto-invoking download_cppinyin() in the module

Confirmed that v0.10 of cppinyin defines the cppinyin_core target, so removing the in-module download is safe.

• cmake/cppinyin.cmake (around lines 74–75): remove the unconditional call to download_cppinyin()
• Top-level CMakeLists.txt: invoke download_cppinyin() only when SHERPA_ONNX_ENABLE_TTS is ON

Suggested diff:

--- a/cmake/cppinyin.cmake
+++ b/cmake/cppinyin.cmake
@@ -74,2 +74,1 @@
- download_cppinyin()
sherpa-onnx/python/csrc/offline-tts.cc (1)

68-81: Harden the Python-callback bridge against exceptions

If the Python callback raises, the exception will propagate across the C++ callback boundary and can crash the synthesis pipeline unpredictably. Wrap the call in try/catch and return 0 on failure to signal “stop,” plus optionally log.

Apply this diff inside callback_wrapper:

-                  return callback(array, progress);
+                  try {
+                    return callback(array, progress);
+                  } catch (const py::error_already_set &e) {
+                    // Log and request early stop
+                    py::print("offline_tts.generate callback raised:", e.what());
+                    return 0;
+                  }
🧹 Nitpick comments (41)
sherpa-onnx/csrc/text-utils-test.cc (4)

85-89: Deduplicate redundant tests for the same 0x20 0xC4 case

There are 4 near-identical tests asserting that a space followed by 0xC4 results in a single space. This bloats test runtime and maintenance surface without adding coverage.

You can consolidate these into a single test or a small parameterized test:

-TEST(RemoveInvalidUtf8Sequences, BreakingCase_SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space followed by an invalid byte
-  std::string expected = " ";      // 0xC4 removed
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
-...
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte_Breaking) {
-  std::string input = "\x20\xc4";  // Space followed by invalid `0xc4`
-  std::string expected = " ";      // `0xc4` should be removed, space remains
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
+TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidC4) {
+  for (std::string input : {std::string("\x20\xC4", 2), std::string("\x20\xc4", 2)}) {
+    EXPECT_EQ(RemoveInvalidUtf8Sequences(input), " ");
+  }
+}

Also applies to: 102-106, 108-112, 114-118


133-143: Turn SplitUtf8(SplitZhAndEn) into an assertion-based test (avoid stdout-only tests)

This test only prints tokens and never asserts, so it cannot catch regressions.

Minimal conversion into a real test:

 TEST(SplitUtf8, SplitZhAndEn) {
   std::string text =
       "Hello, 世界! 123. 你好<ha3>, 你好<ha4>! world is beautiful.";
   auto words = SplitUtf8(text);

-  for (auto &word : words) {
-    std::cout << word << " ## ";
-  }
-  std::cout << "\n";
+  // Basic sanity checks without overspecifying tokenization
+  ASSERT_FALSE(words.empty());
+  // Should contain both English and CJK chunks
+  bool has_hello = std::find(words.begin(), words.end(), "Hello") != words.end();
+  bool has_world = std::find(words.begin(), words.end(), "world") != words.end();
+  bool has_cjk   = std::any_of(words.begin(), words.end(),
+                               [](const std::string& w){ return ContainsCJK(w); });
+  EXPECT_TRUE(has_hello || has_world);
+  EXPECT_TRUE(has_cjk);
 }

Note: Add #include <algorithm> at the top of this test file.


165-171: Cast to unsigned char before calling std::isalpha to avoid UB

std::isalpha is undefined for negative char values. Even though you constrain to word.size() == 1, defensive casting is best practice.

-    if (word.size() == 1 && std::isalpha(word[0])) {
+    if (word.size() == 1 &&
+        std::isalpha(static_cast<unsigned char>(word[0]))) {

181-215: Make SplitUtf8(SplitRegex) assertive and avoid console output in tests

This test prints intermediate tokens and “Sentence: …” lines but never asserts on expected segmentation or language tags. It will pass regardless of correctness.

  • Add assertions for a few key expectations (e.g., that pinyin tokens such as <ha3>/<ha4> are tagged as “pinyin”, bracketed [S1]/[S2] as “tag”, and that at least one “zh” and “en” segment exist).
  • Consider extracting the segmentation into a helper and adding focused unit tests for:
    • correct grouping across “other”
    • transitions across “tag”/“pinyin”
    • ContainsCJK-driven classification

If helpful, I can propose a concrete assertion set tailored to the current tokenizer behavior.

sherpa-onnx/csrc/text-utils.cc (2)

727-735: codecvt/wstring_convert is deprecated; consider a future-proof conversion path

Both Utf8ToUtf32/Utf32ToUtf8 rely on std::wstring_convert<std::codecvt_utf8<...>>, deprecated since C++17 and removed in some C++20+ libstdc++/libc++ builds. The file already uses codecvt elsewhere, but it’s worth isolating this behind a helper/adapter to ease future replacement.

  • Introduce a small utf conversion shim (single header) and centralize all conversions there.
  • Consider alternatives: ICU, Boost.Locale, or a tiny embedded UTF-8 library (e.g., utf8cpp) for UTF-8⇄UTF-32.
  • At minimum, add compile-time deprecation suppression guards and a TODO pointing to a planned migration.
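As a concrete shape for such a shim, the encoding direction can be hand-rolled in a few branches with no codecvt dependency. This is a sketch under the assumption of valid Unicode scalar values (no surrogate or range checks), not the project's API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// UTF-32 -> UTF-8, assuming every input is a valid Unicode scalar value.
static std::string EncodeUtf8(const std::vector<uint32_t> &cps) {
  std::string out;
  for (uint32_t cp : cps) {
    if (cp < 0x80) {           // 1 byte: 0xxxxxxx
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {   // 2 bytes: 110xxxxx 10xxxxxx
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                   // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

The decoding direction is symmetrical, so both halves of the proposed shim fit in well under a hundred lines.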

777-790: Tighten StringToBool parsing; avoid accepting partial numeric tokens (e.g., "3abc")

std::istringstream will parse an initial integer prefix and ignore trailing junk, causing "3abc" to return true. Also, leading/trailing whitespace and synonyms (“yes/no”, “on/off”, “y/n”) might be expected.

Consider:

  • Trim whitespace before checks.
  • Accept common synonyms.
  • Require full numeric parse (no trailing characters). Example fix:
 bool StringToBool(const std::string &s) {
-  std::string lower = s;
+  std::string lower = s;
   std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
 
   if (lower == "true" || lower == "1") return true;
   if (lower == "false" || lower == "0") return false;
 
-  // Try istringstream for numbers
-  std::istringstream is(lower);
-  int32_t num;
-  if (is >> num) return num != 0;
+  if (lower == "yes" || lower == "y" || lower == "on") return true;
+  if (lower == "no" || lower == "n" || lower == "off") return false;
+
+  // Strict numeric parse: no trailing characters
+  std::istringstream is(lower);
+  is >> std::ws;
+  int32_t num;
+  if (is >> num) {
+    is >> std::ws;
+    if (is.eof()) return num != 0;
+  }
 
   return false;  // default if not matched
 }
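An alternative to istringstream that rejects trailing junk outright is std::strtol with an end-pointer check. The standalone sketch below is illustrative, not the project's implementation:

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Strict boolean parser: accepts true/false/yes/no/on/off and full integers.
// Rejects partial numeric tokens such as "3abc"; *ok reports parse success.
static bool StrictStringToBool(const std::string &s, bool *ok) {
  std::string lower;
  for (char c : s) {
    lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }

  if (lower == "true" || lower == "yes" || lower == "on") {
    *ok = true;
    return true;
  }
  if (lower == "false" || lower == "no" || lower == "off") {
    *ok = true;
    return false;
  }

  const char *begin = lower.c_str();
  char *end = nullptr;
  long num = std::strtol(begin, &end, 10);
  if (end != begin && *end == '\0') {  // the entire string was numeric
    *ok = true;
    return num != 0;
  }
  *ok = false;  // not a recognized boolean
  return false;
}
```

Reporting failure through *ok (or an std::optional) also gives callers a way to surface bad config values instead of silently defaulting to false.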
sherpa-onnx/csrc/text-utils.h (1)

152-174: Public API additions look good; consider documenting deprecation caveat

The new declarations are coherent and match definitions. Since implementations use wstring_convert/codecvt, consider a brief note in the header (or docs) that these functions may change implementation in future to move away from deprecated codecvt.

python-api-examples/offline-tts.py (1)

456-458: Route error output to stderr

Tiny polish: since this is an error path, print to stderr so it doesn’t get mixed with normal output.

Apply this diff:

@@
-        print(
-            "Error in generating audios. Please read previous error messages."
-        )
+        print(
+            "Error in generating audios. Please read previous error messages.",
+            file=sys.stderr,
+        )

And add the import once near the other imports:

@@
-import argparse
+import argparse
+import sys
cmake/cppinyin.cmake (5)

10-26: Offline tarball probing — add env/cache overrides; quote paths

Allow users/CI to specify tarball via env or cache var; and keep quoting consistent.

@@
   set(possible_file_locations
     $ENV{HOME}/Downloads/cppinyin-0.10.tar.gz
     ${CMAKE_SOURCE_DIR}/cppinyin-0.10.tar.gz
     ${CMAKE_BINARY_DIR}/cppinyin-0.10.tar.gz
     /tmp/cppinyin-0.10.tar.gz
     /star-fj/fangjun/download/github/cppinyin-0.10.tar.gz
   )
+
+  # Allow overrides; insert after set() so they are not clobbered,
+  # and at index 0 so they are probed first
+  if(CPPINYIN_TARBALL)
+    list(INSERT possible_file_locations 0 ${CPPINYIN_TARBALL})
+  endif()
+  if(DEFINED ENV{CPPINYIN_TARBALL})
+    list(INSERT possible_file_locations 0 $ENV{CPPINYIN_TARBALL})
+  endif()
@@
-    if(EXISTS ${f})
-      set(cppinyin_URL  "${f}")
-      file(TO_CMAKE_PATH "${cppinyin_URL}" cppinyin_URL)
+    if(EXISTS "${f}")
+      set(cppinyin_URL  "${f}")
+      file(TO_CMAKE_PATH "${cppinyin_URL}" cppinyin_URL)
       message(STATUS "Found local downloaded cppinyin: ${cppinyin_URL}")
       set(cppinyin_URL2)
       break()
     endif()
   endforeach()

31-38: FetchContent_Declare: guard optional secondary URL

Passing an empty string as a URL argument is harmless in many cases but fragile across CMake versions. Guard it.

-  FetchContent_Declare(cppinyin
-    URL
-      ${cppinyin_URL}
-      ${cppinyin_URL2}
-    URL_HASH
-      ${cppinyin_HASH}
-  )
+  if(cppinyin_URL2)
+    FetchContent_Declare(cppinyin
+      URL ${cppinyin_URL} ${cppinyin_URL2}
+      URL_HASH ${cppinyin_HASH})
+  else()
+    FetchContent_Declare(cppinyin
+      URL ${cppinyin_URL}
+      URL_HASH ${cppinyin_HASH})
+  endif()

46-61: Temporarily toggling BUILD_SHARED_LIBS — acceptable; add a safety restore

This pattern is OK, but make the restore unconditional to avoid surprises if control flow changes later.

-  if(_build_shared_libs_bak)
-    set_target_properties(cppinyin_core
-      PROPERTIES
-        POSITION_INDEPENDENT_CODE ON
-        C_VISIBILITY_PRESET hidden
-        CXX_VISIBILITY_PRESET hidden
-    )
-    set(BUILD_SHARED_LIBS ON)
-  endif()
+  if(_build_shared_libs_bak)
+    set_target_properties(cppinyin_core
+      PROPERTIES
+        POSITION_INDEPENDENT_CODE ON
+        C_VISIBILITY_PRESET hidden
+        CXX_VISIBILITY_PRESET hidden)
+  endif()
+  # Always restore to original value
+  if(DEFINED _build_shared_libs_bak)
+    set(BUILD_SHARED_LIBS ${_build_shared_libs_bak})
+    unset(_build_shared_libs_bak)
+  endif()

63-70: Target interface includes — prefer INTERFACE for consumer-only exposure

Since we don’t build sources here, exposing includes as INTERFACE makes intent clearer.

-  target_include_directories(cppinyin_core
-    PUBLIC
-      ${cppinyin_SOURCE_DIR}/
-  )
+  target_include_directories(cppinyin_core
+    INTERFACE
+      ${cppinyin_SOURCE_DIR}/)

4-7: cppinyin v0.10 checksum verified; consider optional override support

  • SHA256 for v0.10 tarball (abe6584…19982) has been confirmed to match the release archive.
  • Optional: add a CMake cache variable so downstream users can supply their own mirror, path, or version without patching the script.

Suggested diff to introduce a CPPINYIN_TARBALL override:

 function(download_cppinyin)
   include(FetchContent)

-  set(cppinyin_URL "https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
-  set(cppinyin_URL2 "https://gh-proxy.com/https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
+  set(CPPINYIN_TARBALL "" CACHE PATH
+      "Optional path or URL to a pre-downloaded cppinyin v0.10 tarball")
+  set(cppinyin_URL "https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
+  set(cppinyin_URL2 "https://gh-proxy.com/https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")

   set(cppinyin_HASH "SHA256=abe6584d7ee56829e8f4b5fbda3b50ecdf49a13be8e413a78d1b0d5d5c019982")
CMakeLists.txt (1)

369-369: Gate cppinyin fetch behind SHERPA_ONNX_ENABLE_TTS and call the function explicitly

Including cppinyin unconditionally triggers a fetch even for ASR-only builds. Move it into the TTS block and call download_cppinyin() there (remove the self-call inside the module per its review).

Patch:

@@
-include(cppinyin)
 include(simple-sentencepiece)
@@
 if(SHERPA_ONNX_ENABLE_TTS)
   message(STATUS "TTS is enabled")
   add_definitions(-DSHERPA_ONNX_ENABLE_TTS=1)
+  include(cppinyin)
+  download_cppinyin()
 else()
sherpa-onnx/python/sherpa_onnx/__init__.py (1)

52-52: Re-export is fine; silence Ruff F401 for init or define all.

The new symbol is intentionally re-exported from the extension module, which triggers Ruff F401 in this file. Either add a per-file directive or define all to make the intent explicit.

Apply one of the following minimal changes (preferred: directive):

Option A — add a per-file Ruff directive at the top of this file:

+# ruff: noqa: F401  # re-exports from extension module are public API
 from _sherpa_onnx import (

Option B — explicitly export via all (trim list as needed):

+__all__ = [
+    # ... keep existing, add new:
+    "OfflineTtsZipvoiceModelConfig",
+]
sherpa-onnx/csrc/offline-tts-model-config.cc (1)

42-44: Broaden ZipVoice selection sentinel to catch partial configs.

Using only flow_matching_model as the sentinel may miss cases where text_model is set but flow_matching_model is accidentally omitted, leading to a generic “provide exactly one tts model” error instead of zipvoice.Validate()’s targeted diagnostics. Recommend triggering ZipVoice validation if either core path is set.

Apply this diff:

-  if (!zipvoice.flow_matching_model.empty()) {
+  if (!zipvoice.flow_matching_model.empty() || !zipvoice.text_model.empty()) {
     return zipvoice.Validate();
   }

If docs/CLI examples instruct users to set text_model first, this change prevents confusing errors. Please verify other places in the codebase (parsers, examples) align with this sentinel.

sherpa-onnx/csrc/vocos-vocoder.cc (4)

45-55: Use SHERPA_ONNX_EXIT and guard against empty buffer before Init.

  • Prefer SHERPA_ONNX_EXIT for consistency with other error paths and platform-specific logging.
  • If ReadFile fails and returns an empty buffer, Init with length 0 will produce a less-informative runtime error. Fail fast with a clear message.

Apply this diff:

-    std::vector<char> buffer;
+    std::vector<char> buffer;
     if (!config.matcha.vocoder.empty()) {
       buffer = ReadFile(config.matcha.vocoder);
     } else if (!config.zipvoice.vocoder.empty()) {
       buffer = ReadFile(config.zipvoice.vocoder);
     } else {
-      SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-      exit(-1);
+      SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
+      SHERPA_ONNX_EXIT(-1);
     }
-    Init(buffer.data(), buffer.size());
+    if (buffer.empty()) {
+      SHERPA_ONNX_LOGE("Vocoder file read returned empty buffer.");
+      SHERPA_ONNX_EXIT(-1);
+    }
+    Init(buffer.data(), buffer.size());

63-73: Mirror the same error-handling improvements for Manager-based ctor.

Keep both constructors in sync for consistent behavior across platforms.

Apply this diff:

-    std::vector<char> buffer;
+    std::vector<char> buffer;
     if (!config.matcha.vocoder.empty()) {
       buffer = ReadFile(mgr, config.matcha.vocoder);
     } else if (!config.zipvoice.vocoder.empty()) {
       buffer = ReadFile(mgr, config.zipvoice.vocoder);
     } else {
-      SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-      exit(-1);
+      SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
+      SHERPA_ONNX_EXIT(-1);
     }
-    Init(buffer.data(), buffer.size());
+    if (buffer.empty()) {
+      SHERPA_ONNX_LOGE("Vocoder file read returned empty buffer.");
+      SHERPA_ONNX_EXIT(-1);
+    }
+    Init(buffer.data(), buffer.size());

39-45: Minor: remove unused allocator_ member (and its init) to reduce noise.

allocator_ is never referenced; there is a local allocator in Init(). Safe to drop.

Apply these diffs:

Constructor initializer lists:

-        sess_opts_(GetSessionOptions(config.num_threads, config.provider)),
-        allocator_{} {
+        sess_opts_(GetSessionOptions(config.num_threads, config.provider)) {
-        sess_opts_(GetSessionOptions(config.num_threads, config.provider)),
-        allocator_{} {
+        sess_opts_(GetSessionOptions(config.num_threads, config.provider)) {

Private member:

-  Ort::AllocatorWithDefaultOptions allocator_;

45-55: Avoid duplicated selection logic (optional).

Both ctors duplicate model selection. Consider extracting a small helper (or lambda) to choose and read the vocoder once, reducing drift risk.

Also applies to: 63-73

sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (3)

8-10: Remove unused include to keep headers lean

std::string is not used in this header. Dropping it reduces compile surface area.

-#include <cstdint>
-#include <string>
+#include <cstdint>

13-15: Tighten up the comment wording

Minor grammar polish for clarity.

-// If you are not sure what each field means, please
-//
-// have a look of the Python file in the model directory that
-//
-// you have downloaded.
+// If you are not sure what each field means, please
+// look at the Python file in the model directory that
+// you have downloaded.

24-26: Use booleans for boolean flags

use_espeak and use_pinyin are consumed as boolean toggles elsewhere. Using bool communicates intent and prevents accidental non-0/1 assignments.

Please confirm no ABI/serialization expectations require these to be int32_t.

-  int32_t use_espeak = 1;
-  int32_t use_pinyin = 1;
+  bool use_espeak = true;
+  bool use_pinyin = true;

As a follow-up, usages such as if (meta_data_.use_pinyin) and the designated initializers in tests already align with bool.

sherpa-onnx/csrc/offline-tts.cc (1)

199-237: More robust conversion fallback and log scoping (optional)

If GB2312 conversion fails (returns empty), the code will forward an empty string. Consider guarding conversions and scoping logs per-argument.

  • Only replace with the converted string if conversion produced non-empty.
  • Keep separate “printed” flags for text and prompt_text, or log unconditionally at debug level.
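The guard can be sketched in Python (the actual code path uses the C++ helpers IsGB2312/Gb2312ToUtf8; the codec names here are just Python's closest equivalents):

```python
def convert_or_keep(raw: bytes) -> str:
    """Decode GB2312 input, keeping a UTF-8 fallback when the
    conversion fails or yields nothing, so an empty string is
    never forwarded downstream."""
    try:
        converted = raw.decode("gb2312")
    except (UnicodeDecodeError, LookupError):
        converted = ""
    if converted:
        return converted
    # Fall back rather than forwarding an empty result.
    return raw.decode("utf-8", errors="replace")
```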
sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (3)

18-22: Use GTEST_SKIP() instead of return for missing test assets

Returning early marks the test as passed, hiding skipped status in CI. Prefer GTEST_SKIP() to make intent explicit and visible.

-  if (!FileExists(data_dir + "/en_dict")) {
-    SHERPA_ONNX_LOGE("%s/en_dict does not exist. Skipping test",
-                     data_dir.c_str());
-    return;
-  }
+  if (!FileExists(data_dir + "/en_dict")) {
+    SHERPA_ONNX_LOGE("%s/en_dict does not exist. Skipping test",
+                     data_dir.c_str());
+    GTEST_SKIP();
+  }

-  if (!FileExists(data_dir + "/phontab")) {
+  if (!FileExists(data_dir + "/phontab")) {
     ...
-    return;
+    GTEST_SKIP();
   }

-  if (!FileExists(data_dir + "/phonindex")) {
+  if (!FileExists(data_dir + "/phonindex")) {
     ...
-    return;
+    GTEST_SKIP();
   }

-  if (!FileExists(data_dir + "/phondata")) {
+  if (!FileExists(data_dir + "/phondata")) {
     ...
-    return;
+    GTEST_SKIP();
   }

-  if (!FileExists(data_dir + "/intonations")) {
+  if (!FileExists(data_dir + "/intonations")) {
     ...
-    return;
+    GTEST_SKIP();
   }

Also applies to: 24-28, 30-34, 36-40, 42-46


48-58: Skip with GTEST when dictionary/token files are absent

Same rationale as above for pinyin_dict and tokens_file.

-  if (!FileExists(pinyin_dict)) {
-    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", pinyin_dict.c_str());
-    return;
-  }
+  if (!FileExists(pinyin_dict)) {
+    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", pinyin_dict.c_str());
+    GTEST_SKIP();
+  }
...
-  if (!FileExists(tokens_file)) {
-    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", tokens_file.c_str());
-    return;
-  }
+  if (!FileExists(tokens_file)) {
+    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", tokens_file.c_str());
+    GTEST_SKIP();
+  }

65-76: Add minimal assertions so the test validates behavior

Currently the test only exercises code. Add basic sanity checks to ensure tokenization produced content.

   std::string text = "how are you doing?";
   std::vector<sherpa_onnx::TokenIDs> ans =
       frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());

   text = "这是第一句。这是第二句。";
   ans = frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());

   text =
       "这是第一句。这是第二句。<pin1><yin2>测试 [S1]and hello "
       "world[S2]这是第三句。";
   ans = frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());
sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)

62-66: Clarify flag descriptions

Small wording improvements for clarity and accuracy.

-  po.Register("prompt-text", &prompt_text, "The transcribe of prompt_samples.");
+  po.Register("prompt-text", &prompt_text,
+              "The transcription of the prompt audio.");
 
   po.Register("prompt-audio", &prompt_audio,
-              "The prompt audio file, single channel pcm. ");
+              "Path to the prompt audio file (mono, 16-bit PCM WAV).");
sherpa-onnx/csrc/offline-tts-impl.cc (1)

45-48: Strengthen ZipVoice selection guard (also require tokens)

Create() selects ZipVoice when text_model and flow_matching_model are set. If tokens is missing, frontend initialization will later fail. Either rely on Validate() being called before Create(), or add a lightweight guard on tokens here to fail fast.

Apply this diff to both overloads:

-} else if (!config.model.zipvoice.text_model.empty() &&
-           !config.model.zipvoice.flow_matching_model.empty()) {
+} else if (!config.model.zipvoice.text_model.empty() &&
+           !config.model.zipvoice.flow_matching_model.empty() &&
+           !config.model.zipvoice.tokens.empty()) {
     return std::make_unique<OfflineTtsZipvoiceImpl>(config);
-} else if (!config.model.zipvoice.text_model.empty() &&
-           !config.model.zipvoice.flow_matching_model.empty()) {
+} else if (!config.model.zipvoice.text_model.empty() &&
+           !config.model.zipvoice.flow_matching_model.empty() &&
+           !config.model.zipvoice.tokens.empty()) {
     return std::make_unique<OfflineTtsZipvoiceImpl>(mgr, config);

Confirm that OfflineTtsModelConfig::Validate() enforces tokens presence for ZipVoice so we don’t duplicate checks.

Also applies to: 67-69

sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)

26-31: Fix tensor shape doc to match implementation

Impl returns a tensor shaped {batch_size, keep_frames, feat_dim}, where feat_dim == meta_data_.feat_dim. The comment currently states (batch_size, mel_dim, num_frames), which is inverted and uses different naming.

Apply this diff to correct the documentation:

-  // Return a float32 tensor containing the mel
-  // of shape (batch_size, mel_dim, num_frames)
+  // Return a float32 mel-spectrogram-like tensor
+  // of shape (batch_size, num_frames, feat_dim).
+  // Note: feat_dim == meta_data_.feat_dim; num_frames excludes the prompt frames.
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)

49-57: Consider making punct_map_ static to avoid per-instance duplication

punct_map_ is immutable. Making it a function-local static (or an inline static const member) avoids re-creating the map per instance and reduces binary size. (constexpr is not an option here, since std::unordered_map has no constexpr constructor.)

Example:

-  const std::unordered_map<std::string, std::string> punct_map_ = {
+  inline static const std::unordered_map<std::string, std::string> punct_map_ = {
       {",", ","}, {"。", "."}, {"!", "!"},  {"?", "?"},     {";", ";"},
       {":", ":"}, {"、", ","}, {"‘", "'"},   {"“", "\""},     {"”", "\""},
       {"’", "'"},  {"⋯", "…"},  {"···", "…"}, {"・・・", "…"}, {"...", "…"}};
sherpa-onnx/python/csrc/offline-tts.cc (1)

89-122: New overload looks solid; consider clarifying/optimizing argument passing

  • API shape and defaults (speed=1.0, num_steps=4) match the C++ side. The GIL is released correctly.
  • Minor: passing prompt_samples as std::vector<float> causes an extra copy from NumPy. If you want to micro-opt, accept py::array_t<float> and materialize a std::vector<float> only once just before calling into C++ (optional).

Example alternative signature (for future consideration):

-             const std::vector<float> &prompt_samples, int32_t sample_rate,
+             const py::array_t<float> &prompt_samples, int32_t sample_rate,

and then convert to std::vector once before invoking self.Generate.

python-api-examples/offline-zeroshot-tts.py (3)

178-182: Use store_true for boolean flags

Argparse with type=bool is brittle (e.g., --debug false still parses as True). Prefer action="store_true".

-    parser.add_argument(
-        "--debug",
-        type=bool,
-        default=False,
-        help="True to show debug messages",
-    )
+    parser.add_argument(
+        "--debug",
+        action="store_true",
+        help="Show debug messages",
+    )

121-143: Prefer explicit errors over asserts in read_wave()

Asserts can be stripped and yield confusing failures. Raise ValueError with clear messages so users understand channel/bit-depth requirements.

-    with wave.open(wave_filename) as f:
-        assert f.getnchannels() == 1, f.getnchannels()
-        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
+    with wave.open(wave_filename) as f:
+        if f.getnchannels() != 1:
+            raise ValueError(f"Expected mono WAV. Got {f.getnchannels()} channels")
+        if f.getsampwidth() != 2:
+            raise ValueError(f"Expected 16-bit PCM WAV. Got sample width={f.getsampwidth()} bytes")

257-266: Optional: demonstrate progress callback usage

Since the Python binding supports a progress callback, add an example toggle in the script to showcase streaming writes or early-stop behavior. This keeps the example self-explanatory for advanced users.

Happy to provide a small snippet if you’d like.
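A minimal runnable sketch of the pattern, assuming the callback(samples, progress) -> int shape used by the other TTS examples, where returning 0 requests an early stop (the fake_generate driver is purely illustrative so the snippet runs without sherpa_onnx installed):

```python
def progress_callback(samples, progress):
    # Return 0 to request early stop, any nonzero value to continue.
    print(f"got {len(samples)} samples, progress {progress:.0%}")
    return 1

# With a real engine this would be something like:
#   audio = tts.generate(text, sid=0, speed=1.0, callback=progress_callback)
# fake_generate below just drives the callback so the sketch is runnable.
def fake_generate(chunks, callback):
    out = []
    for i, chunk in enumerate(chunks, 1):
        out.extend(chunk)
        if callback(chunk, i / len(chunks)) == 0:
            break  # caller asked to stop early
    return out
```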

sherpa-onnx/csrc/offline-tts-impl.h (1)

12-12: Include <stdexcept> if throwing exceptions from default methods

If you choose the throwing approach above, add the header.

+#include <stdexcept>
 #include <vector>
 
 #include "sherpa-onnx/csrc/macros.h"
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1)

83-98: Consider extracting required espeak-ng files to a constant.

The list of required espeak-ng files is hardcoded within the validation method. For better maintainability and potential reuse, consider extracting this to a static constant or a separate function.

+  static const std::vector<std::string> kRequiredEspeakFiles = {
+      "phontab",
+      "phonindex",
+      "phondata",
+      "intonations",
+  };
+
   if (!data_dir.empty()) {
-    std::vector<std::string> required_files = {
-        "phontab",
-        "phonindex",
-        "phondata",
-        "intonations",
-    };
-    for (const auto &f : required_files) {
+    for (const auto &f : kRequiredEspeakFiles) {
       if (!FileExists(data_dir + "/" + f)) {
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1)

21-23: Consider clarifying the comment about lexicon behavior.

The comment states "If data_dir is given, lexicon is ignored" but there's no mention of a lexicon field in this struct. This might be confusing for users.

-  // If data_dir is given, lexicon is ignored
-  // data_dir is for piper-phonemize, which uses espeak-ng
+  // data_dir is for piper-phonemize, which uses espeak-ng
+  // When provided, espeak-ng will be used for phonemization
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1)

121-121: Consider making tone parameter configurable.

The tone parameter is hardcoded as "number". Consider making it configurable or documenting why this specific format is used.

sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2)

106-108: Use cryptographically secure random number generation.

Using std::random_device{}() as a seed may not provide sufficient randomness on all platforms. Some implementations return a fixed value if hardware randomness is unavailable.

Consider using a more robust seeding approach:

-    std::default_random_engine rng(std::random_device{}());
+    std::random_device rd;
+    std::seed_seq seed{rd(), rd(), rd(), rd()};
+    std::mt19937 rng(seed);

167-175: Potential integer overflow in memory calculations.

The multiplication batch_size * keep_frames * feat_dim could overflow for large values. While batch_size is checked to be 1, the other values come from the model and user input.

Consider adding overflow checks or using size_t for size calculations:

-    std::vector<float> out_data(batch_size * keep_frames * feat_dim);
+    size_t out_size = static_cast<size_t>(batch_size) * keep_frames * feat_dim;
+    if (out_size > std::numeric_limits<size_t>::max() / sizeof(float)) {
+      SHERPA_ONNX_LOGE("Output size too large");
+      exit(-1);
+    }
+    std::vector<float> out_data(out_size);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 06ae4a7 and e0ddc39.

📒 Files selected for processing (33)
  • CMakeLists.txt (1 hunks)
  • cmake/cppinyin.cmake (1 hunks)
  • python-api-examples/offline-tts.py (3 hunks)
  • python-api-examples/offline-zeroshot-tts.py (1 hunks)
  • sherpa-onnx/csrc/CMakeLists.txt (5 hunks)
  • sherpa-onnx/csrc/offline-tts-impl.cc (3 hunks)
  • sherpa-onnx/csrc/offline-tts-impl.h (2 hunks)
  • sherpa-onnx/csrc/offline-tts-matcha-impl.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-model-config.cc (3 hunks)
  • sherpa-onnx/csrc/offline-tts-model-config.h (3 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts.h (1 hunks)
  • sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1 hunks)
  • sherpa-onnx/csrc/text-utils-test.cc (4 hunks)
  • sherpa-onnx/csrc/text-utils.cc (1 hunks)
  • sherpa-onnx/csrc/text-utils.h (1 hunks)
  • sherpa-onnx/csrc/vocoder.cc (2 hunks)
  • sherpa-onnx/csrc/vocos-vocoder.cc (2 hunks)
  • sherpa-onnx/python/csrc/CMakeLists.txt (1 hunks)
  • sherpa-onnx/python/csrc/offline-tts-model-config.cc (2 hunks)
  • sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
  • sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
  • sherpa-onnx/python/csrc/offline-tts.cc (1 hunks)
  • sherpa-onnx/python/sherpa_onnx/__init__.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T04:23:50.237Z
Learnt from: litongjava
PR: k2-fsa/sherpa-onnx#2440
File: sherpa-onnx/java-api/src/main/java/com/k2fsa/sherpa/onnx/core/Core.java:4-6
Timestamp: 2025-08-06T04:23:50.237Z
Learning: The sherpa-onnx JNI library files are stored in Hugging Face repository at https://huggingface.co/csukuangfj/sherpa-onnx-libs under versioned directories like jni/1.12.7/, and the actual Windows JNI library filename is "sherpa-onnx-jni.dll" as defined in Core.java constants.

Applied to files:

  • sherpa-onnx/csrc/CMakeLists.txt
📚 Learning: 2025-08-06T04:18:47.981Z
Learnt from: litongjava
PR: k2-fsa/sherpa-onnx#2440
File: sherpa-onnx/java-api/src/main/java/com/k2fsa/sherpa/onnx/core/Core.java:4-6
Timestamp: 2025-08-06T04:18:47.981Z
Learning: In sherpa-onnx Java API, the native library names in Core.java (WIN_NATIVE_LIBRARY_NAME = "sherpa-onnx-jni.dll", UNIX_NATIVE_LIBRARY_NAME = "libsherpa-onnx-jni.so", MACOS_NATIVE_LIBRARY_NAME = "libsherpa-onnx-jni.dylib") are copied directly from the compiled binary filenames and should not be changed to match other libraries' naming conventions.

Applied to files:

  • sherpa-onnx/csrc/CMakeLists.txt
🧬 Code graph analysis (22)
sherpa-onnx/csrc/vocos-vocoder.cc (2)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1)
  • config (8-22)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1)
  • config (8-24)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (2)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (4)
  • OfflineTtsZipvoiceFrontend (199-217)
  • OfflineTtsZipvoiceFrontend (220-241)
  • OfflineTtsZipvoiceFrontend (367-370)
  • OfflineTtsZipvoiceFrontend (375-378)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
  • OfflineTtsZipvoiceFrontend (18-57)
python-api-examples/offline-zeroshot-tts.py (2)
python-api-examples/offline-tts.py (2)
  • get_args (322-401)
  • main (404-477)
sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (3)
  • main (19-147)
  • main (19-19)
  • tts (100-100)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (6)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)
  • sherpa_onnx (26-90)
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1)
  • sherpa_onnx (11-28)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1)
  • sherpa_onnx (10-14)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
  • sherpa_onnx (16-59)
sherpa-onnx/csrc/offline-tts-model-config.h (1)
  • sherpa_onnx (17-54)
sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2)
  • tokens (58-180)
  • tokens (58-59)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (2)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1)
  • sherpa_onnx (13-60)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (2)
  • PybindOfflineTtsZipvoiceModelConfig (13-41)
  • PybindOfflineTtsZipvoiceModelConfig (13-13)
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (3)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)
  • sherpa_onnx (26-90)
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)
  • sherpa_onnx (15-39)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
  • sherpa_onnx (16-59)
sherpa-onnx/csrc/vocoder.cc (2)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1)
  • config (8-22)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1)
  • config (8-24)
sherpa-onnx/csrc/offline-tts.cc (1)
sherpa-onnx/csrc/text-utils.cc (6)
  • IsGB2312 (635-655)
  • IsGB2312 (635-635)
  • Gb2312ToUtf8 (658-686)
  • Gb2312ToUtf8 (658-658)
  • IsUtf8 (570-633)
  • IsUtf8 (570-570)
sherpa-onnx/csrc/offline-tts.h (2)
sherpa-onnx/c-api/cxx-api.cc (2)
  • Generate (458-476)
  • Generate (458-461)
sherpa-onnx/csrc/offline-tts.cc (4)
  • Generate (173-197)
  • Generate (173-175)
  • Generate (199-237)
  • Generate (199-203)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (2)
sherpa-onnx/csrc/offline-tts-model-config.cc (6)
  • Register (11-26)
  • Register (11-11)
  • Validate (28-57)
  • Validate (28-28)
  • ToString (59-73)
  • ToString (59-59)
sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2)
  • tokens (58-180)
  • tokens (58-59)
sherpa-onnx/csrc/text-utils.h (1)
sherpa-onnx/csrc/text-utils.cc (16)
  • Utf8ToUtf32 (727-730)
  • Utf8ToUtf32 (727-727)
  • Utf32ToUtf8 (732-735)
  • Utf32ToUtf8 (732-732)
  • ToUpperAscii (739-744)
  • ToUpperAscii (739-739)
  • ToLowerAscii (748-753)
  • ToLowerAscii (748-748)
  • IsCJK (756-761)
  • IsCJK (756-756)
  • ContainsCJK (763-766)
  • ContainsCJK (763-763)
  • ContainsCJK (768-775)
  • ContainsCJK (768-768)
  • StringToBool (777-790)
  • StringToBool (777-777)
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (5)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)
  • sherpa_onnx (26-90)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1)
  • sherpa_onnx (13-60)
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1)
  • sherpa_onnx (11-28)
sherpa-onnx/csrc/offline-tts-model-config.h (1)
  • sherpa_onnx (17-54)
sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (14)
  • OfflineTtsZipvoiceModel (287-289)
  • OfflineTtsZipvoiceModel (292-294)
  • OfflineTtsZipvoiceModel (296-296)
  • OfflineTtsZipvoiceModel (313-314)
  • OfflineTtsZipvoiceModel (318-319)
  • Run (303-310)
  • Run (303-307)
  • tokens (58-180)
  • tokens (58-59)
  • GetMetaData (298-301)
  • GetMetaData (298-299)
  • Impl (33-41)
  • Impl (33-33)
  • Impl (44-52)
sherpa-onnx/csrc/offline-tts-impl.h (3)
harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1)
  • callback (469-496)
scripts/go/sherpa_onnx.go (1)
  • GeneratedAudio (963-968)
sherpa-onnx/csrc/offline-tts.cc (4)
  • SampleRate (239-239)
  • SampleRate (239-239)
  • NumSpeakers (241-241)
  • NumSpeakers (241-241)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (6)
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)
  • OfflineTtsZipvoiceModel (17-37)
sherpa-onnx/csrc/offline-tts-impl.h (1)
  • OfflineTtsImpl (17-55)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
  • OfflineTtsZipvoiceFrontend (18-57)
sherpa-onnx/csrc/vocoder.cc (6)
  • Create (76-103)
  • Create (76-76)
  • Create (106-133)
  • Create (106-107)
  • Create (136-137)
  • Create (141-142)
sherpa-onnx/csrc/vocoder.h (1)
  • Vocoder (17-31)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (4)
  • OfflineTtsZipvoiceFrontend (199-217)
  • OfflineTtsZipvoiceFrontend (220-241)
  • OfflineTtsZipvoiceFrontend (367-370)
  • OfflineTtsZipvoiceFrontend (375-378)
sherpa-onnx/csrc/text-utils-test.cc (1)
sherpa-onnx/csrc/text-utils.cc (13)
  • RemoveInvalidUtf8Sequences (486-568)
  • RemoveInvalidUtf8Sequences (486-487)
  • SplitUtf8 (357-393)
  • SplitUtf8 (357-357)
  • ToWideString (689-694)
  • ToWideString (689-689)
  • i (113-113)
  • ToString (696-701)
  • ToString (696-696)
  • ContainsCJK (763-766)
  • ContainsCJK (763-763)
  • ContainsCJK (768-775)
  • ContainsCJK (768-768)
sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (3)
sherpa-onnx/csrc/ten-vad-model.cc (1)
  • meta_data (257-314)
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)
  • OfflineTtsZipvoiceModel (17-37)
sherpa-onnx/csrc/vocos-vocoder.cc (2)
  • Run (196-198)
  • Run (196-196)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (4)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (4)
  • ToString (139-156)
  • ToString (139-139)
  • Validate (42-137)
  • Validate (42-42)
sherpa-onnx/python/csrc/offline-tts-kokoro-model-config.cc (1)
  • PybindOfflineTtsKokoroModelConfig (11-38)
sherpa-onnx/python/csrc/offline-tts-vits-model-config.cc (1)
  • PybindOfflineTtsVitsModelConfig (11-37)
sherpa-onnx/python/csrc/offline-tts-matcha-model-config.cc (1)
  • PybindOfflineTtsMatchaModelConfig (11-37)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (3)
sherpa-onnx/csrc/text-utils.cc (16)
  • is (785-785)
  • ToWideString (689-694)
  • ToWideString (689-689)
  • ToString (696-701)
  • ToString (696-696)
  • Utf32ToUtf8 (732-735)
  • Utf32ToUtf8 (732-732)
  • ToUpperAscii (739-744)
  • ToUpperAscii (739-739)
  • ToLowerAscii (748-753)
  • ToLowerAscii (748-748)
  • i (113-113)
  • ContainsCJK (763-766)
  • ContainsCJK (763-763)
  • ContainsCJK (768-775)
  • ContainsCJK (768-768)
sherpa-onnx/csrc/piper-phonemize-lexicon.cc (12)
  • is (312-312)
  • is (341-341)
  • is (353-353)
  • is (365-365)
  • ToString (36-59)
  • ToString (36-36)
  • CallPhonemizeEspeak (61-70)
  • CallPhonemizeEspeak (61-63)
  • InitEspeak (277-305)
  • InitEspeak (277-277)
  • ConvertTextToTokenIds (423-436)
  • ConvertTextToTokenIds (423-424)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
  • OfflineTtsZipvoiceFrontend (18-57)
sherpa-onnx/python/csrc/offline-tts-model-config.cc (1)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (2)
  • PybindOfflineTtsZipvoiceModelConfig (13-41)
  • PybindOfflineTtsZipvoiceModelConfig (13-13)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (6)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)
  • sherpa_onnx (26-90)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1)
  • sherpa_onnx (13-60)
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1)
  • sherpa_onnx (11-28)
sherpa-onnx/csrc/text-utils.h (1)
  • sherpa_onnx (25-175)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (6)
  • OfflineTtsZipvoiceFrontend (199-217)
  • OfflineTtsZipvoiceFrontend (220-241)
  • OfflineTtsZipvoiceFrontend (367-370)
  • OfflineTtsZipvoiceFrontend (375-378)
  • ConvertTextToTokenIds (243-364)
  • ConvertTextToTokenIds (243-244)
sherpa-onnx/csrc/offline-tts-frontend.h (1)
  • OfflineTtsFrontend (38-56)
sherpa-onnx/csrc/text-utils.cc (1)
sherpa-onnx/csrc/kokoro-multi-lang-lexicon.cc (14)
  • text (171-180)
  • text (171-171)
  • text (204-259)
  • text (204-205)
  • text (261-273)
  • text (261-262)
  • text (275-403)
  • text (275-276)
  • is (406-406)
  • is (418-435)
  • is (418-418)
  • is (445-445)
  • is (466-506)
  • is (466-466)
sherpa-onnx/python/csrc/offline-tts.cc (4)
harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1)
  • callback (469-496)
android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt (1)
  • callback (98-106)
kotlin-api-examples/test_tts.kt (1)
  • callback (157-163)
android/SherpaOnnxTtsEngine/app/src/main/java/com/k2fsa/sherpa/onnx/tts/engine/MainActivity.kt (1)
  • callback (311-323)
🪛 Ruff (0.12.2)
sherpa-onnx/python/sherpa_onnx/__init__.py

52-52: _sherpa_onnx.OfflineTtsZipvoiceModelConfig imported but unused

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Release shared-OFF tts-OFF
  • GitHub Check: Release shared-ON tts-OFF
  • GitHub Check: Debug shared-ON tts-OFF
  • GitHub Check: ubuntu-24.04 3.13
  • GitHub Check: ubuntu-24.04 3.9
  • GitHub Check: ubuntu-24.04 3.11
  • GitHub Check: ubuntu-24.04 3.8
  • GitHub Check: ubuntu-24.04 3.12
  • GitHub Check: ubuntu-24.04 3.10
  • GitHub Check: ubuntu-latest Debug static tts-OFF
  • GitHub Check: ubuntu-latest Debug shared tts-ON
  • GitHub Check: ubuntu-latest Debug static tts-ON
  • GitHub Check: ubuntu-latest Debug shared tts-OFF
  • GitHub Check: ubuntu-latest Release static tts-ON
  • GitHub Check: ubuntu-latest Release static tts-OFF
  • GitHub Check: ubuntu-latest Release shared tts-OFF
  • GitHub Check: ubuntu-latest Release shared tts-ON
  • GitHub Check: rknn shared ON
  • GitHub Check: rknn shared OFF
  • GitHub Check: swift (macos-13)
🔇 Additional comments (23)
python-api-examples/offline-tts.py (2)

3-3: Copyright year range update — LGTM

Docs-only change; no functional impact.


475-477: RTF print reflow — LGTM

Pure formatting; same behavior.

sherpa-onnx/csrc/offline-tts-matcha-impl.h (1)

143-145: Whitespace-only tidy — LGTM

No behavioral change; thanks for keeping the file clean.

sherpa-onnx/python/csrc/CMakeLists.txt (1)

75-75: Offline TTS ZipVoice model‐config binding verified

  • Found offline-tts-zipvoice-model-config.cc and .h under sherpa-onnx/python/csrc.
  • PybindOfflineTtsZipvoiceModelConfig is defined and invoked in the binding sources.
  • OfflineTtsZipvoiceModelConfig is correctly re-exported in sherpa_onnx/__init__.py.

LGTM!

sherpa-onnx/csrc/offline-tts-model-config.cc (2)

15-16: LGTM: ZipVoice options registered.

Registration ordering alongside vits/matcha/kokoro/kitten looks consistent.


66-66: LGTM: ToString includes ZipVoice section.

Helpful for debugging full model configuration dumps.

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1)

1-16: LGTM: Clean binding declaration and guard.

Header matches existing binding patterns; include path and guard naming are consistent.

sherpa-onnx/csrc/vocoder.cc (1)

77-88: Outdated file reference: no vocoder.cc in the codebase

I couldn’t find any file at sherpa-onnx/csrc/vocoder.cc, nor any occurrences of
config.matcha.vocoder or config.zipvoice.vocoder in the repository.
The only “vocoder”‐related implementation is in hifigan-vocoder.cc, which doesn’t contain that snippet.

Please update or remove this review comment, since it refers to a file and code that no longer exist.

Likely an incorrect or invalid review comment.

sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)

102-111: No additional sample-rate guard needed

The ZipVoice backend already handles prompt audio sample-rate mismatches by resampling internally in ComputeMelSpectrogram (using LinearResample when sample_rate != meta.sample_rate in offline-tts-zipvoice-impl.h around lines 123–138). Consequently, the wrapper in sherpa-onnx-offline-zeroshot-tts.cc does not require an explicit check or manual resampling.

sherpa-onnx/csrc/CMakeLists.txt (4)

204-206: ZipVoice sources wired into TTS build path — looks good

New ZipVoice sources are properly gated under SHERPA_ONNX_ENABLE_TTS and appended to sherpa-onnx-core sources. No issues spotted here.


355-356: New binary sherpa-onnx-offline-zeroshot-tts integrated correctly

Executable is added under the TTS gate and appended to main_exes for uniform linking, rpath setup, and installation. This matches existing patterns for other binaries.

Also applies to: 377-379


652-656: Good addition of ZipVoice frontend test to the test suite

The new test is correctly appended under SHERPA_ONNX_ENABLE_TTS and will be compiled and registered via sherpa_onnx_add_test. Looks consistent with existing tests.


270-275: Unconditional linking of cppinyin_core is safe; no guard needed

The root CMakeLists.txt always includes the cppinyin module (around line 369) regardless of the SHERPA_ONNX_ENABLE_TTS setting. That module’s download_cppinyin() function unconditionally defines and populates the cppinyin_core target, so it’s always available for linking—TTS being disabled does not remove or skip its inclusion. You can safely leave the existing target_link_libraries(sherpa-onnx-core cppinyin_core …) as is.

Likely an incorrect or invalid review comment.

sherpa-onnx/csrc/offline-tts-model-config.h (1)

14-14: ZipVoice added to OfflineTtsModelConfig — ensure Register/Validate/ToString are updated

Struct wiring looks consistent (member, include, ctor signature, and init). Please ensure:

  • Register(ParseOptions*): flags for all ZipVoice fields (tokens, text_model, flow_matching_model, vocoder, data_dir, pinyin_dict, num_steps, feat_scale, t_shift, target_rms, guidance_scale).
  • Validate(): enforces required fields for ZipVoice and provides actionable error messages. Also consider guarding multiple-backend selection (e.g., return error if more than one of vits/matcha/kokoro/zipvoice/kitten is configured).
  • ToString(): includes the ZipVoice fields and redacts sensitive paths if that’s the convention.

Optionally, add explicit mutual-exclusion validation to prevent ambiguous backend selection. Example (illustrative, adjust to your style):

bool OfflineTtsModelConfig::Validate() const {
  int active = 0;
  active += !vits.model.empty();
  active += !matcha.acoustic_model.empty();
  active += !kokoro.model.empty();
  active += !zipvoice.text_model.empty() && !zipvoice.flow_matching_model.empty();
  active += !kitten.model.empty();

  if (active != 1) {
    SHERPA_ONNX_LOGE("Exactly one TTS backend must be configured; got %d.", active);
    return false;
  }

  // ZipVoice required fields
  if (!zipvoice.text_model.empty() || !zipvoice.flow_matching_model.empty()) {
    if (zipvoice.tokens.empty()) {
      SHERPA_ONNX_LOGE("ZipVoice requires --zipvoice-tokens.");
      return false;
    }
  }
  // ... other backend-specific checks ...
  return true;
}

Also applies to: 23-25, 35-36, 42-43

sherpa-onnx/csrc/offline-tts-impl.cc (1)

23-23: Include for ZipVoice impl — OK

Including offline-tts-zipvoice-impl.h is correct and localizes the new backend wiring here.

sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)

23-25: I’ve added searches to locate and inspect the .cc file for any out-of-line template definitions or explicit instantiations. Once the results are in, I can confirm whether the ctor needs to be moved inline or if dedicated template <> instantiations should be added.

sherpa-onnx/python/csrc/offline-tts.cc (1)

87-88: Good call: release the GIL around the long-running generate() call

Adding py::call_guardpy::gil_scoped_release() prevents blocking the interpreter during synthesis. Looks correct.

sherpa-onnx/python/csrc/offline-tts-model-config.cc (2)

14-23: Registration order is fine; ensure ZipVoice is initialized before use

You correctly register PybindOfflineTtsZipvoiceModelConfig(m) before exposing it on OfflineTtsModelConfig. No issues spotted.


27-41: No positional OfflineTtsModelConfig calls detected

I ran the suggested grep over all Python and C/C++ files for calls to OfflineTtsModelConfig(...) without keyword arguments. The output only included the class’s own definition and binding code—no client code passes positional arguments. It’s safe to keep the new zipvoice parameter ordering without breaking existing calls.

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1)

16-41: Bindings look consistent with C++ config; minor polish only

Constructors, defaults, and attributes match the C++ type and Validate/ToString methods. Good to go.

sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1)

139-156: LGTM! Clean string representation implementation.

The ToString() method properly formats all configuration fields with appropriate quoting for string fields.

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)

52-53: Fix inconsistent parameter name between declaration and usage.

The parameter is declared as num_step but used as num_steps in the implementation; rename the declaration to num_steps for consistency throughout.


297-302: Incorrect conditional logic for RMS scaling.

The condition checks if prompt_rms < target_rms but also checks target_rms > 0.0f. The second condition should check prompt_rms > 0.0f to avoid division by zero when calculating the scale factor.

-  if (prompt_rms < target_rms && target_rms > 0.0f) {
+  if (prompt_rms < target_rms && prompt_rms > 0.0f) {

Likely an incorrect or invalid review comment.
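
To make the guard concrete, here is a minimal self-contained sketch of the RMS-matching step under discussion. The function names (ComputeRms, ScaleToTargetRms) are illustrative, not the actual sherpa-onnx implementation; the point is that checking rms > 0.0f covers both the silent-prompt and division-by-zero cases.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative helpers, not the sherpa-onnx implementation.
float ComputeRms(const std::vector<float> &samples) {
  if (samples.empty()) return 0.0f;
  float sum_sq = 0.0f;
  for (float s : samples) sum_sq += s * s;
  return std::sqrt(sum_sq / samples.size());
}

void ScaleToTargetRms(std::vector<float> *samples, float target_rms) {
  float rms = ComputeRms(*samples);
  // Only scale up quiet prompts; rms > 0 also rules out division by zero
  // (empty input yields rms == 0 and is left untouched).
  if (rms > 0.0f && rms < target_rms) {
    float scale = target_rms / rms;
    for (auto &s : *samples) s *= scale;
  }
}
```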

Comment on lines +116 to +117
help="The scale classifier-free guidance during inference for for Zipvoice.",
)

⚠️ Potential issue

Fix user-facing typo in help text (“for for”)

Typo in CLI help: “for for Zipvoice.” → “for Zipvoice.” This shows up in --help.

Apply:

-        help="The scale classifier-free guidance during inference for for Zipvoice.",
+        help="The scale of classifier-free guidance during inference for Zipvoice.",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
help="The scale classifier-free guidance during inference for for Zipvoice.",
)
help="The scale of classifier-free guidance during inference for Zipvoice.",
🤖 Prompt for AI Agents
In python-api-examples/offline-zeroshot-tts.py around lines 116 to 117, the CLI
help string contains a duplicated word "for for" ("The scale classifier-free
guidance during inference for for Zipvoice."); update the help text to remove
the extra "for" so it reads "The scale classifier-free guidance during inference
for Zipvoice." and ensure surrounding punctuation and spacing remain correct.

Comment thread python-api-examples/offline-zeroshot-tts.py
Comment thread sherpa-onnx/csrc/offline-tts-impl.h
Comment on lines +34 to +41
virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
float speed = 1.0, int32_t num_step = 4,
GeneratedAudioCallback callback = nullptr) const {
SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
exit(-1);
}

🛠️ Refactor suggestion

Same issue for zero-shot Generate overload; also fix num_step naming

The zero-shot overload also logs and calls exit(-1). Use a pure-virtual or throw. Also s/num_step/num_steps for consistency with the rest of the codebase and Python binding.

-  virtual GeneratedAudio Generate(
+  virtual GeneratedAudio Generate(
       const std::string &text, const std::string &prompt_text,
       const std::vector<float> &prompt_samples, int32_t sample_rate,
-      float speed = 1.0, int32_t num_step = 4,
-      GeneratedAudioCallback callback = nullptr) const {
-    SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
-    exit(-1);
-  }
+      float speed = 1.0, int32_t num_steps = 4,
+      GeneratedAudioCallback callback = nullptr) const = 0;

Or, if keeping a body:

-    SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
-    exit(-1);
+    throw std::runtime_error("OfflineTtsImpl backend does not support zero-shot Generate()");
📝 Committable suggestion


Suggested change
virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
float speed = 1.0, int32_t num_step = 4,
GeneratedAudioCallback callback = nullptr) const {
SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
exit(-1);
}
virtual GeneratedAudio Generate(
const std::string &text,
const std::string &prompt_text,
const std::vector<float> &prompt_samples,
int32_t sample_rate,
float speed = 1.0, int32_t num_steps = 4,
GeneratedAudioCallback callback = nullptr) const = 0;

Comment thread sherpa-onnx/csrc/offline-tts-impl.h
Comment thread sherpa-onnx/csrc/offline-tts.h Outdated
Comment on lines +98 to +116
// @param text The string to be generated.
// @param prompt_text The transcribe of prompt_sampes.
// @param prompt_samples The audio samples of the prompt, single channel pcm.
// @param sample_rate The sample rate of the prompt audio.
// @param speed The speed for the generated speech. E.g., 2 means 2x faster.
// @param num_steps The number of flow steps to generate the audio.
// @param callback If not NULL, it is called whenever config.max_num_sentences
// sentences have been processed. Note that the passed
// pointer `samples` for the callback might be invalidated
// after the callback is returned, so the caller should not
// keep a reference to it. The caller can copy the data if
// he/she wants to access the samples after the callback
// returns. The callback is called in the current thread.
GeneratedAudio Generate(const std::string &text,
const std::string &prompt_text,
const std::vector<float> &prompt_samples,
int32_t sample_rate, float speed = 1.0,
int32_t num_steps = 4,
GeneratedAudioCallback callback = nullptr) const;

💡 Verification agent

🧩 Analysis chain

Fix typos and improve parameter docs for the new Generate overload.

  • Typo: “transcribe of prompt_sampes” → “transcript of prompt_samples”.
  • Consider clarifying that prompt_samples are mono PCM floats in [-1, 1] and sample_rate must be positive.
  • If num_steps here is intended to override config.model.zipvoice.num_steps for this call, document precedence.

Apply this doc fix (typos + minor wording):

-  // @param text The string to be generated.
-  // @param prompt_text The transcribe of prompt_sampes.
-  // @param prompt_samples The audio samples of the prompt, single channel pcm.
-  // @param sample_rate The sample rate of the prompt audio.
+  // @param text The string to be synthesized.
+  // @param prompt_text The transcript of prompt_samples.
+  // @param prompt_samples The prompt audio samples (mono PCM float, range [-1, 1]).
+  // @param sample_rate The sample rate (Hz) of prompt_samples; must be > 0.
   // @param speed The speed for the generated speech. E.g., 2 means 2x faster.
-  // @param num_steps The number of flow steps to generate the audio.
+  // @param num_steps The number of flow steps used to generate the audio.
   // @param callback If not NULL, it is called whenever config.max_num_sentences
   //                 sentences have been processed. Note that the passed
   //                 pointer `samples` for the callback might be invalidated
   //                 after the callback is returned, so the caller should not
   //                 keep a reference to it. The caller can copy the data if
   //                 he/she wants to access the samples after the callback
   //                 returns. The callback is called in the current thread.

Please confirm precedence between this num_steps parameter and config.model.zipvoice.num_steps (per-call override vs. config default), and update the comment accordingly.


I’ll check how num_steps interacts with config.model.zipvoice.num_steps in the implementation.


🏁 Script executed:

#!/bin/bash
# Locate the offline-tts implementation file
fd -t f -g "offline-tts.cc"

# Inspect the Generate overload to see how num_steps is used
rg -n "Generate" -C3 --ignore-case sherpa-onnx/csrc/offline-tts.cc

# Find where config.model.zipvoice.num_steps is referenced
rg -n "zipvoice" -C3 sherpa-onnx

Length of output: 43138


🏁 Script executed:

#!/bin/bash
# Search for the implementation of Generate in OfflineTtsZipvoiceImpl
rg -n "class OfflineTtsZipvoiceImpl" -C5 sherpa-onnx/csrc/offline-tts-zipvoice-impl.h

# Search for the Generate method declaration in the same file
rg -n "Generate" -C3 sherpa-onnx/csrc/offline-tts-zipvoice-impl.h

Length of output: 1574


🏁 Script executed:

#!/bin/bash
# Check if the ZipVoice override parameter num_steps is used in preference to config.model.zipvoice.num_steps
rg -n "config\\.model\\.zipvoice\\.num_steps" -C3 sherpa-onnx/csrc/offline-tts-zipvoice-impl.cc

# Inspect the Generate implementation to see how the passed-in num_steps is applied
sed -n '1,200p' sherpa-onnx/csrc/offline-tts-zipvoice-impl.cc | rg -n "Generate" -C20

Length of output: 353


Update docs for Generate overload in offline-tts.h

Confirmed that the num_steps parameter in this overload is always used directly (it takes precedence over config.model.zipvoice.num_steps). Please apply the following documentation fixes:

--- a/sherpa-onnx/csrc/offline-tts.h
+++ b/sherpa-onnx/csrc/offline-tts.h
@@ -98,11 +98,13 @@ class OfflineTtsImpl {
-  // @param text The string to be generated.
-  // @param prompt_text The transcribe of prompt_sampes.
-  // @param prompt_samples The audio samples of the prompt, single channel pcm.
-  // @param sample_rate The sample rate of the prompt audio.
+  // @param text The string to be synthesized.
+  // @param prompt_text The transcript of `prompt_samples`.
+  // @param prompt_samples The prompt audio samples (mono PCM floats in [-1, 1]).
+  // @param sample_rate The sample rate of `prompt_samples` in Hz; must be > 0.
   // @param speed The speed for the generated speech. E.g., 2 means 2× faster.
-  // @param num_steps The number of flow steps to generate the audio.
+  // @param num_steps The number of flow steps used to generate audio.
+  //                  This per-call value overrides `config.model.zipvoice.num_steps`.
   // @param callback If not NULL, it is called whenever config.max_num_sentences
   //                 sentences have been processed. Note that the passed
   //                 pointer `samples` for the callback might be invalidated

sherpa-onnx/csrc/offline-tts.h:98–104

📝 Committable suggestion


Suggested change
// @param text The string to be generated.
// @param prompt_text The transcribe of prompt_sampes.
// @param prompt_samples The audio samples of the prompt, single channel pcm.
// @param sample_rate The sample rate of the prompt audio.
// @param speed The speed for the generated speech. E.g., 2 means 2x faster.
// @param num_steps The number of flow steps to generate the audio.
// @param callback If not NULL, it is called whenever config.max_num_sentences
// sentences have been processed. Note that the passed
// pointer `samples` for the callback might be invalidated
// after the callback is returned, so the caller should not
// keep a reference to it. The caller can copy the data if
// he/she wants to access the samples after the callback
// returns. The callback is called in the current thread.
GeneratedAudio Generate(const std::string &text,
const std::string &prompt_text,
const std::vector<float> &prompt_samples,
int32_t sample_rate, float speed = 1.0,
int32_t num_steps = 4,
GeneratedAudioCallback callback = nullptr) const;
// @param text The string to be synthesized.
// @param prompt_text The transcript of `prompt_samples`.
// @param prompt_samples The prompt audio samples (mono PCM floats in [-1, 1]).
// @param sample_rate The sample rate of `prompt_samples` in Hz; must be > 0.
// @param speed The speed for the generated speech. E.g., 2× faster.
// @param num_steps The number of flow steps used to generate audio.
// This per-call value overrides `config.model.zipvoice.num_steps`.
// @param callback If not NULL, it is called whenever config.max_num_sentences
// sentences have been processed. Note that the passed
// pointer `samples` for the callback might be invalidated
// after the callback is returned, so the caller should not
// keep a reference to it. The caller can copy the data if
// he/she wants to access the samples after the callback
// returns. The callback is called in the current thread.
GeneratedAudio Generate(const std::string &text,
const std::string &prompt_text,
const std::vector<float> &prompt_samples,
int32_t sample_rate, float speed = 1.0,
int32_t num_steps = 4,
GeneratedAudioCallback callback = nullptr) const;
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts.h around lines 98 to 116, the Generate overload
docs are ambiguous about num_steps; update the comment for the num_steps
parameter to state explicitly that this argument is always used directly (it
takes precedence over config.model.zipvoice.num_steps) and that it controls the
number of flow steps for audio generation; adjust any wording that suggests the
config can override this parameter and keep the rest of the parameter
descriptions unchanged.

Comment thread sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc
Comment thread sherpa-onnx/csrc/text-utils-test.cc Outdated
Comment on lines +194 to +201
if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
// if the previous part is "other", we start a new sentence
oss << text_parts[i];
t_lang = types[i];
} else {
if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" &&
t_lang != "tag") {
// same language, continue

⚠️ Potential issue

Fix always-true condition: use && instead of ||

(types[i] != "tag" || types[i] != "pinyin") is always true, so the branch executes for every types[i]. The intended logic is clearly “not tag and not pinyin”.

Apply this diff:

-      if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
+      if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) {
📝 Committable suggestion


Suggested change
if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
// if the previous part is "other", we start a new sentence
oss << text_parts[i];
t_lang = types[i];
} else {
if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" &&
t_lang != "tag") {
// same language, continue
if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) {
// if the previous part is "other", we start a new sentence
oss << text_parts[i];
t_lang = types[i];
} else {
if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" &&
t_lang != "tag") {
// same language, continue
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/text-utils-test.cc around lines 194 to 201, the condition if
(t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) is always
true; change the inner boolean to use && so it reads (types[i] != "tag" &&
types[i] != "pinyin") to correctly express "not tag AND not pinyin", keeping
existing parentheses and spacing.
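
A tiny self-contained demonstration of why the original predicate can never be false (the helper names here are illustrative, not from the codebase):

```cpp
#include <cassert>
#include <string>

// Buggy form: a string can never equal both "tag" and "pinyin" at once,
// so at least one inequality holds and the disjunction is always true.
bool BuggyNotTagOrPinyin(const std::string &t) {
  return t != "tag" || t != "pinyin";
}

// Intended form: true only when t is neither "tag" nor "pinyin".
bool NotTagAndNotPinyin(const std::string &t) {
  return t != "tag" && t != "pinyin";
}
```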

Comment thread sherpa-onnx/csrc/text-utils.cc Outdated
Comment on lines +108 to +120
std::vector<char> buffer;
if (!config.matcha.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(mgr, config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
config.zipvoice.vocoder.c_str());
buffer = ReadFile(mgr, config.zipvoice.vocoder);
} else {
SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
exit(-1);
}


🛠️ Refactor suggestion

Same fix for the resource-manager overload

Mirror the selected-path handling and avoid exit(-1).

-  std::vector<char> buffer;
+  std::vector<char> buffer;
+  std::string vocoder_path;
   if (!config.matcha.vocoder.empty()) {
     SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
-    buffer = ReadFile(mgr, config.matcha.vocoder);
+    vocoder_path = config.matcha.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else if (!config.zipvoice.vocoder.empty()) {
     SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
                      config.zipvoice.vocoder.c_str());
-    buffer = ReadFile(mgr, config.zipvoice.vocoder);
+    vocoder_path = config.zipvoice.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else {
     SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-    exit(-1);
+    return nullptr;
   }

Follow-up required outside this hunk:

  • In the subsequent kHifigan branch, pass vocoder_path to HifiganVocoder(mgr, ...).
📝 Committable suggestion


Suggested change
std::vector<char> buffer;
if (!config.matcha.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(mgr, config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
config.zipvoice.vocoder.c_str());
buffer = ReadFile(mgr, config.zipvoice.vocoder);
} else {
SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
exit(-1);
}
std::vector<char> buffer;
std::string vocoder_path;
if (!config.matcha.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
vocoder_path = config.matcha.vocoder;
buffer = ReadFile(mgr, vocoder_path);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
config.zipvoice.vocoder.c_str());
vocoder_path = config.zipvoice.vocoder;
buffer = ReadFile(mgr, vocoder_path);
} else {
SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
return nullptr;
}
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/vocoder.cc around lines 108 to 120, the current logic
directly reads into a buffer and calls exit(-1) on missing vocoder; instead,
create a std::string vocoder_path and set it from config.matcha.vocoder or
config.zipvoice.vocoder (mirroring the selected-path handling), log which path
is chosen, use ReadFile(mgr, vocoder_path) to populate the buffer, and remove
the exit(-1) path by returning an error or throwing as appropriate;
additionally, in the later kHifigan branch, pass this vocoder_path into the
HifiganVocoder(mgr, ...) constructor call.

@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Aug 27, 2025
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
sherpa-onnx/csrc/vocoder.cc (2)

93-96: Pass the actually selected vocoder_path to HifiganVocoder

Using config.matcha.vocoder here breaks when the selected path is ZipVoice’s.

-      return std::make_unique<HifiganVocoder>(
-          config.num_threads, config.provider, config.matcha.vocoder);
+      return std::make_unique<HifiganVocoder>(
+          config.num_threads, config.provider, vocoder_path);

125-127: Pass vocoder_path in the Manager overload as well

-      return std::make_unique<HifiganVocoder>(
-          mgr, config.num_threads, config.provider, config.matcha.vocoder);
+      return std::make_unique<HifiganVocoder>(
+          mgr, config.num_threads, config.provider, vocoder_path);
♻️ Duplicate comments (7)
sherpa-onnx/csrc/vocoder.cc (2)

77-88: Use info-level logs, avoid exit(), and capture the selected vocoder path

Switch to info logs for selection, record the chosen path, and return nullptr instead of terminating the process. Also read via the selected path.

-  std::vector<char> buffer;
+  std::vector<char> buffer;
+  std::string vocoder_path;
   if (!config.matcha.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
-    buffer = ReadFile(config.matcha.vocoder);
+    SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+    vocoder_path = config.matcha.vocoder;
+    buffer = ReadFile(vocoder_path);
   } else if (!config.zipvoice.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
-                     config.zipvoice.vocoder.c_str());
-    buffer = ReadFile(config.zipvoice.vocoder);
+    SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
+                     config.zipvoice.vocoder.c_str());
+    vocoder_path = config.zipvoice.vocoder;
+    buffer = ReadFile(vocoder_path);
   } else {
     SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-    exit(-1);
+    return nullptr;
   }

108-119: Mirror the same fixes for the Manager overload (log level, selected path, no exit)

Use info logs, track vocoder_path, and avoid exit.

-  std::vector<char> buffer;
+  std::vector<char> buffer;
+  std::string vocoder_path;
   if (!config.matcha.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
-    buffer = ReadFile(mgr, config.matcha.vocoder);
+    SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+    vocoder_path = config.matcha.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else if (!config.zipvoice.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
-                     config.zipvoice.vocoder.c_str());
-    buffer = ReadFile(mgr, config.zipvoice.vocoder);
+    SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
+                     config.zipvoice.vocoder.c_str());
+    vocoder_path = config.zipvoice.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else {
     SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
     return nullptr;
   }
sherpa-onnx/csrc/offline-tts-impl.h (3)

28-34: Prefer pure-virtual over throwing in the base Generate()

Restores compile-time guarantees and avoids surprising runtime throws for non-zero-shot backends.

-  virtual GeneratedAudio Generate(
-      const std::string &text, int64_t sid = 0, float speed = 1.0,
-      GeneratedAudioCallback callback = nullptr) const {
-    throw std::runtime_error(
-        "OfflineTtsImpl backend does not support non zero-shot Generate()");
-  }
+  virtual GeneratedAudio Generate(
+      const std::string &text, int64_t sid = 0, float speed = 1.0,
+      GeneratedAudioCallback callback = nullptr) const = 0;

35-43: Same for zero-shot overload; also align param name with public API (num_steps)

Use pure-virtual and rename num_step to num_steps for consistency.

-  virtual GeneratedAudio Generate(
-      const std::string &text, const std::string &prompt_text,
-      const std::vector<float> &prompt_samples, int32_t sample_rate,
-      float speed = 1.0, int32_t num_step = 4,
-      GeneratedAudioCallback callback = nullptr) const {
-    throw std::runtime_error(
-        "OfflineTtsImpl backend does not support zero-shot Generate()");
-  }
+  virtual GeneratedAudio Generate(
+      const std::string &text, const std::string &prompt_text,
+      const std::vector<float> &prompt_samples, int32_t sample_rate,
+      float speed = 1.0, int32_t num_steps = 4,
+      GeneratedAudioCallback callback = nullptr) const = 0;

49-52: Keep NumSpeakers() pure-virtual

Avoids hard runtime errors and keeps interfaces explicit.

-  virtual int32_t NumSpeakers() const {
-    throw std::runtime_error(
-        "Zero-shot OfflineTts does not support NumSpeakers()");
-  }
+  virtual int32_t NumSpeakers() const = 0;
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)

10-10: Replace deprecated &lt;strstream&gt; and add missing &lt;array&gt; include

&lt;array&gt; is required for std::array used below; &lt;strstream&gt; is deprecated.

-#include <strstream>
+#include <sstream>
+#include <array>

235-249: Handle empty prompt_samples to avoid division by zero in RMS

Protect sqrt(sum/0) and only scale when non-empty.

-    prompt_rms = std::sqrt(sum_sq / prompt_samples_scaled.size());
-    if (prompt_rms < target_rms && prompt_rms > 0.0f) {
+    if (!prompt_samples_scaled.empty()) {
+      prompt_rms = std::sqrt(sum_sq / prompt_samples_scaled.size());
+    } else {
+      prompt_rms = 0.0f;
+    }
+    if (!prompt_samples_scaled.empty() &&
+        prompt_rms < target_rms && prompt_rms > 0.0f) {
       float scale = target_rms / static_cast<float>(prompt_rms);
       for (auto &s : prompt_samples_scaled) {
         s *= scale;
       }
     }
🧹 Nitpick comments (13)
sherpa-onnx/csrc/text-utils-test.cc (2)

7-8: Remove unused headers; include what you use

<regex> and <sstream> aren’t used. The file uses std::vector, uint8_t, and streams; include those explicitly to avoid relying on transitive headers.

-#include <regex>
-#include <sstream>
+#include <vector>
+#include <cstdint>
+#include <iostream>

85-89: Deduplicate identical tests for 0x20 0xC4 case

Three tests assert the same behavior (space followed by orphan 0xC4 → space). Keep one with a precise name to reduce noise.

-TEST(RemoveInvalidUtf8Sequences, BreakingCase_SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space followed by an invalid byte
-  std::string expected = " ";      // 0xC4 removed
+TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidLeadByteAfterSpace) {
+  std::string input = "\x20\xC4";  // space 0x20 then orphan lead byte 0xC4
+  std::string expected = " ";      // 0xC4 removed; space remains
   EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
 }
 
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space (0x20) followed by invalid (0xC4)
-  std::string expected = " ";      // Space remains, 0xC4 is removed
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
-
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte_Breaking) {
-  std::string input = "\x20\xc4";  // Space followed by invalid `0xc4`
-  std::string expected = " ";      // `0xc4` should be removed, space remains
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}

Also applies to: 102-106, 114-118
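
For context, a minimal sketch of the stripping behavior the deduplicated tests assert. This is an illustrative reimplementation under standard UTF-8 well-formedness rules, not the actual sherpa-onnx RemoveInvalidUtf8Sequences: orphan lead bytes (like the trailing 0xC4), stray continuation bytes, and truncated sequences are dropped, while complete sequences pass through.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative sketch: keep only well-formed UTF-8 sequences.
std::string RemoveInvalidUtf8SequencesSketch(const std::string &s) {
  std::string out;
  size_t i = 0;
  while (i < s.size()) {
    uint8_t c = static_cast<uint8_t>(s[i]);
    size_t len = 0;
    if (c < 0x80) len = 1;                 // ASCII
    else if ((c & 0xE0) == 0xC0) len = 2;  // 2-byte lead
    else if ((c & 0xF0) == 0xE0) len = 3;  // 3-byte lead
    else if ((c & 0xF8) == 0xF0) len = 4;  // 4-byte lead
    else { ++i; continue; }                // stray continuation/invalid byte
    if (i + len > s.size()) { ++i; continue; }  // truncated at end of input
    bool ok = true;
    for (size_t j = 1; j < len; ++j) {
      if ((static_cast<uint8_t>(s[i + j]) & 0xC0) != 0x80) { ok = false; break; }
    }
    if (ok) { out.append(s, i, len); i += len; }
    else { ++i; }  // drop the bad lead byte, rescan from the next byte
  }
  return out;
}
```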

sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (8)

5-11: Add missing &lt;cstdio&gt; header for printf/fprintf; drop unused &lt;fstream&gt; include.

Include &lt;cstdio&gt; explicitly and remove unused &lt;fstream&gt; to avoid transitive-dependency build fragility and warnings.

-#include <chrono>  // NOLINT
-#include <fstream>
+#include <chrono>  // NOLINT
+#include <cstdio>

13-17: Throttle progress logs; current per-callback printf will spam and slow down TTS.

Print only on noticeable progress deltas and reuse the same line.

-static int32_t AudioCallback(const float * /*samples*/, int32_t n,
-                             float progress) {
-  printf("sample=%d, progress=%f\n", n, progress);
-  return 1;
-}
+static int32_t AudioCallback(const float * /*samples*/, int32_t /*n*/,
+                             float progress) {
+  static float last = -1.0f;
+  if (progress >= 1.0f || progress - last >= 0.01f) {
+    printf("\rprogress=%.1f%%", progress * 100.0f);
+    fflush(stdout);
+    last = progress;
+  }
+  return 1;
+}

69-71: Validate num-steps and speed after parsing.

Prevent invalid inputs early; avoids cryptic downstream failures.

   config.Register(&po);
   po.Read(argc, argv);
 
+  if (num_steps <= 0) {
+    fprintf(stderr, "Error: --num-steps must be >= 1\n");
+    exit(EXIT_FAILURE);
+  }
+  if (speed <= 0) {
+    fprintf(stderr, "Error: --speed must be > 0\n");
+    exit(EXIT_FAILURE);
+  }

62-66: Fix help text wording and clarify expected audio format.

Minor copyedits; clarify it’s a mono PCM WAV file.

-  po.Register("prompt-text", &prompt_text, "The transcribe of prompt_samples.");
+  po.Register("prompt-text", &prompt_text,
+              "Transcript of the prompt audio (must match --prompt-audio).");
 
   po.Register("prompt-audio", &prompt_audio,
-              "The prompt audio file, single channel pcm. ");
+              "Path to the prompt audio file (single-channel PCM WAV).");

78-83: Make the positional-arg guidance OS-agnostic.

Users on Windows often need double quotes. Use “quotes” generically.

-    fprintf(stderr,
-            "Error: Accept only one positional argument. Please use single "
-            "quotes to wrap your text\n");
+    fprintf(stderr,
+            "Error: Accept only one positional argument. Please use quotes to "
+            "wrap your text if it contains spaces.\n");

107-110: Unify failure exit path.

Elsewhere you use exit(EXIT_FAILURE); returning -1 here is inconsistent.

-    fprintf(stderr, "Failed to read '%s'\n", prompt_audio.c_str());
-    return -1;
+    fprintf(stderr, "Failed to read '%s'\n", prompt_audio.c_str());
+    exit(EXIT_FAILURE);

124-127: Simplify elapsed time calculation.

Cleaner and avoids integer truncation concerns.

-  float elapsed_seconds =
-      std::chrono::duration_cast<std::chrono::milliseconds>(end - begin)
-          .count() /
-      1000.;
+  float elapsed_seconds =
+      std::chrono::duration<float>(end - begin).count();
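
To illustrate the simplification, a self-contained sketch (the ElapsedSeconds helper name is ours, not the tool's): std::chrono::duration&lt;float&gt; converts any clock interval to fractional seconds in one step, with no cast-to-milliseconds-then-divide and no integer truncation.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Sketch: fractional seconds directly from a steady_clock interval.
float ElapsedSeconds(std::chrono::steady_clock::time_point begin,
                     std::chrono::steady_clock::time_point end) {
  return std::chrono::duration<float>(end - begin).count();
}
```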

143-145: Send success messages to stdout, keep diagnostics on stderr.

Improves CLI UX and stream separation.

-  fprintf(stderr, "The text is: %s.\n", po.GetArg(1).c_str());
-  fprintf(stderr, "Saved to %s successfully!\n", output_filename.c_str());
+  fprintf(stdout, "The text is: %s.\n", po.GetArg(1).c_str());
+  fprintf(stdout, "Saved to %s successfully!\n", output_filename.c_str());
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1)

16-17: Optional: add a brief class docstring.

Improves discoverability in help() and IDEs without impacting ABI.

-  py::class_<PyClass>(*m, "OfflineTtsZipvoiceModelConfig")
+  py::class_<PyClass>(*m, "OfflineTtsZipvoiceModelConfig",
+                      "Configuration for ZipVoice offline TTS.\n"
+                      "Fields: tokens, text_model, flow_matching_model, vocoder, "
+                      "data_dir, pinyin_dict, feat_scale, t_shift, target_rms, guidance_scale.")
sherpa-onnx/csrc/vocoder.cc (1)

26-30: Fix typo: kVocoos → kVocos

 enum class ModelType : std::uint8_t {
   kHifigan,
-  kVocoos,
+  kVocos,
   kUnknown,
 };

-  } else if (model_type == "vocos") {
-    return ModelType::kVocoos;
+  } else if (model_type == "vocos") {
+    return ModelType::kVocos;

-    case ModelType::kVocoos:
+    case ModelType::kVocos:
       return std::make_unique<VocosVocoder>(config);

Also applies to: 68-69, 95-96

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)

124-131: Use debug/info-level logging for resampler creation

Don’t log at error level for expected paths; gate on debug.

-      SHERPA_ONNX_LOGE(
-          "Creating a resampler:\n"
-          "   in_sample_rate: %d\n"
-          "   output_sample_rate: %d\n",
-          sample_rate, static_cast<int32_t>(meta.sample_rate));
+      if (config_.model.debug) {
+        SHERPA_ONNX_LOGI(
+            "Creating a resampler:\n"
+            "   in_sample_rate: %d\n"
+            "   output_sample_rate: %d",
+            sample_rate, static_cast<int32_t>(meta.sample_rate));
+      }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e0ddc39 and 10e4999.

📒 Files selected for processing (15)
  • python-api-examples/offline-zeroshot-tts.py (1 hunks)
  • sherpa-onnx/csrc/offline-tts-impl.h (2 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts.cc (1 hunks)
  • sherpa-onnx/csrc/offline-tts.h (1 hunks)
  • sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1 hunks)
  • sherpa-onnx/csrc/text-utils-test.cc (3 hunks)
  • sherpa-onnx/csrc/text-utils.cc (1 hunks)
  • sherpa-onnx/csrc/vocoder.cc (2 hunks)
  • sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc
  • sherpa-onnx/csrc/offline-tts.cc
  • python-api-examples/offline-zeroshot-tts.py
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h
  • sherpa-onnx/csrc/offline-tts.h
  • sherpa-onnx/csrc/text-utils.cc
  • sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h
  • sherpa-onnx/csrc/offline-tts-zipvoice-model.cc
  • sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc
🧰 Additional context used
🧬 Code graph analysis (5)
sherpa-onnx/csrc/vocoder.cc (2)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1)
  • config (8-22)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1)
  • config (8-24)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (5)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (4)
  • ToString (131-147)
  • ToString (131-131)
  • Validate (40-129)
  • Validate (40-40)
sherpa-onnx/python/csrc/offline-tts-model-config.cc (1)
  • PybindOfflineTtsModelConfig (15-48)
sherpa-onnx/python/csrc/offline-tts-kokoro-model-config.cc (1)
  • PybindOfflineTtsKokoroModelConfig (11-38)
sherpa-onnx/python/csrc/offline-tts-vits-model-config.cc (1)
  • PybindOfflineTtsVitsModelConfig (11-37)
sherpa-onnx/python/csrc/offline-tts-matcha-model-config.cc (1)
  • PybindOfflineTtsMatchaModelConfig (11-37)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (8)
sherpa-onnx/csrc/offline-tts-impl.h (2)
  • sherpa_onnx (16-56)
  • OfflineTtsImpl (18-52)
sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (2)
  • sherpa_onnx (17-60)
  • OfflineTtsZipvoiceFrontend (19-58)
sherpa-onnx/csrc/offline-tts.h (1)
  • sherpa_onnx (16-129)
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1)
  • sherpa_onnx (11-28)
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (2)
  • sherpa_onnx (15-39)
  • OfflineTtsZipvoiceModel (17-37)
sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (7)
  • OfflineTtsZipvoiceModel (288-290)
  • OfflineTtsZipvoiceModel (293-295)
  • OfflineTtsZipvoiceModel (297-297)
  • OfflineTtsZipvoiceModel (314-315)
  • OfflineTtsZipvoiceModel (319-320)
  • tokens (58-180)
  • tokens (58-59)
sherpa-onnx/csrc/vocoder.cc (6)
  • Create (76-103)
  • Create (76-76)
  • Create (106-133)
  • Create (106-107)
  • Create (136-137)
  • Create (141-142)
sherpa-onnx/csrc/vocoder.h (1)
  • Vocoder (17-31)
sherpa-onnx/csrc/text-utils-test.cc (1)
sherpa-onnx/csrc/text-utils.cc (2)
  • RemoveInvalidUtf8Sequences (486-568)
  • RemoveInvalidUtf8Sequences (486-487)
sherpa-onnx/csrc/offline-tts-impl.h (4)
harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1)
  • callback (469-496)
sherpa-onnx/csrc/offline-recognizer-impl.h (1)
  • std (36-40)
scripts/go/sherpa_onnx.go (1)
  • GeneratedAudio (963-968)
sherpa-onnx/csrc/offline-tts.cc (4)
  • SampleRate (239-239)
  • SampleRate (239-239)
  • NumSpeakers (241-241)
  • NumSpeakers (241-241)
🔇 Additional comments (9)
sherpa-onnx/csrc/text-utils-test.cc (1)

109-112: LGTM: truncated lead byte is removed

Expectation aligns with RemoveInvalidUtf8Sequences semantics (drop invalid 0xC4, keep surrounding bytes).
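For readers checking the expected semantics from Python, decoding with `errors="ignore"` is a close analogue: the truncated 0xC4 lead byte is dropped and the valid bytes around it survive.

```python
def remove_invalid_utf8(data: bytes) -> str:
    """Rough Python analogue of RemoveInvalidUtf8Sequences: drop bytes that
    do not form a valid UTF-8 sequence, keeping the valid ones."""
    return data.decode("utf-8", errors="ignore")

# 0xC4 is a 2-byte lead byte with no continuation byte; it is dropped
# while the surrounding ASCII bytes are kept.
assert remove_invalid_utf8(b"ab\xc4cd") == "abcd"
# Valid multi-byte input passes through unchanged.
assert remove_invalid_utf8("日本語".encode("utf-8")) == "日本語"
```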

sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)

95-98: Wording now matches validation (both flags required).

Good fix; aligns error message with the OR-check.

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (5)

5-10: Includes look correct and minimal.

Matches the pattern used by other TTS config bindings.


27-36: Attributes exposure LGTM.

Read/write fields mirror the C++ struct and match Validate/ToString usage.


37-38: str and validate bindings are correct.

Consistent with other TTS configs; returns ToString() and Validate().


16-26: ✔️ C++ ctor signature and defaults are correctly bound.

Confirmed that OfflineTtsZipvoiceModelConfig declares a matching 10-arg constructor (lines 35–50 in sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h) and that all field defaults align:

  • data_dir and pinyin_dict default to empty strings (default-constructed std::string).
  • feat_scale = 0.1, t_shift = 0.5, target_rms = 0.1, guidance_scale = 1.0 match both the C++ ctor defaults and field initializers.

No changes required.


13-39: Resolved: Zipvoice binding already integrated

The OfflineTtsZipvoiceModelConfig binding registration and Python export are already present:

• In sherpa-onnx/python/csrc/offline-tts-model-config.cc, line 22 calls PybindOfflineTtsZipvoiceModelConfig(m);.
• In sherpa-onnx/python/sherpa_onnx/__init__.py, line 52 exports OfflineTtsZipvoiceModelConfig.

No further changes are needed.

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)

258-286: Verify de-scaling mel by feat_scale is theoretically justified

You scale prompt_features by feat_scale before inference, then divide mel by feat_scale afterwards. Given model nonlinearity, this may not invert cleanly. Recommend ablation/bench before keeping.
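The concern can be made concrete with a toy sketch (function names hypothetical): de-scaling the output by feat_scale inverts the input scaling exactly only when the model commutes with scalar multiplication, which a nonlinear network generally does not.

```python
def scale_roundtrip(x, feat_scale, model):
    # Scale the input, run the model, then de-scale the output --
    # the pattern applied to prompt features and the generated mel.
    return model(x * feat_scale) / feat_scale

linear = lambda v: 2.0 * v    # homogeneous of degree 1: round-trip is exact
square = lambda v: v * v      # nonlinear: round-trip is not exact

assert abs(scale_roundtrip(3.0, 0.1, linear) - linear(3.0)) < 1e-9
assert scale_roundtrip(3.0, 0.1, square) != square(3.0)   # 0.9 vs 9.0
```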


168-173: Confirm window_type Value with kaldi-native-fbank
I wasn’t able to locate a definitive list of accepted window_type strings for knf::FrameExtractionOptions in the kaldi-native-fbank source. Since some implementations expect “hann” or “hamming” rather than “hanning,” please:

  • Consult the kaldi-native-fbank documentation or header where FrameExtractionOptions::window_type is defined.
  • Verify that “hanning” is recognized (or adjust to “hann”/“hamming” as required).

Comment on lines +35 to 43
virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
float speed = 1.0, int32_t num_step = 4,
GeneratedAudioCallback callback = nullptr) const {
throw std::runtime_error(
"OfflineTtsImpl backend does not support zero-shot Generate()");
}


💡 Verification agent

🧩 Analysis chain

Check for mixed usage of num_step vs num_steps across the repo

Run to confirm consistent naming:


🏁 Script executed:

#!/bin/bash
rg -nP -C2 '\bnum_step(s)?\b' --type=cpp --type=cc --type=h --type=hpp

Length of output: 99


To ensure we catch both num_step and num_steps occurrences across all common C/C++ file extensions, let’s retry with explicit glob patterns:


🏁 Script executed:

#!/bin/bash
# Search for both 'num_step' and 'num_steps' in C/C++ sources
rg -nP -C2 '\bnum_steps?\b' \
  -g '*.cpp' -g '*.cc' -g '*.cxx' \
  -g '*.h'   -g '*.hpp' -g '*.hh' .

Length of output: 11018


Parameter Naming Consistency: Rename num_step to num_steps

There’s a single occurrence of the singular num_step in the implementation header, while the rest of the public API and all other overrides consistently use num_steps. Let’s align them.

• sherpa-onnx/csrc/offline-tts-impl.h:38
Replace

virtual GeneratedAudio Generate(
    const std::string &text, const std::string &prompt_text,
    const std::vector<float> &prompt_samples, int32_t sample_rate,
    float speed = 1.0, int32_t num_step = 4,
    GeneratedAudioCallback callback = nullptr) const {
  …
}

with

virtual GeneratedAudio Generate(
    const std::string &text, const std::string &prompt_text,
    const std::vector<float> &prompt_samples, int32_t sample_rate,
    float speed = 1.0, int32_t num_steps = 4,
    GeneratedAudioCallback callback = nullptr) const {

After renaming, rebuild to ensure no breakages in overrides.

🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts-impl.h around lines 35 to 43, the parameter name
num_step is inconsistent with the rest of the public API which uses num_steps;
rename the parameter from num_step to num_steps in the Generate declaration and
any matching default value or usages in that header so the signature matches
overrides, then rebuild and run tests to ensure no remaining mismatches.

Comment on lines +30 to +43
explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)),
vocoder_(Vocoder::Create(config.model)) {
InitFrontend();
}

template <typename Manager>
OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)),
vocoder_(Vocoder::Create(mgr, config.model)) {
InitFrontend(mgr);
}

🛠️ Refactor suggestion

Guard against null vocoder_ from Create()

Avoid null deref in Generate/Process. Throw a recoverable error.

   explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config)
       : config_(config),
         model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)),
         vocoder_(Vocoder::Create(config.model)) {
-    InitFrontend();
+    if (!vocoder_) {
+      throw std::runtime_error("Vocoder::Create() returned null");
+    }
+    InitFrontend();
   }
@@
   OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config)
       : config_(config),
         model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)),
         vocoder_(Vocoder::Create(mgr, config.model)) {
-    InitFrontend(mgr);
+    if (!vocoder_) {
+      throw std::runtime_error("Vocoder::Create(mgr) returned null");
+    }
+    InitFrontend(mgr);
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)),
vocoder_(Vocoder::Create(config.model)) {
InitFrontend();
}
template <typename Manager>
OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)),
vocoder_(Vocoder::Create(mgr, config.model)) {
InitFrontend(mgr);
}
explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)),
vocoder_(Vocoder::Create(config.model)) {
if (!vocoder_) {
throw std::runtime_error("Vocoder::Create() returned null");
}
InitFrontend();
}
template <typename Manager>
OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)),
vocoder_(Vocoder::Create(mgr, config.model)) {
if (!vocoder_) {
throw std::runtime_error("Vocoder::Create(mgr) returned null");
}
InitFrontend(mgr);
}
🤖 Prompt for AI Agents
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h around lines 30 to 43, the
constructors call Vocoder::Create(...) but don't guard against it returning
nullptr which can cause null deref later in Generate/Process; immediately after
initializing vocoder_ in both constructors check if vocoder_ is null and throw a
recoverable exception (e.g., std::runtime_error or the project's designated
recoverable error type) with a clear message including context (class name and
model identifier) so callers can handle the failure instead of crashing.

Comment on lines +101 to +117
void InitFrontend() {
const auto &meta_data = model_->GetMetaData();

if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
SHERPA_ONNX_LOGE(
"Please provide --zipvoice-pinyin-dict for converting Chinese into "
"pinyin.");
exit(-1);
}
if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
exit(-1);
}
frontend_ = std::make_unique<OfflineTtsZipvoiceFrontend>(
config_.model.zipvoice.tokens, config_.model.zipvoice.data_dir,
config_.model.zipvoice.pinyin_dict, meta_data, config_.model.debug);
}

⚠️ Potential issue

Don’t call exit() in library code; throw instead

Surface configuration errors as exceptions.

     if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
       SHERPA_ONNX_LOGE(
           "Please provide --zipvoice-pinyin-dict for converting Chinese into "
           "pinyin.");
-      exit(-1);
+      throw std::runtime_error("--zipvoice-pinyin-dict is required by the model");
     }
     if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
       SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
-      exit(-1);
+      throw std::runtime_error("--zipvoice-data-dir is required by the model");
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void InitFrontend() {
const auto &meta_data = model_->GetMetaData();
if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
SHERPA_ONNX_LOGE(
"Please provide --zipvoice-pinyin-dict for converting Chinese into "
"pinyin.");
exit(-1);
}
if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
exit(-1);
}
frontend_ = std::make_unique<OfflineTtsZipvoiceFrontend>(
config_.model.zipvoice.tokens, config_.model.zipvoice.data_dir,
config_.model.zipvoice.pinyin_dict, meta_data, config_.model.debug);
}
void InitFrontend() {
const auto &meta_data = model_->GetMetaData();
if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
SHERPA_ONNX_LOGE(
"Please provide --zipvoice-pinyin-dict for converting Chinese into "
"pinyin.");
throw std::runtime_error("--zipvoice-pinyin-dict is required by the model");
}
if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
throw std::runtime_error("--zipvoice-data-dir is required by the model");
}
frontend_ = std::make_unique<OfflineTtsZipvoiceFrontend>(
config_.model.zipvoice.tokens, config_.model.zipvoice.data_dir,
config_.model.zipvoice.pinyin_dict, meta_data, config_.model.debug);
}
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts-zipvoice-impl.h around lines 101 to 117, replace
the calls to exit(-1) when required config is missing with thrown exceptions
(e.g., throw std::runtime_error or a more specific exception type used in the
project) that include the same descriptive error messages; construct the
exception with the message used in SHERPA_ONNX_LOGE, keep logging the error if
desired, and allow the caller to handle or propagate the exception instead of
terminating the process.

@pkufool pkufool changed the title [WIP] Add Zipvoice Add Zipvoice Aug 27, 2025
@pkufool
Contributor Author

pkufool commented Aug 27, 2025

I tested it on my local machine; the C++ and Python binaries run normally. I will merge it first so that people can go ahead based on this PR. Potential issues will be fixed in follow-up PRs.

@pkufool pkufool merged commit c149696 into k2-fsa:master Aug 27, 2025
29 of 43 checks passed
@pkufool
Contributor Author

pkufool commented Aug 27, 2025

Read the comments at the top of sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc and python-api-examples/offline-zeroshot-tts.py to learn how to run the programs.

@csukuangfj csukuangfj mentioned this pull request Sep 1, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Sep 1, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Sep 24, 2025
@thewh1teagle
Copy link
Copy Markdown
Contributor

thewh1teagle commented Oct 12, 2025

Amazing work done here! Any chance you have a pure Python script that uses ZipVoice?

I tried to vibe-code it without success.

Details
#!/usr/bin/env python3
"""
Minimal ZipVoice ONNX inference example with hardcoded phonemes.
No phonemization - just direct phoneme-to-ID conversion using tokens.txt
"""

import numpy as np
import onnxruntime as ort
import wave
import soundfile as sf
import kaldi_native_fbank as knf


def load_tokens(tokens_file):
    """Load token-to-ID mapping from tokens.txt"""
    token2id = {}
    with open(tokens_file, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) == 2:
                token, idx = parts
                token2id[token] = int(idx)
            elif len(parts) == 1:
                # Single number means space token
                token2id[' '] = int(parts[0])
    return token2id


def phonemes_to_ids(phonemes, token2id):
    """Convert list of phonemes to token IDs"""
    ids = []
    for phoneme in phonemes:
        if phoneme in token2id:
            ids.append(token2id[phoneme])
        else:
            print(f"Warning: Unknown phoneme '{phoneme}', skipping")
    return np.array(ids, dtype=np.int64)


def extract_mel_spectrogram_kaldi(audio, sample_rate, n_mels=100):
    """
    Extract mel spectrogram using kaldi-native-fbank (same as sherpa-onnx uses)
    
    Args:
        audio: Audio samples as numpy array  
        sample_rate: Sample rate of the audio
        n_mels: Number of mel bands
        
    Returns:
        mel_spectrogram: [num_frames, n_mels] mel spectrogram
    """
    # Configure fbank options
    opts = knf.FbankOptions()
    opts.frame_opts.samp_freq = sample_rate
    opts.frame_opts.frame_shift_ms = 256.0 / sample_rate * 1000  # hop_length in ms
    opts.frame_opts.frame_length_ms = 1024.0 / sample_rate * 1000  # n_fft in ms
    opts.frame_opts.dither = 0.0
    opts.frame_opts.preemph_coeff = 0.0
    opts.frame_opts.remove_dc_offset = False
    opts.frame_opts.window_type = "hann"
    opts.frame_opts.snip_edges = True
    
    opts.mel_opts.num_bins = n_mels
    opts.mel_opts.low_freq = 0
    opts.mel_opts.high_freq = sample_rate / 2
    opts.mel_opts.is_librosa = True  # Use librosa-style mel
    
    # Create fbank computer
    fbank = knf.OnlineFbank(opts)
    
    # Accept waveform
    fbank.accept_waveform(sample_rate, audio.tolist())
    fbank.input_finished()
    
    # Get features
    num_frames = fbank.num_frames_ready
    mel_spec = []
    for i in range(num_frames):
        frame = fbank.get_frame(i)
        mel_spec.append(frame)
    
    mel_spec = np.array(mel_spec)
    
    return mel_spec


def run_zipvoice_inference(
    text_model_path,
    flow_matching_model_path,
    vocoder_path,
    tokens_file,
    phonemes,
    prompt_phonemes,
    prompt_audio_path=None,
    output_wav="output.wav",
    speed=1.0,
    num_steps=10,
    feat_scale=0.1,
    t_shift=0.5,
    guidance_scale=1.0,
):
    # Load token mappings
    print("Loading tokens...")
    token2id = load_tokens(tokens_file)
    print(f"Loaded {len(token2id)} tokens")
    
    # Convert phonemes to IDs
    print(f"Converting phonemes to IDs...")
    token_ids = phonemes_to_ids(phonemes, token2id)
    prompt_token_ids = phonemes_to_ids(prompt_phonemes, token2id)
    
    print(f"Text tokens: {token_ids.shape} - {token_ids}")
    print(f"Prompt tokens: {prompt_token_ids.shape} - {prompt_token_ids}")
    
    # Prepare inputs
    tokens_input = token_ids.reshape(1, -1)  # [1, seq_len]
    prompt_tokens_input = prompt_token_ids.reshape(1, -1)  # [1, prompt_len]
    
    # Load or create prompt features (mel spectrogram)
    feat_dim = 100
    
    if prompt_audio_path:
        print(f"Loading prompt audio from: {prompt_audio_path}")
        # Load audio file
        prompt_audio, sr = sf.read(prompt_audio_path)
        
        # Resample if needed (ZipVoice uses 24kHz)
        target_sr = 24000
        if sr != target_sr:
            print(f"Resampling from {sr}Hz to {target_sr}Hz")
            from scipy import signal as sig
            num_samples = int(len(prompt_audio) * target_sr / sr)
            prompt_audio = sig.resample(prompt_audio, num_samples)
        
        # Extract mel spectrogram using kaldi-native-fbank
        mel = extract_mel_spectrogram_kaldi(prompt_audio, target_sr, n_mels=feat_dim)
        
        # Scale by feat_scale
        mel = mel * feat_scale
        
        # Add batch dimension [1, time, mels]
        prompt_features = mel[np.newaxis, :, :].astype(np.float32)
        prompt_num_frames = mel.shape[0]
        
        print(f"Extracted mel spectrogram: {prompt_features.shape}")
    else:
        # Use dummy features if no prompt audio provided
        print("No prompt audio provided, using random noise")
        prompt_num_frames = max(5, len(prompt_phonemes) // 2)
        prompt_features = np.random.randn(1, prompt_num_frames, feat_dim).astype(np.float32) * feat_scale
    
    prompt_feat_len = np.array([prompt_num_frames], dtype=np.int64)
    speed_input = np.array([speed], dtype=np.float32)
    
    # Load and run text encoder
    print("\nRunning text encoder...")
    text_sess = ort.InferenceSession(text_model_path)
    text_outputs = text_sess.run(
        None,
        {
            "tokens": tokens_input,
            "prompt_tokens": prompt_tokens_input,
            "prompt_features_len": prompt_feat_len,
            "speed": speed_input,
        }
    )
    text_condition = text_outputs[0]
    print(f"Text condition shape: {text_condition.shape}")
    
    # Flow matching inference
    print(f"\nRunning flow matching model ({num_steps} steps)...")
    batch_size = 1
    num_frames = text_condition.shape[1]
    
    # Initialize x with random noise
    x = np.random.randn(batch_size, num_frames, feat_dim).astype(np.float32)
    
    # Initialize speech condition (zero-padded prompt features)
    speech_condition = np.zeros((batch_size, num_frames, feat_dim), dtype=np.float32)
    speech_condition[:, :prompt_num_frames, :] = prompt_features
    
    # Compute timesteps
    timesteps = []
    for i in range(num_steps + 1):
        t = i / num_steps
        timesteps.append(t_shift * t / (1.0 + (t_shift - 1.0) * t))
    
    guidance_scale_input = np.array([guidance_scale], dtype=np.float32)
    
    # Load flow matching model
    fm_sess = ort.InferenceSession(flow_matching_model_path)
    
    # Iterative refinement
    for step in range(num_steps):
        t = np.array([timesteps[step]], dtype=np.float32)
        
        fm_outputs = fm_sess.run(
            None,
            {
                "t": t,
                "x": x,
                "text_condition": text_condition,
                "speech_condition": speech_condition,
                "guidance_scale": guidance_scale_input,
            }
        )
        v = fm_outputs[0]
        
        delta_t = timesteps[step + 1] - timesteps[step]
        x = x + v * delta_t
        
        if (step + 1) % 5 == 0:
            print(f"  Step {step + 1}/{num_steps}")
    
    # Remove prompt frames
    mel = x[:, prompt_num_frames:, :]
    print(f"Generated mel shape: {mel.shape}")
    
    # Transpose mel: [batch, time, feat] -> [batch, feat, time]
    mel = np.transpose(mel, (0, 2, 1))
    
    # Scale back
    mel = mel / feat_scale
    
    # Run vocoder
    print("\nRunning vocoder...")
    vocoder_sess = ort.InferenceSession(vocoder_path)
    audio = vocoder_sess.run(None, {"mels": mel})[0]
    
    # Get audio array
    audio = audio.flatten()
    print(f"Generated audio shape: {audio.shape}")
    
    # Save to WAV file
    sample_rate = 24000  # ZipVoice default
    audio_int16 = (audio * 32767).astype(np.int16)
    
    with wave.open(output_wav, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(audio_int16.tobytes())
    
    print(f"\nAudio saved to: {output_wav}")
    print(f"Duration: {len(audio) / sample_rate:.2f}s")


if __name__ == "__main__":
    # Example usage with hardcoded phonemes
    # Using simple phonemes from the ZipVoice tokens.txt
    
    # Hardcoded phonemes for "hello" (using simple tokens from tokens.txt)
    # These are basic IPA phonemes: h-e-l-o
    text_phonemes = ['h', 'e', 'l', 'o', ' ', 'w', 'ɜː', 'l', 'd']
    
    # Prompt phonemes (shorter reference)
    prompt_phonemes = ['h', 'a', 'i']
    
    # Model paths
    MODEL_DIR = "sherpa-onnx-zipvoice-distill-zh-en-emilia"
    TEXT_MODEL = f"{MODEL_DIR}/text_encoder.onnx"
    FLOW_MATCHING_MODEL = f"{MODEL_DIR}/fm_decoder.onnx"
    VOCODER = f"{MODEL_DIR}/vocos_24khz.onnx"
    TOKENS_FILE = f"{MODEL_DIR}/tokens.txt"
    
    print("="*60)
    print("ZipVoice Minimal Inference Example")
    print("="*60)
    print(f"\nText phonemes: {text_phonemes}")
    print(f"Prompt phonemes: {prompt_phonemes}")
    
    # Use the actual prompt audio file that came with the model
    PROMPT_AUDIO = f"{MODEL_DIR}/prompt.wav"
    
    run_zipvoice_inference(
        text_model_path=TEXT_MODEL,
        flow_matching_model_path=FLOW_MATCHING_MODEL,
        vocoder_path=VOCODER,
        tokens_file=TOKENS_FILE,
        phonemes=text_phonemes,
        prompt_phonemes=prompt_phonemes,
        prompt_audio_path=PROMPT_AUDIO,
        output_wav="output.wav",
        speed=1.0,
        num_steps=10,
    )
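The t_shift timestep schedule used in the script above can be sanity-checked in isolation; with t_shift < 1 it keeps the endpoints at 0 and 1 but biases the intermediate steps toward t = 0:

```python
def shifted_timesteps(num_steps, t_shift):
    # Same formula as the loop above: t' = t_shift * t / (1 + (t_shift - 1) * t)
    return [
        t_shift * (i / num_steps) / (1.0 + (t_shift - 1.0) * (i / num_steps))
        for i in range(num_steps + 1)
    ]

ts = shifted_timesteps(10, 0.5)
assert abs(ts[0]) < 1e-12 and abs(ts[-1] - 1.0) < 1e-12   # endpoints preserved
assert all(a < b for a, b in zip(ts, ts[1:]))             # strictly increasing
assert ts[5] < 0.5                                        # midpoint shifted toward 0
assert shifted_timesteps(4, 1.0) == [0.0, 0.25, 0.5, 0.75, 1.0]  # t_shift=1 is identity
```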

@pkufool
Copy link
Copy Markdown
Contributor Author

pkufool commented Oct 28, 2025

Amazing work done here! any chance you have a pure Python script that uses ZipVoice?

Please visit https://github.com/k2-fsa/ZipVoice

@UncleLLD

sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/vocos_24khz.onnx
Each line in vocab should contain at lease three items (seperate by space), the first one is Chinese word/character, the second one is score, the others are the corresponding pinyins given : 一 y0 i1

@csukuangfj
Collaborator

sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/vocos_24khz.onnx Each line in vocab should contain at lease three items (seperate by space), the first one is Chinese word/character, the second one is score, the others are the corresponding pinyins given : 一 y0 i1

Can you please tell us about the issue you're experiencing? @UncleLLD

@UncleLLD

sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/vocos_24khz.onnx Each line in vocab should contain at lease three items (seperate by space), the first one is Chinese word/character, the second one is score, the others are the corresponding pinyins given : 一 y0 i1

Can you please tell us about the issue you're experiencing? @UncleLLD

After I updated sherpa-onnx, the problem was gone, thanks.
Saved to ./generated.wav
The text is '小米的价值观是真诚, 热爱. 真诚,就是不欺人也不自欺. 热爱, 就是全心投入并享受其中.'
Elapsed seconds: 6.937
Audio duration in seconds: 7.861
RTF: 6.937/7.861 = 0.882


Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

6 participants