Add ZipVoice #2487
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds zero-shot ZipVoice TTS: a new ZipVoice frontend, a dual-ONNX ZipVoice model and impl, vocoder selection, ZipVoice config/metadata and bindings, CLI and Python examples, text utilities, tests, and a CMake module to fetch cppinyin.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant User as User (CLI/Python)
participant API as OfflineTts
participant Impl as OfflineTtsZipvoiceImpl
participant FE as ZipVoice Frontend
participant ZM as ZipVoice Model (text + flow)
participant VOC as Vocoder
participant IO as File I/O
User->>API: Generate(text, prompt_text, prompt_samples, sr, speed, num_steps, cb)
API->>Impl: forward call
Impl->>FE: ConvertTextToTokenIds(text, voice)
FE->>IO: load tokens/pinyin/espeak data (once)
FE-->>Impl: tokens, prompt_tokens
Impl->>Impl: ComputeMelSpectrogram(prompt_samples, sr)
Impl->>ZM: Run(tokens, prompt_tokens, prompt_features, speed, num_steps)
ZM-->>Impl: mel frames
Impl->>VOC: Synthesize(mel)
VOC-->>Impl: waveform
Impl-->>API: GeneratedAudio
API-->>User: WAV + metadata
rect rgba(200,255,200,0.12)
note over FE,ZM: New ZipVoice zero-shot flow (pinyin/espeak + dual-model loop)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pull Request Overview
This PR adds support for ZipVoice zero-shot text-to-speech (TTS) models to sherpa-onnx. ZipVoice is a flow-matching-based TTS model that can generate speech with voice cloning capabilities using a prompt audio sample and its transcription.
Key changes include:
- Implementation of ZipVoice model architecture with text encoder and flow-matching decoder
- New ZipVoice frontend for text processing with support for Chinese pinyin and multilingual text
- Addition of zero-shot TTS interface that accepts prompt audio and text for voice cloning
Reviewed Changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sherpa-onnx/python/sherpa_onnx/__init__.py | Add ZipVoice model config import |
| sherpa-onnx/python/csrc/offline-tts.cc | Add Python bindings for zero-shot TTS generation |
| sherpa-onnx/csrc/offline-tts-zipvoice-* | Core ZipVoice model implementation and configuration |
| sherpa-onnx/csrc/text-utils.* | Add UTF-8/UTF-32 conversion and CJK text detection utilities |
| sherpa-onnx/csrc/offline-tts.* | Add zero-shot TTS interface to main TTS class |
| sherpa-onnx/csrc/vocoder.cc | Update vocoder to support ZipVoice models |
| python-api-examples/offline-zeroshot-tts.py | Python example for zero-shot TTS usage |
| cmake/cppinyin.cmake | Add cppinyin dependency for Chinese text processing |
bool StringToBool(const std::string &s) {
  std::string lower = s;
  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
Creating a copy of the string for case conversion is inefficient. Consider using std::transform directly on a pre-allocated string or using the existing ToLowerAscii function.
Suggested change:
-  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
+  std::string lower = ToLowerAscii(s);
}
if (debug_) {
  debug_oss << "Tokens and IDs: \n";
  for (int32_t i; i < tokens.size(); i++) {
Variable 'i' is declared but not initialized in the for loop. This should be 'int32_t i = 0'.
Suggested change:
-  for (int32_t i; i < tokens.size(); i++) {
+  for (int32_t i = 0; i < tokens.size(); i++) {
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
  SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
Using SHERPA_ONNX_LOGE for informational messages about vocoder selection is misleading. These should use SHERPA_ONNX_LOGI or similar for info-level logging.
Suggested change:
-  SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+  SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
   buffer = ReadFile(config.matcha.vocoder);
 } else if (!config.zipvoice.vocoder.empty()) {
-  SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
+  SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
buffer = ReadFile(mgr, config.matcha.vocoder);
} else if (!config.zipvoice.vocoder.empty()) {
  SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
Using SHERPA_ONNX_LOGE for informational messages about vocoder selection is misleading. These should use SHERPA_ONNX_LOGI or similar for info-level logging.
Suggested change:
-  SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+  SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
   buffer = ReadFile(mgr, config.matcha.vocoder);
 } else if (!config.zipvoice.vocoder.empty()) {
-  SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
+  SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
    "Path to zipvoice flow-matching model");
po->Register("zipvoice-vocoder", &vocoder, "Path to zipvoice vocoder");
po->Register("zipvoice-num-steps", &num_steps,
             "Number of inference steps for ZipVoice (default: 16)");
The default value in the constructor is 16, but in the header file the default is also 16, yet in some other files it's referenced as 4. This inconsistency could cause confusion.
Suggested change:
-             "Number of inference steps for ZipVoice (default: 16)");
+             "Number of inference steps for ZipVoice (default: 4)");
  oss << parts[i];
  t_lang = types[i];
} else {
  if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
The logical condition (types[i] != "tag" || types[i] != "pinyin") will always be true since a type cannot be both "tag" and "pinyin" simultaneously. This should likely be && instead of ||.
Suggested change:
-  if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
+  if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) {
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cmake/cppinyin.cmake (1)

74-75: Avoid auto-invoking download_cppinyin() in the module

Confirmed that v0.10 of cppinyin defines the cppinyin_core target, so removing the in-module download is safe.

• cmake/cppinyin.cmake (around lines 74-75): remove the unconditional call to download_cppinyin()
• Top-level CMakeLists.txt: invoke download_cppinyin() only when SHERPA_ONNX_ENABLE_TTS is ON

Suggested diff:

--- a/cmake/cppinyin.cmake
+++ b/cmake/cppinyin.cmake
@@ 74,75
- download_cppinyin()

sherpa-onnx/python/csrc/offline-tts.cc (1)
68-81: Harden the Python-callback bridge against exceptions

If the Python callback raises, the exception will propagate across the C++ callback boundary and can crash the synthesis pipeline unpredictably. Wrap the call in try/catch and return 0 on failure to signal "stop," and optionally log the error.
Apply this diff inside callback_wrapper:
- return callback(array, progress);
+ try {
+   return callback(array, progress);
+ } catch (const py::error_already_set &e) {
+   // Log and request early stop
+   py::print("offline_tts.generate callback raised:", e.what());
+   return 0;
+ }
🧹 Nitpick comments (41)
sherpa-onnx/csrc/text-utils-test.cc (4)
85-89: Deduplicate redundant tests for the same 0x20 0xC4 case

There are 4 near-identical tests asserting that a space followed by 0xC4 results in a single space. This bloats test runtime and maintenance surface without adding coverage.
You can consolidate these into a single test or a small parameterized test:
-TEST(RemoveInvalidUtf8Sequences, BreakingCase_SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space followed by an invalid byte
-  std::string expected = " ";      // 0xC4 removed
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
-...
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte_Breaking) {
-  std::string input = "\x20\xc4";  // Space followed by invalid `0xc4`
-  std::string expected = " ";      // `0xc4` should be removed, space remains
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
+TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidC4) {
+  for (std::string input :
+       {std::string("\x20\xC4", 2), std::string("\x20\xc4", 2)}) {
+    EXPECT_EQ(RemoveInvalidUtf8Sequences(input), " ");
+  }
+}

Also applies to: 102-106, 108-112, 114-118
133-143: Turn SplitUtf8(SplitZhAndEn) into an assertion-based test (avoid stdout-only tests)

This test only prints tokens and never asserts, so it cannot catch regressions.
Minimal conversion into a real test:
 TEST(SplitUtf8, SplitZhAndEn) {
   std::string text = "Hello, 世界! 123. 你好<ha3>, 你好<ha4>! world is beautiful.";
   auto words = SplitUtf8(text);
-  for (auto &word : words) {
-    std::cout << word << " ## ";
-  }
-  std::cout << "\n";
+  // Basic sanity checks without overspecifying tokenization
+  ASSERT_FALSE(words.empty());
+  // Should contain both English and CJK chunks
+  bool has_hello = std::find(words.begin(), words.end(), "Hello") != words.end();
+  bool has_world = std::find(words.begin(), words.end(), "world") != words.end();
+  bool has_cjk = std::any_of(words.begin(), words.end(),
+                             [](const std::string &w) { return ContainsCJK(w); });
+  EXPECT_TRUE(has_hello || has_world);
+  EXPECT_TRUE(has_cjk);
 }

Note: Add #include <algorithm> at the top of this test file.
165-171: Cast to unsigned char before calling std::isalpha to avoid UB

std::isalpha is undefined for negative char values. Even though you constrain to word.size() == 1, defensive casting is best practice.

-  if (word.size() == 1 && std::isalpha(word[0])) {
+  if (word.size() == 1 &&
+      std::isalpha(static_cast<unsigned char>(word[0]))) {
181-215: Make SplitUtf8(SplitRegex) assertive and avoid console output in tests

This test prints intermediate tokens and "Sentence: …" lines but never asserts on expected segmentation or language tags. It will pass regardless of correctness.
- Add assertions for a few key expectations (e.g., that / are tagged as “pinyin”, bracketed [S1]/[S2] as “tag”, and that at least one “zh” and “en” segment exist).
- Consider extracting the segmentation into a helper and adding focused unit tests for:
- correct grouping across “other”
- transitions across “tag”/“pinyin”
- ContainsCJK-driven classification
If helpful, I can propose a concrete assertion set tailored to the current tokenizer behavior.
sherpa-onnx/csrc/text-utils.cc (2)
727-735: codecvt/wstring_convert is deprecated; consider a future-proof conversion path

Both Utf8ToUtf32/Utf32ToUtf8 rely on std::wstring_convert<std::codecvt_utf8<...>>, deprecated since C++17 and removed in some C++20+ libstdc++/libc++ builds. The file already uses codecvt elsewhere, but it's worth isolating this behind a helper/adapter to ease future replacement.
- Introduce a small utf conversion shim (single header) and centralize all conversions there.
- Consider alternatives: ICU, Boost.Locale, or a tiny embedded UTF-8 library (e.g., utf8cpp) for UTF-8⇄UTF-32.
- At minimum, add compile-time deprecation suppression guards and a TODO pointing to a planned migration.
777-790: Tighten StringToBool parsing; avoid accepting partial numeric tokens (e.g., "3abc")

std::istringstream will parse an initial integer prefix and ignore trailing junk, causing "3abc" to return true. Leading/trailing whitespace and synonyms ("yes/no", "on/off", "y/n") might also be expected.

Consider:
- Trim whitespace before checks.
- Accept common synonyms.
- Require full numeric parse (no trailing characters). Example fix:
 bool StringToBool(const std::string &s) {
   std::string lower = s;
   std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
   if (lower == "true" || lower == "1") return true;
   if (lower == "false" || lower == "0") return false;
-  // Try istringstream for numbers
-  std::istringstream is(lower);
-  int32_t num;
-  if (is >> num) return num != 0;
+  if (lower == "yes" || lower == "y" || lower == "on") return true;
+  if (lower == "no" || lower == "n" || lower == "off") return false;
+
+  // Strict numeric parse: no trailing characters
+  std::istringstream is(lower);
+  is >> std::ws;
+  int32_t num;
+  if (is >> num) {
+    is >> std::ws;
+    if (is.eof()) return num != 0;
+  }
   return false;  // default if not matched
 }

sherpa-onnx/csrc/text-utils.h (1)
152-174: Public API additions look good; consider documenting the deprecation caveat

The new declarations are coherent and match their definitions. Since the implementations use wstring_convert/codecvt, consider a brief note in the header (or docs) that these functions may change implementation in the future to move away from deprecated codecvt.

python-api-examples/offline-tts.py (1)
456-458: Route error output to stderr

Tiny polish: since this is an error path, print to stderr so it doesn't get mixed with normal output.
Apply this diff:
@@
-        print(
-            "Error in generating audios. Please read previous error messages."
-        )
+        print(
+            "Error in generating audios. Please read previous error messages.",
+            file=sys.stderr,
+        )

And add the import once near the other imports:

@@
-import argparse
+import argparse
+import sys

cmake/cppinyin.cmake (5)
10-26: Offline tarball probing — add env/cache overrides; quote paths

Allow users/CI to specify the tarball via an environment or cache variable, and keep quoting consistent.
@@
-  set(possible_file_locations
+  # Allow overrides
+  if(DEFINED ENV{CPPINYIN_TARBALL})
+    list(INSERT possible_file_locations 0 $ENV{CPPINYIN_TARBALL})
+  endif()
+  if(CPPINYIN_TARBALL)
+    list(INSERT possible_file_locations 0 ${CPPINYIN_TARBALL})
+  endif()
+
+  set(possible_file_locations
     $ENV{HOME}/Downloads/cppinyin-0.10.tar.gz
     ${CMAKE_SOURCE_DIR}/cppinyin-0.10.tar.gz
     ${CMAKE_BINARY_DIR}/cppinyin-0.10.tar.gz
     /tmp/cppinyin-0.10.tar.gz
     /star-fj/fangjun/download/github/cppinyin-0.10.tar.gz
   )
@@
-    if(EXISTS ${f})
-      set(cppinyin_URL "${f}")
-      file(TO_CMAKE_PATH "${cppinyin_URL}" cppinyin_URL)
+    if(EXISTS "${f}")
+      set(cppinyin_URL "${f}")
+      file(TO_CMAKE_PATH "${cppinyin_URL}" cppinyin_URL)
       message(STATUS "Found local downloaded cppinyin: ${cppinyin_URL}")
       set(cppinyin_URL2)
       break()
     endif()
   endforeach()
31-38: FetchContent_Declare: guard the optional secondary URL

Passing an empty string as a URL argument is harmless in many cases but fragile across CMake versions. Guard it.

-  FetchContent_Declare(cppinyin
-    URL
-    ${cppinyin_URL}
-    ${cppinyin_URL2}
-    URL_HASH
-    ${cppinyin_HASH}
-  )
+  if(cppinyin_URL2)
+    FetchContent_Declare(cppinyin
+      URL ${cppinyin_URL} ${cppinyin_URL2}
+      URL_HASH ${cppinyin_HASH})
+  else()
+    FetchContent_Declare(cppinyin
+      URL ${cppinyin_URL}
+      URL_HASH ${cppinyin_HASH})
+  endif()
46-61: Temporarily toggling BUILD_SHARED_LIBS — acceptable; add a safety restore

This pattern is OK, but make the restore unconditional to avoid surprises if control flow changes later.

-  if(_build_shared_libs_bak)
-    set_target_properties(cppinyin_core
-      PROPERTIES
-      POSITION_INDEPENDENT_CODE ON
-      C_VISIBILITY_PRESET hidden
-      CXX_VISIBILITY_PRESET hidden
-    )
-    set(BUILD_SHARED_LIBS ON)
-  endif()
+  if(_build_shared_libs_bak)
+    set_target_properties(cppinyin_core
+      PROPERTIES
+      POSITION_INDEPENDENT_CODE ON
+      C_VISIBILITY_PRESET hidden
+      CXX_VISIBILITY_PRESET hidden)
+  endif()
+  # Always restore to original value
+  if(DEFINED _build_shared_libs_bak)
+    set(BUILD_SHARED_LIBS ${_build_shared_libs_bak})
+    unset(_build_shared_libs_bak)
+  endif()
63-70: Target interface includes — prefer INTERFACE for consumer-only exposure

Since we don't build sources here, exposing the includes as INTERFACE makes the intent clearer.

-  target_include_directories(cppinyin_core
-    PUBLIC
-    ${cppinyin_SOURCE_DIR}/
-  )
+  target_include_directories(cppinyin_core
+    INTERFACE
+    ${cppinyin_SOURCE_DIR}/)
4-7: cppinyin v0.10 checksum verified; consider optional override support
- SHA256 for the v0.10 tarball (abe6584…19982) has been confirmed to match the release archive.
- Optional: add a CMake cache variable so downstream users can supply their own mirror, path, or version without patching the script.
Suggested diff to introduce a CPPINYIN_TARBALL override:

 function(download_cppinyin)
   include(FetchContent)
-  set(cppinyin_URL "https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
-  set(cppinyin_URL2 "https://gh-proxy.com/https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
+  set(CPPINYIN_TARBALL "" CACHE PATH
+      "Optional path or URL to a pre-downloaded cppinyin v0.10 tarball")
+  set(cppinyin_URL "https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
+  set(cppinyin_URL2 "https://gh-proxy.com/https://github.com/pkufool/cppinyin/archive/refs/tags/v0.10.tar.gz")
   set(cppinyin_HASH "SHA256=abe6584d7ee56829e8f4b5fbda3b50ecdf49a13be8e413a78d1b0d5d5c019982")

CMakeLists.txt (1)
369-369: Gate the cppinyin fetch behind SHERPA_ONNX_ENABLE_TTS and call the function explicitly

Including cppinyin unconditionally triggers a fetch even for ASR-only builds. Move it into the TTS block and call download_cppinyin() there (and remove the self-call inside the module, per its review).
Patch:
@@
-include(cppinyin)
 include(simple-sentencepiece)
@@
 if(SHERPA_ONNX_ENABLE_TTS)
   message(STATUS "TTS is enabled")
   add_definitions(-DSHERPA_ONNX_ENABLE_TTS=1)
+  include(cppinyin)
+  download_cppinyin()
 else()

sherpa-onnx/python/sherpa_onnx/__init__.py (1)
52-52: Re-export is fine; silence Ruff F401 for __init__, or define __all__

The new symbol is intentionally re-exported from the extension module, which triggers Ruff F401 in this file. Either add a per-file directive or define __all__ to make the intent explicit.
Apply one of the following minimal changes (preferred: directive):
Option A — add a per-file Ruff directive at the top of this file:
+# ruff: noqa: F401  # re-exports from extension module are public API
 from _sherpa_onnx import (

Option B — explicitly export via __all__ (trim the list as needed):

+__all__ = [
+    # ... keep existing, add new:
+    "OfflineTtsZipvoiceModelConfig",
+]

sherpa-onnx/csrc/offline-tts-model-config.cc (1)
42-44: Broaden the ZipVoice selection sentinel to catch partial configs

Using only flow_matching_model as the sentinel may miss cases where text_model is set but flow_matching_model is accidentally omitted, leading to a generic "provide exactly one tts model" error instead of zipvoice.Validate()'s targeted diagnostics. Recommend triggering ZipVoice validation if either core path is set.
Apply this diff:
-  if (!zipvoice.flow_matching_model.empty()) {
+  if (!zipvoice.flow_matching_model.empty() || !zipvoice.text_model.empty()) {
     return zipvoice.Validate();
   }

If docs/CLI examples instruct users to set text_model first, this change prevents confusing errors. Please verify other places in the codebase (parsers, examples) align with this sentinel.
sherpa-onnx/csrc/vocos-vocoder.cc (4)
45-55: Use SHERPA_ONNX_EXIT and guard against empty buffer before Init.
- Prefer SHERPA_ONNX_EXIT for consistency with other error paths and platform-specific logging.
- If ReadFile fails and returns an empty buffer, Init with length 0 will produce a less-informative runtime error. Fail fast with a clear message.
Apply this diff:
   std::vector<char> buffer;
   if (!config.matcha.vocoder.empty()) {
     buffer = ReadFile(config.matcha.vocoder);
   } else if (!config.zipvoice.vocoder.empty()) {
     buffer = ReadFile(config.zipvoice.vocoder);
   } else {
-    SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-    exit(-1);
+    SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
+    SHERPA_ONNX_EXIT(-1);
   }
-  Init(buffer.data(), buffer.size());
+  if (buffer.empty()) {
+    SHERPA_ONNX_LOGE("Vocoder file read returned empty buffer.");
+    SHERPA_ONNX_EXIT(-1);
+  }
+  Init(buffer.data(), buffer.size());
63-73: Mirror the same error-handling improvements in the Manager-based ctor

Keep both constructors in sync for consistent behavior across platforms.
Apply this diff:
   std::vector<char> buffer;
   if (!config.matcha.vocoder.empty()) {
     buffer = ReadFile(mgr, config.matcha.vocoder);
   } else if (!config.zipvoice.vocoder.empty()) {
     buffer = ReadFile(mgr, config.zipvoice.vocoder);
   } else {
-    SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-    exit(-1);
+    SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
+    SHERPA_ONNX_EXIT(-1);
   }
-  Init(buffer.data(), buffer.size());
+  if (buffer.empty()) {
+    SHERPA_ONNX_LOGE("Vocoder file read returned empty buffer.");
+    SHERPA_ONNX_EXIT(-1);
+  }
+  Init(buffer.data(), buffer.size());
39-45: Minor: remove the unused allocator_ member (and its init) to reduce noise

allocator_ is never referenced; there is a local allocator in Init(). Safe to drop.
Apply these diffs:
Constructor initializer lists:
-      sess_opts_(GetSessionOptions(config.num_threads, config.provider)),
-      allocator_{} {
+      sess_opts_(GetSessionOptions(config.num_threads, config.provider)) {

-      sess_opts_(GetSessionOptions(config.num_threads, config.provider)),
-      allocator_{} {
+      sess_opts_(GetSessionOptions(config.num_threads, config.provider)) {

Private member:
- Ort::AllocatorWithDefaultOptions allocator_;
45-55: Avoid duplicated selection logic (optional)

Both ctors duplicate model selection. Consider extracting a small helper (or lambda) to choose and read the vocoder once, reducing drift risk.
Also applies to: 63-73
sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (3)
8-10: Remove unused include to keep headers lean

std::string is not used in this header. Dropping it reduces compile surface area.

-#include <cstdint>
-#include <string>
+#include <cstdint>

13-15: Tighten up the comment wording

Minor grammar polish for clarity.

-// If you are not sure what each field means, please
-//
-// have a look of the Python file in the model directory that
-//
-// you have downloaded.
+// If you are not sure what each field means, please
+// look at the Python file in the model directory that
+// you have downloaded.

24-26: Use booleans for boolean flags

use_espeak and use_pinyin are consumed as boolean toggles elsewhere. Using bool communicates intent and prevents accidental non-0/1 assignments.

Please confirm no ABI/serialization expectations require these to be int32_t.

- int32_t use_espeak = 1;
- int32_t use_pinyin = 1;
+ bool use_espeak = true;
+ bool use_pinyin = true;

As a follow-up, usages like if (meta_data_.use_pinyin) and designated initializers in tests already align with bool.

sherpa-onnx/csrc/offline-tts.cc (1)
199-237: More robust conversion fallback and log scoping (optional)

If GB2312 conversion fails (returns empty), the code will forward an empty string. Consider guarding conversions and scoping logs per-argument.
- Only replace with the converted string if conversion produced non-empty.
- Keep separate "printed" flags for text and prompt_text, or log unconditionally at debug level.

sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (3)
18-22: Use GTEST_SKIP() instead of return for missing test assets

Returning early marks the test as passed, hiding skipped status in CI. Prefer GTEST_SKIP() to make the intent explicit and visible.

-  if (!FileExists(data_dir + "/en_dict")) {
-    SHERPA_ONNX_LOGE("%s/en_dict does not exist. Skipping test",
-                     data_dir.c_str());
-    return;
-  }
+  if (!FileExists(data_dir + "/en_dict")) {
+    SHERPA_ONNX_LOGE("%s/en_dict does not exist. Skipping test",
+                     data_dir.c_str());
+    GTEST_SKIP();
+  }

Apply the same return → GTEST_SKIP() change to the phontab, phonindex, phondata, and intonations checks.

Also applies to: 24-28, 30-34, 36-40, 42-46
48-58: Skip with GTEST_SKIP() when dictionary/token files are absent

Same rationale as above for pinyin_dict and tokens_file.

-  if (!FileExists(pinyin_dict)) {
-    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", pinyin_dict.c_str());
-    return;
-  }
+  if (!FileExists(pinyin_dict)) {
+    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", pinyin_dict.c_str());
+    GTEST_SKIP();
+  }
   ...
-  if (!FileExists(tokens_file)) {
-    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", tokens_file.c_str());
-    return;
-  }
+  if (!FileExists(tokens_file)) {
+    SHERPA_ONNX_LOGE("%s does not exist. Skipping test", tokens_file.c_str());
+    GTEST_SKIP();
+  }
65-76: Add minimal assertions so the test validates behavior

Currently the test only exercises code. Add basic sanity checks to ensure tokenization produced content.

   std::string text = "how are you doing?";
   std::vector<sherpa_onnx::TokenIDs> ans =
       frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());
   text = "这是第一句。这是第二句。";
   ans = frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());
   text = "这是第一句。这是第二句。<pin1><yin2>测试 [S1]and hello "
          "world[S2]这是第三句。";
   ans = frontend.ConvertTextToTokenIds(text, "en-us");
+  EXPECT_FALSE(ans.empty());
+  EXPECT_FALSE(ans[0].tokens.empty());

sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)
62-66: Clarify flag descriptions

Small wording improvements for clarity and accuracy.

-  po.Register("prompt-text", &prompt_text, "The transcribe of prompt_samples.");
+  po.Register("prompt-text", &prompt_text,
+              "The transcription of the prompt audio.");
   po.Register("prompt-audio", &prompt_audio,
-              "The prompt audio file, single channel pcm. ");
+              "Path to the prompt audio file (mono, 16-bit PCM WAV).");

sherpa-onnx/csrc/offline-tts-impl.cc (1)
45-48: Strengthen the ZipVoice selection guard (also require tokens)

Create() selects ZipVoice when text_model and flow_matching_model are set. If tokens is missing, frontend initialization will later fail. Either rely on Validate() being called before Create(), or add a lightweight guard on tokens here to fail fast.
Apply this diff to both overloads:
-} else if (!config.model.zipvoice.text_model.empty() &&
-           !config.model.zipvoice.flow_matching_model.empty()) {
+} else if (!config.model.zipvoice.text_model.empty() &&
+           !config.model.zipvoice.flow_matching_model.empty() &&
+           !config.model.zipvoice.tokens.empty()) {
   return std::make_unique<OfflineTtsZipvoiceImpl>(config);

-} else if (!config.model.zipvoice.text_model.empty() &&
-           !config.model.zipvoice.flow_matching_model.empty()) {
+} else if (!config.model.zipvoice.text_model.empty() &&
+           !config.model.zipvoice.flow_matching_model.empty() &&
+           !config.model.zipvoice.tokens.empty()) {
   return std::make_unique<OfflineTtsZipvoiceImpl>(mgr, config);
Also applies to: 67-69
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)
26-31: Fix the tensor shape doc to match the implementation

Impl returns a tensor shaped {batch_size, keep_frames, feat_dim}, where feat_dim == meta_data_.feat_dim. The comment currently states (batch_size, mel_dim, num_frames), which is inverted and uses different naming.
Apply this diff to correct the documentation:
-  // Return a float32 tensor containing the mel
-  // of shape (batch_size, mel_dim, num_frames)
+  // Return a float32 mel-spectrogram-like tensor
+  // of shape (batch_size, num_frames, feat_dim).
+  // Note: feat_dim == meta_data_.feat_dim; num_frames excludes the prompt frames.

sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1)
49-57: Consider making punct_map_ static to avoid per-instance duplication

punct_map_ is immutable. Making it static constexpr (or function-static) avoids re-creating the map per instance and reduces binary size.
Example:
-  const std::unordered_map<std::string, std::string> punct_map_ = {
+  static const std::unordered_map<std::string, std::string> punct_map_ = {
       {",", ","}, {"。", "."}, {"!", "!"}, {"?", "?"}, {";", ";"},
       {":", ":"}, {"、", ","}, {"‘", "'"}, {"“", "\""}, {"”", "\""},
       {"’", "'"}, {"⋯", "…"}, {"···", "…"}, {"・・・", "…"}, {"...", "…"}};

sherpa-onnx/python/csrc/offline-tts.cc (1)
89-122: New overload looks solid; consider clarifying/optimizing argument passing
- API shape and defaults (speed=1.0, num_steps=4) match the C++ side. The GIL is released correctly.
- Minor: passing prompt_samples as std::vector causes an extra copy from NumPy. If you want to micro-opt, accept py::array_t and materialize a std::vector only once just before calling into C++ (optional).
Example alternative signature (for future consideration):
- const std::vector<float> &prompt_samples, int32_t sample_rate,
+ const py::array_t<float> &prompt_samples, int32_t sample_rate,
python-api-examples/offline-zeroshot-tts.py (3)
178-182: Use store_true for boolean flags

Argparse with type=bool is brittle (e.g., --debug false still parses as True). Prefer action="store_true".
-    parser.add_argument(
-        "--debug",
-        type=bool,
-        default=False,
-        help="True to show debug messages",
-    )
+    parser.add_argument(
+        "--debug",
+        action="store_true",
+        help="Show debug messages",
+    )
121-143: Prefer explicit errors over asserts in read_wave()

Asserts can be stripped and yield confusing failures. Raise ValueError with clear messages so users understand the channel/bit-depth requirements.
-    with wave.open(wave_filename) as f:
-        assert f.getnchannels() == 1, f.getnchannels()
-        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
+    with wave.open(wave_filename) as f:
+        if f.getnchannels() != 1:
+            raise ValueError(f"Expected mono WAV. Got {f.getnchannels()} channels")
+        if f.getsampwidth() != 2:
+            raise ValueError(f"Expected 16-bit PCM WAV. Got sample width={f.getsampwidth()} bytes")
257-266: Optional: demonstrate progress callback usage

Since the Python binding supports a progress callback, add an example toggle in the script to showcase streaming writes or early-stop behavior. This keeps the example self-explanatory for advanced users.
Happy to provide a small snippet if you’d like.
sherpa-onnx/csrc/offline-tts-impl.h (1)
12-12: Include <stdexcept> if throwing exceptions from default methods

If you choose the throwing approach above, add the header.
 #include <vector>

-#include "sherpa-onnx/csrc/macros.h"
+#include "sherpa-onnx/csrc/macros.h"
+#include <stdexcept>

sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1)
83-98: Consider extracting required espeak-ng files to a constant

The list of required espeak-ng files is hardcoded within the validation method. For better maintainability and potential reuse, consider extracting this to a static constant or a separate function.
+  static const std::vector<std::string> kRequiredEspeakFiles = {
+      "phontab",
+      "phonindex",
+      "phondata",
+      "intonations",
+  };
+
   if (!data_dir.empty()) {
-    std::vector<std::string> required_files = {
-        "phontab",
-        "phonindex",
-        "phondata",
-        "intonations",
-    };
-    for (const auto &f : required_files) {
+    for (const auto &f : kRequiredEspeakFiles) {
       if (!FileExists(data_dir + "/" + f)) {

sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1)
21-23: Consider clarifying the comment about lexicon behavior.The comment states "If data_dir is given, lexicon is ignored" but there's no mention of a lexicon field in this struct. This might be confusing for users.
- // If data_dir is given, lexicon is ignored
- // data_dir is for piper-phonemize, which uses espeak-ng
+ // data_dir is for piper-phonemize, which uses espeak-ng
+ // When provided, espeak-ng will be used for phonemization

sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1)
121-121: Consider making the tone parameter configurable.
The tone parameter is hardcoded as "number". Consider making it configurable or documenting why this specific format is used.

sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2)
106-108: Use a more robust random seed.
Using std::random_device{}() as a single seed may not provide sufficient randomness on all platforms; some implementations return a fixed value if hardware randomness is unavailable. Consider a more robust seeding approach:

- std::default_random_engine rng(std::random_device{}());
+ std::random_device rd;
+ std::seed_seq seed{rd(), rd(), rd(), rd()};
+ std::mt19937 rng(seed);
167-175: Potential integer overflow in memory calculations.
The multiplication batch_size * keep_frames * feat_dim could overflow for large values. While batch_size is checked to be 1, the other values come from the model and user input. Consider adding overflow checks or using size_t for size calculations:

- std::vector<float> out_data(batch_size * keep_frames * feat_dim);
+ size_t out_size = static_cast<size_t>(batch_size) * keep_frames * feat_dim;
+ if (out_size > std::numeric_limits<size_t>::max() / sizeof(float)) {
+   SHERPA_ONNX_LOGE("Output size too large");
+   exit(-1);
+ }
+ std::vector<float> out_data(out_size);
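The idea generalizes: validate the element count against the allocator's limit before it is turned into a byte size. A Python sketch of that guard (names hypothetical; Python ints never overflow, so the check here is purely about the downstream limit, which plays the role of SIZE_MAX / sizeof(float) in the C++ code):

```python
def checked_out_size(batch_size, keep_frames, feat_dim, max_elems=(2**63 - 1) // 4):
    # Reject element counts whose float buffer would exceed the limit,
    # before any allocation happens.
    out_size = batch_size * keep_frames * feat_dim
    if out_size > max_elems:
        raise OverflowError("Output size too large")
    return out_size

ok = checked_out_size(1, 100, 80)  # typical small request

raised = False
try:
    checked_out_size(1, 2**40, 2**40)  # absurdly large request
except OverflowError:
    raised = True
```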
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (33)
- CMakeLists.txt (1 hunks)
- cmake/cppinyin.cmake (1 hunks)
- python-api-examples/offline-tts.py (3 hunks)
- python-api-examples/offline-zeroshot-tts.py (1 hunks)
- sherpa-onnx/csrc/CMakeLists.txt (5 hunks)
- sherpa-onnx/csrc/offline-tts-impl.cc (3 hunks)
- sherpa-onnx/csrc/offline-tts-impl.h (2 hunks)
- sherpa-onnx/csrc/offline-tts-matcha-impl.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-model-config.cc (3 hunks)
- sherpa-onnx/csrc/offline-tts-model-config.h (3 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1 hunks)
- sherpa-onnx/csrc/offline-tts.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts.h (1 hunks)
- sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1 hunks)
- sherpa-onnx/csrc/text-utils-test.cc (4 hunks)
- sherpa-onnx/csrc/text-utils.cc (1 hunks)
- sherpa-onnx/csrc/text-utils.h (1 hunks)
- sherpa-onnx/csrc/vocoder.cc (2 hunks)
- sherpa-onnx/csrc/vocos-vocoder.cc (2 hunks)
- sherpa-onnx/python/csrc/CMakeLists.txt (1 hunks)
- sherpa-onnx/python/csrc/offline-tts-model-config.cc (2 hunks)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
- sherpa-onnx/python/csrc/offline-tts.cc (1 hunks)
- sherpa-onnx/python/sherpa_onnx/__init__.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T04:23:50.237Z
Learnt from: litongjava
PR: k2-fsa/sherpa-onnx#2440
File: sherpa-onnx/java-api/src/main/java/com/k2fsa/sherpa/onnx/core/Core.java:4-6
Timestamp: 2025-08-06T04:23:50.237Z
Learning: The sherpa-onnx JNI library files are stored in Hugging Face repository at https://huggingface.co/csukuangfj/sherpa-onnx-libs under versioned directories like jni/1.12.7/, and the actual Windows JNI library filename is "sherpa-onnx-jni.dll" as defined in Core.java constants.
Applied to files:
sherpa-onnx/csrc/CMakeLists.txt
📚 Learning: 2025-08-06T04:18:47.981Z
Learnt from: litongjava
PR: k2-fsa/sherpa-onnx#2440
File: sherpa-onnx/java-api/src/main/java/com/k2fsa/sherpa/onnx/core/Core.java:4-6
Timestamp: 2025-08-06T04:18:47.981Z
Learning: In sherpa-onnx Java API, the native library names in Core.java (WIN_NATIVE_LIBRARY_NAME = "sherpa-onnx-jni.dll", UNIX_NATIVE_LIBRARY_NAME = "libsherpa-onnx-jni.so", MACOS_NATIVE_LIBRARY_NAME = "libsherpa-onnx-jni.dylib") are copied directly from the compiled binary filenames and should not be changed to match other libraries' naming conventions.
Applied to files:
sherpa-onnx/csrc/CMakeLists.txt
🧬 Code graph analysis (22)
sherpa-onnx/csrc/vocos-vocoder.cc (2)
- nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1): config (8-22)
- nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1): config (8-24)

sherpa-onnx/csrc/offline-tts-zipvoice-frontend-test.cc (2)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (4): OfflineTtsZipvoiceFrontend (199-217), OfflineTtsZipvoiceFrontend (220-241), OfflineTtsZipvoiceFrontend (367-370), OfflineTtsZipvoiceFrontend (375-378)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1): OfflineTtsZipvoiceFrontend (18-57)

python-api-examples/offline-zeroshot-tts.py (2)
- python-api-examples/offline-tts.py (2): get_args (322-401), main (404-477)
- sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (3): main (19-147), main (19-19), tts (100-100)

sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (6)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1): sherpa_onnx (26-90)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1): sherpa_onnx (11-28)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1): sherpa_onnx (10-14)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1): sherpa_onnx (16-59)
- sherpa-onnx/csrc/offline-tts-model-config.h (1): sherpa_onnx (17-54)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2): tokens (58-180), tokens (58-59)

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (2)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1): sherpa_onnx (13-60)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (2): PybindOfflineTtsZipvoiceModelConfig (13-41), PybindOfflineTtsZipvoiceModelConfig (13-13)

sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (3)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1): sherpa_onnx (26-90)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1): sherpa_onnx (15-39)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1): sherpa_onnx (16-59)

sherpa-onnx/csrc/vocoder.cc (2)
- nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1): config (8-22)
- nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1): config (8-24)

sherpa-onnx/csrc/offline-tts.cc (1)
- sherpa-onnx/csrc/text-utils.cc (6): IsGB2312 (635-655), IsGB2312 (635-635), Gb2312ToUtf8 (658-686), Gb2312ToUtf8 (658-658), IsUtf8 (570-633), IsUtf8 (570-570)

sherpa-onnx/csrc/offline-tts.h (2)
- sherpa-onnx/c-api/cxx-api.cc (2): Generate (458-476), Generate (458-461)
- sherpa-onnx/csrc/offline-tts.cc (4): Generate (173-197), Generate (173-175), Generate (199-237), Generate (199-203)

sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (2)
- sherpa-onnx/csrc/offline-tts-model-config.cc (6): Register (11-26), Register (11-11), Validate (28-57), Validate (28-28), ToString (59-73), ToString (59-59)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (2): tokens (58-180), tokens (58-59)

sherpa-onnx/csrc/text-utils.h (1)
- sherpa-onnx/csrc/text-utils.cc (16): Utf8ToUtf32 (727-730), Utf8ToUtf32 (727-727), Utf32ToUtf8 (732-735), Utf32ToUtf8 (732-732), ToUpperAscii (739-744), ToUpperAscii (739-739), ToLowerAscii (748-753), ToLowerAscii (748-748), IsCJK (756-761), IsCJK (756-756), ContainsCJK (763-766), ContainsCJK (763-763), ContainsCJK (768-775), ContainsCJK (768-768), StringToBool (777-790), StringToBool (777-777)

sherpa-onnx/csrc/offline-tts-zipvoice-model.h (5)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1): sherpa_onnx (26-90)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1): sherpa_onnx (13-60)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1): sherpa_onnx (11-28)
- sherpa-onnx/csrc/offline-tts-model-config.h (1): sherpa_onnx (17-54)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (14): OfflineTtsZipvoiceModel (287-289), OfflineTtsZipvoiceModel (292-294), OfflineTtsZipvoiceModel (296-296), OfflineTtsZipvoiceModel (313-314), OfflineTtsZipvoiceModel (318-319), Run (303-310), Run (303-307), tokens (58-180), tokens (58-59), GetMetaData (298-301), GetMetaData (298-299), Impl (33-41), Impl (33-33), Impl (44-52)

sherpa-onnx/csrc/offline-tts-impl.h (3)
- harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1): callback (469-496)
- scripts/go/sherpa_onnx.go (1): GeneratedAudio (963-968)
- sherpa-onnx/csrc/offline-tts.cc (4): SampleRate (239-239), SampleRate (239-239), NumSpeakers (241-241), NumSpeakers (241-241)

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (6)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1): OfflineTtsZipvoiceModel (17-37)
- sherpa-onnx/csrc/offline-tts-impl.h (1): OfflineTtsImpl (17-55)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1): OfflineTtsZipvoiceFrontend (18-57)
- sherpa-onnx/csrc/vocoder.cc (6): Create (76-103), Create (76-76), Create (106-133), Create (106-107), Create (136-137), Create (141-142)
- sherpa-onnx/csrc/vocoder.h (1): Vocoder (17-31)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (4): OfflineTtsZipvoiceFrontend (199-217), OfflineTtsZipvoiceFrontend (220-241), OfflineTtsZipvoiceFrontend (367-370), OfflineTtsZipvoiceFrontend (375-378)

sherpa-onnx/csrc/text-utils-test.cc (1)
- sherpa-onnx/csrc/text-utils.cc (13): RemoveInvalidUtf8Sequences (486-568), RemoveInvalidUtf8Sequences (486-487), SplitUtf8 (357-393), SplitUtf8 (357-357), ToWideString (689-694), ToWideString (689-689), i (113-113), ToString (696-701), ToString (696-696), ContainsCJK (763-766), ContainsCJK (763-763), ContainsCJK (768-775), ContainsCJK (768-768)

sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (3)
- sherpa-onnx/csrc/ten-vad-model.cc (1): meta_data (257-314)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1): OfflineTtsZipvoiceModel (17-37)
- sherpa-onnx/csrc/vocos-vocoder.cc (2): Run (196-198), Run (196-196)

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (4)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (4): ToString (139-156), ToString (139-139), Validate (42-137), Validate (42-42)
- sherpa-onnx/python/csrc/offline-tts-kokoro-model-config.cc (1): PybindOfflineTtsKokoroModelConfig (11-38)
- sherpa-onnx/python/csrc/offline-tts-vits-model-config.cc (1): PybindOfflineTtsVitsModelConfig (11-37)
- sherpa-onnx/python/csrc/offline-tts-matcha-model-config.cc (1): PybindOfflineTtsMatchaModelConfig (11-37)

sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (3)
- sherpa-onnx/csrc/text-utils.cc (16): is (785-785), ToWideString (689-694), ToWideString (689-689), ToString (696-701), ToString (696-696), Utf32ToUtf8 (732-735), Utf32ToUtf8 (732-732), ToUpperAscii (739-744), ToUpperAscii (739-739), ToLowerAscii (748-753), ToLowerAscii (748-748), i (113-113), ContainsCJK (763-766), ContainsCJK (763-763), ContainsCJK (768-775), ContainsCJK (768-768)
- sherpa-onnx/csrc/piper-phonemize-lexicon.cc (12): is (312-312), is (341-341), is (353-353), is (365-365), ToString (36-59), ToString (36-36), CallPhonemizeEspeak (61-70), CallPhonemizeEspeak (61-63), InitEspeak (277-305), InitEspeak (277-277), ConvertTextToTokenIds (423-436), ConvertTextToTokenIds (423-424)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1): OfflineTtsZipvoiceFrontend (18-57)

sherpa-onnx/python/csrc/offline-tts-model-config.cc (1)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (2): PybindOfflineTtsZipvoiceModelConfig (13-41), PybindOfflineTtsZipvoiceModelConfig (13-13)

sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (6)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1): sherpa_onnx (26-90)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1): sherpa_onnx (13-60)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1): sherpa_onnx (11-28)
- sherpa-onnx/csrc/text-utils.h (1): sherpa_onnx (25-175)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (6): OfflineTtsZipvoiceFrontend (199-217), OfflineTtsZipvoiceFrontend (220-241), OfflineTtsZipvoiceFrontend (367-370), OfflineTtsZipvoiceFrontend (375-378), ConvertTextToTokenIds (243-364), ConvertTextToTokenIds (243-244)
- sherpa-onnx/csrc/offline-tts-frontend.h (1): OfflineTtsFrontend (38-56)

sherpa-onnx/csrc/text-utils.cc (1)
- sherpa-onnx/csrc/kokoro-multi-lang-lexicon.cc (14): text (171-180), text (171-171), text (204-259), text (204-205), text (261-273), text (261-262), text (275-403), text (275-276), is (406-406), is (418-435), is (418-418), is (445-445), is (466-506), is (466-466)

sherpa-onnx/python/csrc/offline-tts.cc (4)
- harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1): callback (469-496)
- android/SherpaOnnxTts/app/src/main/java/com/k2fsa/sherpa/onnx/MainActivity.kt (1): callback (98-106)
- kotlin-api-examples/test_tts.kt (1): callback (157-163)
- android/SherpaOnnxTtsEngine/app/src/main/java/com/k2fsa/sherpa/onnx/tts/engine/MainActivity.kt (1): callback (311-323)
🪛 Ruff (0.12.2)
sherpa-onnx/python/sherpa_onnx/__init__.py
52-52: _sherpa_onnx.OfflineTtsZipvoiceModelConfig imported but unused
(F401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Release shared-OFF tts-OFF
- GitHub Check: Release shared-ON tts-OFF
- GitHub Check: Debug shared-ON tts-OFF
- GitHub Check: ubuntu-24.04 3.13
- GitHub Check: ubuntu-24.04 3.9
- GitHub Check: ubuntu-24.04 3.11
- GitHub Check: ubuntu-24.04 3.8
- GitHub Check: ubuntu-24.04 3.12
- GitHub Check: ubuntu-24.04 3.10
- GitHub Check: ubuntu-latest Debug static tts-OFF
- GitHub Check: ubuntu-latest Debug shared tts-ON
- GitHub Check: ubuntu-latest Debug static tts-ON
- GitHub Check: ubuntu-latest Debug shared tts-OFF
- GitHub Check: ubuntu-latest Release static tts-ON
- GitHub Check: ubuntu-latest Release static tts-OFF
- GitHub Check: ubuntu-latest Release shared tts-OFF
- GitHub Check: ubuntu-latest Release shared tts-ON
- GitHub Check: rknn shared ON
- GitHub Check: rknn shared OFF
- GitHub Check: swift (macos-13)
🔇 Additional comments (23)
python-api-examples/offline-tts.py (2)
3-3: Copyright year range update — LGTM. Docs-only change; no functional impact.
475-477: RTF print reflow — LGTM. Pure formatting; same behavior.
sherpa-onnx/csrc/offline-tts-matcha-impl.h (1)
143-145: Whitespace-only tidy — LGTM. No behavioral change; thanks for keeping the file clean.
sherpa-onnx/python/csrc/CMakeLists.txt (1)
75-75: Offline TTS ZipVoice model‐config binding verified
- Found offline-tts-zipvoice-model-config.cc and .h under sherpa-onnx/python/csrc.
- PybindOfflineTtsZipvoiceModelConfig is defined and invoked in the binding sources.
- OfflineTtsZipvoiceModelConfig is correctly re-exported in sherpa_onnx/__init__.py.

LGTM!
sherpa-onnx/csrc/offline-tts-model-config.cc (2)
15-16: LGTM: ZipVoice options registered. Registration ordering alongside vits/matcha/kokoro/kitten looks consistent.
66-66: LGTM: ToString includes ZipVoice section. Helpful for debugging full model configuration dumps.
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.h (1)
1-16: LGTM: Clean binding declaration and guard. Header matches existing binding patterns; include path and guard naming are consistent.
sherpa-onnx/csrc/vocoder.cc (1)
77-88: Outdated file reference: no vocoder.cc in the codebase
I couldn't find any file at sherpa-onnx/csrc/vocoder.cc, nor any occurrences of config.matcha.vocoder or config.zipvoice.vocoder in the repository. The only vocoder-related implementation is in hifigan-vocoder.cc, which doesn't contain that snippet. Please update or remove this review comment, since it refers to a file and code that no longer exist.
Likely an incorrect or invalid review comment.
sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)
102-111: No additional sample-rate guard needed
The ZipVoice backend already handles prompt audio sample-rate mismatches by resampling internally in ComputeMelSpectrogram (using LinearResample when sample_rate != meta.sample_rate, in offline-tts-zipvoice-impl.h around lines 123–138). Consequently, the wrapper in sherpa-onnx-offline-zeroshot-tts.cc does not require an explicit check or manual resampling.

sherpa-onnx/csrc/CMakeLists.txt (4)
204-206: ZipVoice sources wired into TTS build path — looks good. New ZipVoice sources are properly gated under SHERPA_ONNX_ENABLE_TTS and appended to sherpa-onnx-core sources. No issues spotted here.
355-356: New binary sherpa-onnx-offline-zeroshot-tts integrated correctly. Executable is added under the TTS gate and appended to main_exes for uniform linking, rpath setup, and installation. This matches existing patterns for other binaries.
Also applies to: 377-379
652-656: Good addition of ZipVoice frontend test to the test suite. The new test is correctly appended under SHERPA_ONNX_ENABLE_TTS and will be compiled and registered via sherpa_onnx_add_test. Looks consistent with existing tests.
270-275: Unconditional linking of cppinyin_core is safe; no guard needed
The root CMakeLists.txt always includes the cppinyin module (around line 369) regardless of the SHERPA_ONNX_ENABLE_TTS setting. That module's download_cppinyin() function unconditionally defines and populates the cppinyin_core target, so it's always available for linking; TTS being disabled does not remove or skip its inclusion. You can safely leave the existing target_link_libraries(sherpa-onnx-core cppinyin_core …) as is.
Likely an incorrect or invalid review comment.
sherpa-onnx/csrc/offline-tts-model-config.h (1)
14-14: ZipVoice added to OfflineTtsModelConfig — ensure Register/Validate/ToString are updated. Struct wiring looks consistent (member, include, ctor signature, and init). Please ensure:
- Register(ParseOptions*): flags for all ZipVoice fields (tokens, text_model, flow_matching_model, vocoder, data_dir, pinyin_dict, num_steps, feat_scale, t_shift, target_rms, guidance_scale).
- Validate(): enforces required fields for ZipVoice and provides actionable error messages. Also consider guarding multiple-backend selection (e.g., return error if more than one of vits/matcha/kokoro/zipvoice/kitten is configured).
- ToString(): includes the ZipVoice fields and redacts sensitive paths if that’s the convention.
Optionally, add explicit mutual-exclusion validation to prevent ambiguous backend selection. Example (illustrative, adjust to your style):
bool OfflineTtsModelConfig::Validate() const {
  int active = 0;
  active += !vits.model.empty();
  active += !matcha.acoustic_model.empty();
  active += !kokoro.model.empty();
  active += !zipvoice.text_model.empty() && !zipvoice.flow_matching_model.empty();
  active += !kitten.model.empty();
  if (active != 1) {
    SHERPA_ONNX_LOGE("Exactly one TTS backend must be configured; got %d.", active);
    return false;
  }
  // ZipVoice required fields
  if (!zipvoice.text_model.empty() || !zipvoice.flow_matching_model.empty()) {
    if (zipvoice.tokens.empty()) {
      SHERPA_ONNX_LOGE("ZipVoice requires --zipvoice-tokens.");
      return false;
    }
  }
  // ... other backend-specific checks ...
  return true;
}

Also applies to: 23-25, 35-36, 42-43
sherpa-onnx/csrc/offline-tts-impl.cc (1)
23-23: Include for ZipVoice impl — OK. Including offline-tts-zipvoice-impl.h is correct and localizes the new backend wiring here.
sherpa-onnx/csrc/offline-tts-zipvoice-model.h (1)
23-25: I've added searches to locate and inspect the .cc file for any out-of-line template definitions or explicit instantiations. Once the results are in, I can confirm whether the ctor needs to be moved inline or if dedicated template <> instantiations should be added.

sherpa-onnx/python/csrc/offline-tts.cc (1)
87-88: Good call: release the GIL around the long-running generate() call. Adding py::call_guard<py::gil_scoped_release>() prevents blocking the interpreter during synthesis. Looks correct.
sherpa-onnx/python/csrc/offline-tts-model-config.cc (2)
14-23: Registration order is fine; ensure ZipVoice is initialized before use. You correctly register PybindOfflineTtsZipvoiceModelConfig(m) before exposing it on OfflineTtsModelConfig. No issues spotted.
27-41: No positional OfflineTtsModelConfig calls detected
I ran the suggested grep over all Python and C/C++ files for calls to OfflineTtsModelConfig(...) without keyword arguments. The output only included the class's own definition and binding code; no client code passes positional arguments. It's safe to keep the new zipvoice parameter ordering without breaking existing calls.

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1)
16-41: Bindings look consistent with C++ config; minor polish only. Constructors, defaults, and attributes match the C++ type and Validate/ToString methods. Good to go.
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1)
139-156: LGTM! Clean string representation implementation.
The ToString() method properly formats all configuration fields with appropriate quoting for string fields.

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)
52-53: Fix inconsistent parameter name between declaration and usage.
The parameter is named num_steps in the declaration but used as num_steps in the implementation, which is correct. However, ensure consistency throughout.
297-302: Incorrect conditional logic for RMS scaling.
The condition checks prompt_rms < target_rms but also checks target_rms > 0.0f. The second condition should check prompt_rms > 0.0f to avoid division by zero when calculating the scale factor.

- if (prompt_rms < target_rms && target_rms > 0.0f) {
+ if (prompt_rms < target_rms && prompt_rms > 0.0f) {

Likely an incorrect or invalid review comment.
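To see why the guard belongs on prompt_rms, here is a sketch of the normalization logic in Python (the function name and return shape are illustrative, not the PR's code):

```python
import math

def scale_to_target_rms(samples, target_rms):
    prompt_rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Guard on prompt_rms > 0: a silent prompt has RMS 0, and dividing by it
    # when computing the gain would blow up. Checking target_rms > 0 alone
    # cannot prevent that.
    if prompt_rms < target_rms and prompt_rms > 0.0:
        gain = target_rms / prompt_rms
        return [s * gain for s in samples], gain
    return list(samples), 1.0

# A quiet prompt (RMS 0.01) is scaled up to the target RMS of 0.1.
quiet_scaled, quiet_gain = scale_to_target_rms([0.01, -0.01, 0.01, -0.01], 0.1)

# A silent prompt is passed through unchanged instead of dividing by zero.
silent_scaled, silent_gain = scale_to_target_rms([0.0, 0.0], 0.1)
```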
| help="The scale classifier-free guidance during inference for for Zipvoice.", | ||
| ) |
There was a problem hiding this comment.
Fix user-facing typo in help text (“for for”)
Typo in CLI help: “for for Zipvoice.” → “for Zipvoice.” This shows up in --help.
Apply:
- help="The scale classifier-free guidance during inference for for Zipvoice.",
| help="The scale of classifier-free guidance during inference for Zipvoice.", |

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| help="The scale classifier-free guidance during inference for for Zipvoice.", | |
| ) | |
| help="The scale of classifier-free guidance during inference for Zipvoice.", |
🤖 Prompt for AI Agents
In python-api-examples/offline-zeroshot-tts.py around lines 116 to 117, the CLI
help string contains a duplicated word "for for" ("The scale classifier-free
guidance during inference for for Zipvoice."); update the help text to remove
the extra "for" so it reads "The scale classifier-free guidance during inference
for Zipvoice." and ensure surrounding punctuation and spacing remain correct.
| virtual GeneratedAudio Generate( | ||
| const std::string &text, const std::string &prompt_text, | ||
| const std::vector<float> &prompt_samples, int32_t sample_rate, | ||
| float speed = 1.0, int32_t num_step = 4, | ||
| GeneratedAudioCallback callback = nullptr) const { | ||
| SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS."); | ||
| exit(-1); | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Same issue for zero-shot Generate overload; also fix num_step naming
The zero-shot overload also logs and calls exit(-1). Use a pure-virtual or throw. Also s/num_step/num_steps for consistency with the rest of the codebase and Python binding.
- virtual GeneratedAudio Generate(
+ virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
- float speed = 1.0, int32_t num_step = 4,
- GeneratedAudioCallback callback = nullptr) const {
- SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
- exit(-1);
- }
+ float speed = 1.0, int32_t num_steps = 4,
+ GeneratedAudioCallback callback = nullptr) const = 0;Or, if keeping a body:
- SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS.");
- exit(-1);
+ throw std::runtime_error("OfflineTtsImpl backend does not support zero-shot Generate()");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| virtual GeneratedAudio Generate( | |
| const std::string &text, const std::string &prompt_text, | |
| const std::vector<float> &prompt_samples, int32_t sample_rate, | |
| float speed = 1.0, int32_t num_step = 4, | |
| GeneratedAudioCallback callback = nullptr) const { | |
| SHERPA_ONNX_LOGE("This interface is for Zero-shot TTS."); | |
| exit(-1); | |
| } | |
| virtual GeneratedAudio Generate( | |
| const std::string &text, | |
| const std::string &prompt_text, | |
| const std::vector<float> &prompt_samples, | |
| int32_t sample_rate, | |
| float speed = 1.0, int32_t num_steps = 4, | |
| GeneratedAudioCallback callback = nullptr) const = 0; |
| // @param text The string to be generated. | ||
| // @param prompt_text The transcribe of prompt_sampes. | ||
| // @param prompt_samples The audio samples of the prompt, single channel pcm. | ||
| // @param sample_rate The sample rate of the prompt audio. | ||
| // @param speed The speed for the generated speech. E.g., 2 means 2x faster. | ||
| // @param num_steps The number of flow steps to generate the audio. | ||
| // @param callback If not NULL, it is called whenever config.max_num_sentences | ||
| // sentences have been processed. Note that the passed | ||
| // pointer `samples` for the callback might be invalidated | ||
| // after the callback is returned, so the caller should not | ||
| // keep a reference to it. The caller can copy the data if | ||
| // he/she wants to access the samples after the callback | ||
| // returns. The callback is called in the current thread. | ||
| GeneratedAudio Generate(const std::string &text, | ||
| const std::string &prompt_text, | ||
| const std::vector<float> &prompt_samples, | ||
| int32_t sample_rate, float speed = 1.0, | ||
| int32_t num_steps = 4, | ||
| GeneratedAudioCallback callback = nullptr) const; |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Fix typos and improve parameter docs for the new Generate overload.
- Typo: “transcribe of prompt_sampes” → “transcript of prompt_samples”.
- Consider clarifying that prompt_samples are mono PCM floats in [-1, 1] and sample_rate must be positive.
- If num_steps here is intended to override config.model.zipvoice.num_steps for this call, document precedence.
Apply this doc fix (typos + minor wording):
- // @param text The string to be generated.
- // @param prompt_text The transcribe of prompt_sampes.
- // @param prompt_samples The audio samples of the prompt, single channel pcm.
- // @param sample_rate The sample rate of the prompt audio.
+ // @param text The string to be synthesized.
+ // @param prompt_text The transcript of prompt_samples.
+ // @param prompt_samples The prompt audio samples (mono PCM float, range [-1, 1]).
+ // @param sample_rate The sample rate (Hz) of prompt_samples; must be > 0.
// @param speed The speed for the generated speech. E.g., 2 means 2x faster.
- // @param num_steps The number of flow steps to generate the audio.
+ // @param num_steps The number of flow steps used to generate the audio.
// @param callback If not NULL, it is called whenever config.max_num_sentences
// sentences have been processed. Note that the passed
// pointer `samples` for the callback might be invalidated
// after the callback is returned, so the caller should not
// keep a reference to it. The caller can copy the data if
// he/she wants to access the samples after the callback
// returns. The callback is called in the current thread.

Please confirm precedence between this num_steps parameter and config.model.zipvoice.num_steps (per-call override vs. config default), and update the comment accordingly.
I’ll check how num_steps interacts with config.model.zipvoice.num_steps in the implementation.
🏁 Script executed:
#!/bin/bash
# Locate the offline-tts implementation file
fd -t f -g "offline-tts.cc"
# Inspect the Generate overload to see how num_steps is used
rg -n "Generate" -C3 --ignore-case sherpa-onnx/csrc/offline-tts.cc
# Find where config.model.zipvoice.num_steps is referenced
rg -n "zipvoice" -C3 sherpa-onnx

Length of output: 43138
🏁 Script executed:
#!/bin/bash
# Search for the implementation of Generate in OfflineTtsZipvoiceImpl
rg -n "class OfflineTtsZipvoiceImpl" -C5 sherpa-onnx/csrc/offline-tts-zipvoice-impl.h
# Search for the Generate method declaration in the same file
rg -n "Generate" -C3 sherpa-onnx/csrc/offline-tts-zipvoice-impl.h

Length of output: 1574
🏁 Script executed:
#!/bin/bash
# Check if the ZipVoice override parameter num_steps is used in preference to config.model.zipvoice.num_steps
rg -n "config\\.model\\.zipvoice\\.num_steps" -C3 sherpa-onnx/csrc/offline-tts-zipvoice-impl.cc
# Inspect the Generate implementation to see how the passed-in num_steps is applied
sed -n '1,200p' sherpa-onnx/csrc/offline-tts-zipvoice-impl.cc | rg -n "Generate" -C20

Length of output: 353
Update docs for Generate overload in offline-tts.h
Confirmed that the num_steps parameter in this overload is always used directly (it takes precedence over config.model.zipvoice.num_steps). Please apply the following documentation fixes:
--- a/sherpa-onnx/csrc/offline-tts.h
+++ b/sherpa-onnx/csrc/offline-tts.h
@@ -98,11 +98,13 @@ class OfflineTtsImpl {
- // @param text The string to be generated.
- // @param prompt_text The transcribe of prompt_sampes.
- // @param prompt_samples The audio samples of the prompt, single channel pcm.
- // @param sample_rate The sample rate of the prompt audio.
+ // @param text The string to be synthesized.
+ // @param prompt_text The transcript of `prompt_samples`.
+ // @param prompt_samples The prompt audio samples (mono PCM floats in [-1, 1]).
+ // @param sample_rate The sample rate of `prompt_samples` in Hz; must be > 0.
// @param speed The speed for the generated speech. E.g., 2 means 2× faster.
- // @param num_steps The number of flow steps to generate the audio.
+ // @param num_steps The number of flow steps used to generate audio.
+ // This per-call value overrides `config.model.zipvoice.num_steps`.
// @param callback If not NULL, it is called whenever config.max_num_sentences
// sentences have been processed. Note that the passed
// pointer `samples` for the callback might be invalidated

sherpa-onnx/csrc/offline-tts.h:98–104
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // @param text The string to be generated. | |
| // @param prompt_text The transcribe of prompt_sampes. | |
| // @param prompt_samples The audio samples of the prompt, single channel pcm. | |
| // @param sample_rate The sample rate of the prompt audio. | |
| // @param speed The speed for the generated speech. E.g., 2 means 2x faster. | |
| // @param num_steps The number of flow steps to generate the audio. | |
| // @param callback If not NULL, it is called whenever config.max_num_sentences | |
| // sentences have been processed. Note that the passed | |
| // pointer `samples` for the callback might be invalidated | |
| // after the callback is returned, so the caller should not | |
| // keep a reference to it. The caller can copy the data if | |
| // he/she wants to access the samples after the callback | |
| // returns. The callback is called in the current thread. | |
| GeneratedAudio Generate(const std::string &text, | |
| const std::string &prompt_text, | |
| const std::vector<float> &prompt_samples, | |
| int32_t sample_rate, float speed = 1.0, | |
| int32_t num_steps = 4, | |
| GeneratedAudioCallback callback = nullptr) const; | |
| // @param text The string to be synthesized. | |
| // @param prompt_text The transcript of `prompt_samples`. | |
| // @param prompt_samples The prompt audio samples (mono PCM floats in [-1, 1]). | |
| // @param sample_rate The sample rate of `prompt_samples` in Hz; must be > 0. | |
| // @param speed The speed for the generated speech. E.g., 2 means 2× faster. | |
| // @param num_steps The number of flow steps used to generate audio. | |
| // This per-call value overrides `config.model.zipvoice.num_steps`. | |
| // @param callback If not NULL, it is called whenever config.max_num_sentences | |
| // sentences have been processed. Note that the passed | |
| // pointer `samples` for the callback might be invalidated | |
| // after the callback is returned, so the caller should not | |
| // keep a reference to it. The caller can copy the data if | |
| // he/she wants to access the samples after the callback | |
| // returns. The callback is called in the current thread. | |
| GeneratedAudio Generate(const std::string &text, | |
| const std::string &prompt_text, | |
| const std::vector<float> &prompt_samples, | |
| int32_t sample_rate, float speed = 1.0, | |
| int32_t num_steps = 4, | |
| GeneratedAudioCallback callback = nullptr) const; |
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts.h around lines 98 to 116, the Generate overload
docs are ambiguous about num_steps; update the comment for the num_steps
parameter to state explicitly that this argument is always used directly (it
takes precedence over config.model.zipvoice.num_steps) and that it controls the
number of flow steps for audio generation; adjust any wording that suggests the
config can override this parameter and keep the rest of the parameter
descriptions unchanged.
| if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) { | ||
| // if the previous part is "other", we start a new sentence | ||
| oss << text_parts[i]; | ||
| t_lang = types[i]; | ||
| } else { | ||
| if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" && | ||
| t_lang != "tag") { | ||
| // same language, continue |
Fix always-true condition: use && instead of ||
(types[i] != "tag" || types[i] != "pinyin") is always true, so the branch executes for every types[i]. The intended logic is clearly “not tag and not pinyin”.
Apply this diff:
- if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) {
+ if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) {

📝 Committable suggestion
| if (t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) { | |
| // if the previous part is "other", we start a new sentence | |
| oss << text_parts[i]; | |
| t_lang = types[i]; | |
| } else { | |
| if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" && | |
| t_lang != "tag") { | |
| // same language, continue | |
| if (t_lang == "other" && (types[i] != "tag" && types[i] != "pinyin")) { | |
| // if the previous part is "other", we start a new sentence | |
| oss << text_parts[i]; | |
| t_lang = types[i]; | |
| } else { | |
| if ((t_lang == types[i] || types[i] == "other") && t_lang != "pinyin" && | |
| t_lang != "tag") { | |
| // same language, continue |
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/text-utils-test.cc around lines 194 to 201, the condition if
(t_lang == "other" && (types[i] != "tag" || types[i] != "pinyin")) is always
true; change the inner boolean to use && so it reads (types[i] != "tag" &&
types[i] != "pinyin") to correctly express "not tag AND not pinyin", keeping
existing parentheses and spacing.
| std::vector<char> buffer; | ||
| if (!config.matcha.vocoder.empty()) { | ||
| SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str()); | ||
| buffer = ReadFile(mgr, config.matcha.vocoder); | ||
| } else if (!config.zipvoice.vocoder.empty()) { | ||
| SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s", | ||
| config.zipvoice.vocoder.c_str()); | ||
| buffer = ReadFile(mgr, config.zipvoice.vocoder); | ||
| } else { | ||
| SHERPA_ONNX_LOGE("No vocoder model provided in the config!"); | ||
| exit(-1); | ||
| } | ||
🛠️ Refactor suggestion
Same fix for the resource-manager overload
Mirror the selected-path handling and avoid exit(-1).
- std::vector<char> buffer;
+ std::vector<char> buffer;
+ std::string vocoder_path;
if (!config.matcha.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
- buffer = ReadFile(mgr, config.matcha.vocoder);
+ vocoder_path = config.matcha.vocoder;
+ buffer = ReadFile(mgr, vocoder_path);
} else if (!config.zipvoice.vocoder.empty()) {
SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
config.zipvoice.vocoder.c_str());
- buffer = ReadFile(mgr, config.zipvoice.vocoder);
+ vocoder_path = config.zipvoice.vocoder;
+ buffer = ReadFile(mgr, vocoder_path);
} else {
SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
- exit(-1);
+ return nullptr;
  }

Follow-up required outside this hunk:
- In the subsequent `kHifigan` branch, pass `vocoder_path` to `HifiganVocoder(mgr, ...)`.
📝 Committable suggestion
| std::vector<char> buffer; | |
| if (!config.matcha.vocoder.empty()) { | |
| SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str()); | |
| buffer = ReadFile(mgr, config.matcha.vocoder); | |
| } else if (!config.zipvoice.vocoder.empty()) { | |
| SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s", | |
| config.zipvoice.vocoder.c_str()); | |
| buffer = ReadFile(mgr, config.zipvoice.vocoder); | |
| } else { | |
| SHERPA_ONNX_LOGE("No vocoder model provided in the config!"); | |
| exit(-1); | |
| } | |
| std::vector<char> buffer; | |
| std::string vocoder_path; | |
| if (!config.matcha.vocoder.empty()) { | |
| SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str()); | |
| vocoder_path = config.matcha.vocoder; | |
| buffer = ReadFile(mgr, vocoder_path); | |
| } else if (!config.zipvoice.vocoder.empty()) { | |
| SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s", | |
| config.zipvoice.vocoder.c_str()); | |
| vocoder_path = config.zipvoice.vocoder; | |
| buffer = ReadFile(mgr, vocoder_path); | |
| } else { | |
| SHERPA_ONNX_LOGE("No vocoder model provided in the config!"); | |
| return nullptr; | |
| } |
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/vocoder.cc around lines 108 to 120, the current logic
directly reads into a buffer and calls exit(-1) on missing vocoder; instead,
create a std::string vocoder_path and set it from config.matcha.vocoder or
config.zipvoice.vocoder (mirroring the selected-path handling), log which path
is chosen, use ReadFile(mgr, vocoder_path) to populate the buffer, and remove
the exit(-1) path by returning an error or throwing as appropriate;
additionally, in the later kHifigan branch, pass this vocoder_path into the
HifiganVocoder(mgr, ...) constructor call.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
sherpa-onnx/csrc/vocoder.cc (2)
93-96: Pass the actually selected `vocoder_path` to `HifiganVocoder`
Using `config.matcha.vocoder` here breaks when the selected path is ZipVoice's.
-  return std::make_unique<HifiganVocoder>(
-      config.num_threads, config.provider, config.matcha.vocoder);
+  return std::make_unique<HifiganVocoder>(
+      config.num_threads, config.provider, vocoder_path);
125-127: Pass `vocoder_path` in the Manager overload as well

-  return std::make_unique<HifiganVocoder>(
-      mgr, config.num_threads, config.provider, config.matcha.vocoder);
+  return std::make_unique<HifiganVocoder>(
+      mgr, config.num_threads, config.provider, vocoder_path);
♻️ Duplicate comments (7)
sherpa-onnx/csrc/vocoder.cc (2)
77-88: Use info-level logs, avoid exit(), and capture the selected vocoder path
Switch to info logs for selection, record the chosen path, and return nullptr instead of terminating the process. Also read via the selected path.
-  std::vector<char> buffer;
+  std::vector<char> buffer;
+  std::string vocoder_path;
   if (!config.matcha.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
-    buffer = ReadFile(config.matcha.vocoder);
+    SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+    vocoder_path = config.matcha.vocoder;
+    buffer = ReadFile(vocoder_path);
   } else if (!config.zipvoice.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
-                     config.zipvoice.vocoder.c_str());
-    buffer = ReadFile(config.zipvoice.vocoder);
+    SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
+                     config.zipvoice.vocoder.c_str());
+    vocoder_path = config.zipvoice.vocoder;
+    buffer = ReadFile(vocoder_path);
   } else {
     SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
-    exit(-1);
+    return nullptr;
   }
108-119: Mirror the same fixes for the Manager overload (log level, selected path, no exit)
Use info logs, track vocoder_path, and avoid exit.
-  std::vector<char> buffer;
+  std::vector<char> buffer;
+  std::string vocoder_path;
   if (!config.matcha.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
-    buffer = ReadFile(mgr, config.matcha.vocoder);
+    SHERPA_ONNX_LOGI("Using matcha vocoder: %s", config.matcha.vocoder.c_str());
+    vocoder_path = config.matcha.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else if (!config.zipvoice.vocoder.empty()) {
-    SHERPA_ONNX_LOGE("Using zipvoice vocoder: %s",
-                     config.zipvoice.vocoder.c_str());
-    buffer = ReadFile(mgr, config.zipvoice.vocoder);
+    SHERPA_ONNX_LOGI("Using zipvoice vocoder: %s",
+                     config.zipvoice.vocoder.c_str());
+    vocoder_path = config.zipvoice.vocoder;
+    buffer = ReadFile(mgr, vocoder_path);
   } else {
     SHERPA_ONNX_LOGE("No vocoder model provided in the config!");
     return nullptr;
   }

sherpa-onnx/csrc/offline-tts-impl.h (3)
28-34: Prefer pure-virtual over throwing in the base Generate()
Restores compile-time guarantees and avoids surprising runtime throws for non-zero-shot backends.
-  virtual GeneratedAudio Generate(
-      const std::string &text, int64_t sid = 0, float speed = 1.0,
-      GeneratedAudioCallback callback = nullptr) const {
-    throw std::runtime_error(
-        "OfflineTtsImpl backend does not support non zero-shot Generate()");
-  }
+  virtual GeneratedAudio Generate(
+      const std::string &text, int64_t sid = 0, float speed = 1.0,
+      GeneratedAudioCallback callback = nullptr) const = 0;
35-43: Same for zero-shot overload; also align param name with public API (num_steps)
Use pure-virtual and rename num_step to num_steps for consistency.
-  virtual GeneratedAudio Generate(
-      const std::string &text, const std::string &prompt_text,
-      const std::vector<float> &prompt_samples, int32_t sample_rate,
-      float speed = 1.0, int32_t num_step = 4,
-      GeneratedAudioCallback callback = nullptr) const {
-    throw std::runtime_error(
-        "OfflineTtsImpl backend does not support zero-shot Generate()");
-  }
+  virtual GeneratedAudio Generate(
+      const std::string &text, const std::string &prompt_text,
+      const std::vector<float> &prompt_samples, int32_t sample_rate,
+      float speed = 1.0, int32_t num_steps = 4,
+      GeneratedAudioCallback callback = nullptr) const = 0;
49-52: Keep NumSpeakers() pure-virtual
Avoids hard runtime errors and keeps interfaces explicit.
-  virtual int32_t NumSpeakers() const {
-    throw std::runtime_error(
-        "Zero-shot OfflineTts does not support NumSpeakers()");
-  }
+  virtual int32_t NumSpeakers() const = 0;

sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)
10-10: Replace deprecated `<strstream>` and add the missing `<array>` include
`<array>` is required for `std::array` used below; `<strstream>` is deprecated.
-#include <strstream>
+#include <sstream>
+#include <array>
235-249: Handle empty prompt_samples to avoid division by zero in RMS
Protect sqrt(sum/0) and only scale when non-empty.
-  prompt_rms = std::sqrt(sum_sq / prompt_samples_scaled.size());
-  if (prompt_rms < target_rms && prompt_rms > 0.0f) {
+  if (!prompt_samples_scaled.empty()) {
+    prompt_rms = std::sqrt(sum_sq / prompt_samples_scaled.size());
+  } else {
+    prompt_rms = 0.0f;
+  }
+  if (!prompt_samples_scaled.empty() &&
+      prompt_rms < target_rms && prompt_rms > 0.0f) {
     float scale = target_rms / static_cast<float>(prompt_rms);
     for (auto &s : prompt_samples_scaled) {
       s *= scale;
     }
   }
🧹 Nitpick comments (13)
sherpa-onnx/csrc/text-utils-test.cc (2)
7-8: Remove unused headers; include what you use
`<regex>` and `<sstream>` aren't used. The file uses `std::vector`, `uint8_t`, and streams; include those explicitly to avoid relying on transitive headers.

-#include <regex>
-#include <sstream>
+#include <vector>
+#include <cstdint>
+#include <iostream>
85-89: Deduplicate identical tests for the 0x20 0xC4 case
Three tests assert the same behavior (space followed by orphan `0xC4` → space). Keep one with a precise name to reduce noise.

-TEST(RemoveInvalidUtf8Sequences, BreakingCase_SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space followed by an invalid byte
-  std::string expected = " ";      // 0xC4 removed
+TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidLeadByteAfterSpace) {
+  std::string input = "\x20\xC4";  // space 0x20 then orphan lead byte 0xC4
+  std::string expected = " ";      // 0xC4 removed; space remains
   EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
 }
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte) {
-  std::string input = "\x20\xC4";  // Space (0x20) followed by invalid (0xC4)
-  std::string expected = " ";      // Space remains, 0xC4 is removed
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}
-
-TEST(RemoveInvalidUtf8Sequences, SpaceFollowedByInvalidByte_Breaking) {
-  std::string input = "\x20\xc4";  // Space followed by invalid `0xc4`
-  std::string expected = " ";      // `0xc4` should be removed, space remains
-  EXPECT_EQ(RemoveInvalidUtf8Sequences(input), expected);
-}

Also applies to: 102-106, 114-118
sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (8)
5-11: Add missing header for printf/fprintf; drop unused include.
Include `<cstdio>` explicitly and remove the unused `<fstream>` to avoid transitive-dependency build fragility and warnings.
-#include <chrono>  // NOLINT
-#include <fstream>
+#include <chrono>  // NOLINT
+#include <cstdio>
13-17: Throttle progress logs; current per-callback printf will spam and slow down TTS.
Print only on noticeable progress deltas and reuse the same line.
-static int32_t AudioCallback(const float * /*samples*/, int32_t n,
-                             float progress) {
-  printf("sample=%d, progress=%f\n", n, progress);
-  return 1;
-}
+static int32_t AudioCallback(const float * /*samples*/, int32_t /*n*/,
+                             float progress) {
+  static float last = -1.0f;
+  if (progress >= 1.0f || progress - last >= 0.01f) {
+    printf("\rprogress=%.1f%%", progress * 100.0f);
+    fflush(stdout);
+    last = progress;
+  }
+  return 1;
+}
69-71: Validate num-steps and speed after parsing.
Prevent invalid inputs early; avoids cryptic downstream failures.
   config.Register(&po);
   po.Read(argc, argv);
+  if (num_steps <= 0) {
+    fprintf(stderr, "Error: --num-steps must be >= 1\n");
+    exit(EXIT_FAILURE);
+  }
+  if (speed <= 0) {
+    fprintf(stderr, "Error: --speed must be > 0\n");
+    exit(EXIT_FAILURE);
+  }
62-66: Fix help text wording and clarify expected audio format.
Minor copyedits; clarify it's a mono PCM WAV file.
-  po.Register("prompt-text", &prompt_text, "The transcribe of prompt_samples.");
+  po.Register("prompt-text", &prompt_text,
+              "Transcript of the prompt audio (must match --prompt-audio).");
   po.Register("prompt-audio", &prompt_audio,
-              "The prompt audio file, single channel pcm. ");
+              "Path to the prompt audio file (single-channel PCM WAV).");
78-83: Make the positional-arg guidance OS-agnostic.
Users on Windows often need double quotes. Use "quotes" generically.
-  fprintf(stderr,
-          "Error: Accept only one positional argument. Please use single "
-          "quotes to wrap your text\n");
+  fprintf(stderr,
+          "Error: Accept only one positional argument. Please use quotes to "
+          "wrap your text if it contains spaces.\n");
107-110: Unify failure exit path.
Elsewhere you use exit(EXIT_FAILURE); returning -1 here is inconsistent.
-  fprintf(stderr, "Failed to read '%s'\n", prompt_audio.c_str());
-  return -1;
+  fprintf(stderr, "Failed to read '%s'\n", prompt_audio.c_str());
+  exit(EXIT_FAILURE);
124-127: Simplify elapsed time calculation.
Cleaner and avoids integer truncation concerns.
-  float elapsed_seconds =
-      std::chrono::duration_cast<std::chrono::milliseconds>(end - begin)
-          .count() /
-      1000.;
+  float elapsed_seconds =
+      std::chrono::duration<float>(end - begin).count();
143-145: Send success messages to stdout, keep diagnostics on stderr.
Improves CLI UX and stream separation.
-  fprintf(stderr, "The text is: %s.\n", po.GetArg(1).c_str());
-  fprintf(stderr, "Saved to %s successfully!\n", output_filename.c_str());
+  fprintf(stdout, "The text is: %s.\n", po.GetArg(1).c_str());
+  fprintf(stdout, "Saved to %s successfully!\n", output_filename.c_str());

sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1)
16-17: Optional: add a brief class docstring.
Improves discoverability in help() and IDEs without impacting ABI.
-  py::class_<PyClass>(*m, "OfflineTtsZipvoiceModelConfig")
+  py::class_<PyClass>(*m, "OfflineTtsZipvoiceModelConfig",
+                      "Configuration for ZipVoice offline TTS.\n"
+                      "Fields: tokens, text_model, flow_matching_model, vocoder, "
+                      "data_dir, pinyin_dict, feat_scale, t_shift, target_rms, guidance_scale.")
26-30: Fix typo: kVocoos → kVocos

 enum class ModelType : std::uint8_t {
   kHifigan,
-  kVocoos,
+  kVocos,
   kUnknown,
 };

-  } else if (model_type == "vocos") {
-    return ModelType::kVocoos;
+  } else if (model_type == "vocos") {
+    return ModelType::kVocos;

-    case ModelType::kVocoos:
+    case ModelType::kVocos:
       return std::make_unique<VocosVocoder>(config);

Also applies to: 68-69, 95-96
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1)
124-131: Use debug/info-level logging for resampler creation
Don't log at error level for expected paths; gate on debug.
-  SHERPA_ONNX_LOGE(
-      "Creating a resampler:\n"
-      "  in_sample_rate: %d\n"
-      "  output_sample_rate: %d\n",
-      sample_rate, static_cast<int32_t>(meta.sample_rate));
+  if (config_.model.debug) {
+    SHERPA_ONNX_LOGI(
+        "Creating a resampler:\n"
+        "  in_sample_rate: %d\n"
+        "  output_sample_rate: %d",
+        sample_rate, static_cast<int32_t>(meta.sample_rate));
+  }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- python-api-examples/offline-zeroshot-tts.py (1 hunks)
- sherpa-onnx/csrc/offline-tts-impl.h (2 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h (1 hunks)
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts.cc (1 hunks)
- sherpa-onnx/csrc/offline-tts.h (1 hunks)
- sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1 hunks)
- sherpa-onnx/csrc/text-utils-test.cc (3 hunks)
- sherpa-onnx/csrc/text-utils.cc (1 hunks)
- sherpa-onnx/csrc/vocoder.cc (2 hunks)
- sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc
- sherpa-onnx/csrc/offline-tts.cc
- python-api-examples/offline-zeroshot-tts.py
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h
- sherpa-onnx/csrc/offline-tts.h
- sherpa-onnx/csrc/text-utils.cc
- sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h
- sherpa-onnx/csrc/offline-tts-zipvoice-model.cc
- sherpa-onnx/csrc/offline-tts-zipvoice-frontend.cc
🧰 Additional context used
🧬 Code graph analysis (5)
sherpa-onnx/csrc/vocoder.cc (2)
nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_en.js (1)
- config (8-22)

nodejs-addon-examples/test_tts_non_streaming_matcha_icefall_zh.js (1)
- config (8-24)
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (5)
sherpa-onnx/csrc/offline-tts-zipvoice-model-config.cc (4)
- ToString (131-147, 131-131)
- Validate (40-129, 40-40)

sherpa-onnx/python/csrc/offline-tts-model-config.cc (1)
- PybindOfflineTtsModelConfig (15-48)

sherpa-onnx/python/csrc/offline-tts-kokoro-model-config.cc (1)
- PybindOfflineTtsKokoroModelConfig (11-38)

sherpa-onnx/python/csrc/offline-tts-vits-model-config.cc (1)
- PybindOfflineTtsVitsModelConfig (11-37)

sherpa-onnx/python/csrc/offline-tts-matcha-model-config.cc (1)
- PybindOfflineTtsMatchaModelConfig (11-37)
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (8)
sherpa-onnx/csrc/offline-tts-impl.h (2)
- sherpa_onnx (16-56)
- OfflineTtsImpl (18-52)

sherpa-onnx/csrc/offline-tts-zipvoice-frontend.h (2)
- sherpa_onnx (17-60)
- OfflineTtsZipvoiceFrontend (19-58)

sherpa-onnx/csrc/offline-tts.h (1)
- sherpa_onnx (16-129)

sherpa-onnx/csrc/offline-tts-zipvoice-model-meta-data.h (1)
- sherpa_onnx (11-28)

sherpa-onnx/csrc/offline-tts-zipvoice-model.h (2)
- sherpa_onnx (15-39)
- OfflineTtsZipvoiceModel (17-37)

sherpa-onnx/csrc/offline-tts-zipvoice-model.cc (7)
- OfflineTtsZipvoiceModel (288-290, 293-295, 297-297, 314-315, 319-320)
- tokens (58-180, 58-59)

sherpa-onnx/csrc/vocoder.cc (6)
- Create (76-103, 76-76, 106-133, 106-107, 136-137, 141-142)

sherpa-onnx/csrc/vocoder.h (1)
- Vocoder (17-31)
sherpa-onnx/csrc/text-utils-test.cc (1)
sherpa-onnx/csrc/text-utils.cc (2)
- RemoveInvalidUtf8Sequences (486-568, 486-487)
sherpa-onnx/csrc/offline-tts-impl.h (4)
harmony-os/SherpaOnnxHar/sherpa_onnx/src/main/cpp/non-streaming-tts.cc (1)
- callback (469-496)

sherpa-onnx/csrc/offline-recognizer-impl.h (1)
- std (36-40)

scripts/go/sherpa_onnx.go (1)
- GeneratedAudio (963-968)

sherpa-onnx/csrc/offline-tts.cc (4)
- SampleRate (239-239, 239-239)
- NumSpeakers (241-241, 241-241)
🔇 Additional comments (9)
sherpa-onnx/csrc/text-utils-test.cc (1)
109-112: LGTM: truncated lead byte is removed
Expectation aligns with RemoveInvalidUtf8Sequences semantics (drop invalid `0xC4`, keep surrounding bytes).

sherpa-onnx/csrc/sherpa-onnx-offline-zeroshot-tts.cc (1)
95-98: Wording now matches validation (both flags required).
Good fix; aligns error message with the OR-check.
sherpa-onnx/python/csrc/offline-tts-zipvoice-model-config.cc (5)
5-10: Includes look correct and minimal.
Matches the pattern used by other TTS config bindings.
27-36: Attributes exposure LGTM.
Read/write fields mirror the C++ struct and match Validate/ToString usage.
37-38: __str__ and validate bindings are correct.
Consistent with other TTS configs; returns ToString() and Validate().
16-26: ✔️ C++ ctor signature and defaults are correctly bound.
Confirmed that OfflineTtsZipvoiceModelConfig declares a matching 10-arg constructor (lines 35–50 in sherpa-onnx/csrc/offline-tts-zipvoice-model-config.h) and that all field defaults align:

- data_dir and pinyin_dict default to empty strings (default-constructed std::string).
- feat_scale = 0.1, t_shift = 0.5, target_rms = 0.1, guidance_scale = 1.0 match both the C++ ctor defaults and field initializers.

No changes required.
13-39: Resolved: Zipvoice binding already integrated
The OfflineTtsZipvoiceModelConfig binding registration and Python export are already present:

- In sherpa-onnx/python/csrc/offline-tts-model-config.cc, line 22 calls PybindOfflineTtsZipvoiceModelConfig(m);
- In sherpa-onnx/python/sherpa_onnx/__init__.py, line 52 exports OfflineTtsZipvoiceModelConfig.

No further changes are needed.
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h (2)
258-286: Verify de-scaling mel by feat_scale is theoretically justified
You scale prompt_features by feat_scale before inference, then divide mel by feat_scale afterwards. Given model nonlinearity, this may not invert cleanly. Recommend ablation/bench before keeping.
168-173: Confirm the window_type value with kaldi-native-fbank
I wasn't able to locate a definitive list of accepted window_type strings for knf::FrameExtractionOptions in the kaldi-native-fbank source. Since some implementations expect "hann" or "hamming" rather than "hanning," please:

- Consult the kaldi-native-fbank documentation or the header where FrameExtractionOptions::window_type is defined.
- Verify that "hanning" is recognized (or adjust to "hann"/"hamming" as required).
| virtual GeneratedAudio Generate( | ||
| const std::string &text, const std::string &prompt_text, | ||
| const std::vector<float> &prompt_samples, int32_t sample_rate, | ||
| float speed = 1.0, int32_t num_step = 4, | ||
| GeneratedAudioCallback callback = nullptr) const { | ||
| throw std::runtime_error( | ||
| "OfflineTtsImpl backend does not support zero-shot Generate()"); | ||
| } | ||
💡 Verification agent
🧩 Analysis chain
Check for mixed usage of num_step vs num_steps across the repo
Run to confirm consistent naming:
🏁 Script executed:
#!/bin/bash
rg -nP -C2 '\bnum_step(s)?\b' --type=cpp --type=cc --type=h --type=hpp

Length of output: 99
To ensure we catch both num_step and num_steps occurrences across all common C/C++ file extensions, let’s retry with explicit glob patterns:
🏁 Script executed:
#!/bin/bash
# Search for both 'num_step' and 'num_steps' in C/C++ sources
rg -nP -C2 '\bnum_steps?\b' \
-g '*.cpp' -g '*.cc' -g '*.cxx' \
  -g '*.h' -g '*.hpp' -g '*.hh' .

Length of output: 11018
Parameter Naming Consistency: Rename num_step to num_steps
There’s a single occurrence of the singular num_step in the implementation header, while the rest of the public API and all other overrides consistently use num_steps. Let’s align them.
• sherpa-onnx/csrc/offline-tts-impl.h:38
Replace
virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
float speed = 1.0, int32_t num_step = 4,
GeneratedAudioCallback callback = nullptr) const {
…
}with
virtual GeneratedAudio Generate(
const std::string &text, const std::string &prompt_text,
const std::vector<float> &prompt_samples, int32_t sample_rate,
float speed = 1.0, int32_t num_steps = 4,
    GeneratedAudioCallback callback = nullptr) const {
  …
}

After renaming, rebuild to ensure no breakages in overrides.
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts-impl.h around lines 35 to 43, the parameter name
num_step is inconsistent with the rest of the public API which uses num_steps;
rename the parameter from num_step to num_steps in the Generate declaration and
any matching default value or usages in that header so the signature matches
overrides, then rebuild and run tests to ensure no remaining mismatches.
| explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config) | ||
| : config_(config), | ||
| model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)), | ||
| vocoder_(Vocoder::Create(config.model)) { | ||
| InitFrontend(); | ||
| } | ||
|
|
||
| template <typename Manager> | ||
| OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config) | ||
| : config_(config), | ||
| model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)), | ||
| vocoder_(Vocoder::Create(mgr, config.model)) { | ||
| InitFrontend(mgr); | ||
| } |
🛠️ Refactor suggestion
Guard against null vocoder_ from Create()
Avoid null deref in Generate/Process. Throw a recoverable error.
explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)),
vocoder_(Vocoder::Create(config.model)) {
- InitFrontend();
+ if (!vocoder_) {
+ throw std::runtime_error("Vocoder::Create() returned null");
+ }
+ InitFrontend();
}
@@
OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config)
: config_(config),
model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)),
vocoder_(Vocoder::Create(mgr, config.model)) {
- InitFrontend(mgr);
+ if (!vocoder_) {
+ throw std::runtime_error("Vocoder::Create(mgr) returned null");
+ }
+ InitFrontend(mgr);
}📝 Committable suggestion
| explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config) | |
| : config_(config), | |
| model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)), | |
| vocoder_(Vocoder::Create(config.model)) { | |
| InitFrontend(); | |
| } | |
| template <typename Manager> | |
| OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config) | |
| : config_(config), | |
| model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)), | |
| vocoder_(Vocoder::Create(mgr, config.model)) { | |
| InitFrontend(mgr); | |
| } | |
| explicit OfflineTtsZipvoiceImpl(const OfflineTtsConfig &config) | |
| : config_(config), | |
| model_(std::make_unique<OfflineTtsZipvoiceModel>(config.model)), | |
| vocoder_(Vocoder::Create(config.model)) { | |
| if (!vocoder_) { | |
| throw std::runtime_error("Vocoder::Create() returned null"); | |
| } | |
| InitFrontend(); | |
| } | |
| template <typename Manager> | |
| OfflineTtsZipvoiceImpl(Manager *mgr, const OfflineTtsConfig &config) | |
| : config_(config), | |
| model_(std::make_unique<OfflineTtsZipvoiceModel>(mgr, config.model)), | |
| vocoder_(Vocoder::Create(mgr, config.model)) { | |
| if (!vocoder_) { | |
| throw std::runtime_error("Vocoder::Create(mgr) returned null"); | |
| } | |
| InitFrontend(mgr); | |
| } |
🤖 Prompt for AI Agents
sherpa-onnx/csrc/offline-tts-zipvoice-impl.h around lines 30 to 43, the
constructors call Vocoder::Create(...) but don't guard against it returning
nullptr which can cause null deref later in Generate/Process; immediately after
initializing vocoder_ in both constructors check if vocoder_ is null and throw a
recoverable exception (e.g., std::runtime_error or the project's designated
recoverable error type) with a clear message including context (class name and
model identifier) so callers can handle the failure instead of crashing.
| void InitFrontend() { | ||
| const auto &meta_data = model_->GetMetaData(); | ||
|
|
||
| if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) { | ||
| SHERPA_ONNX_LOGE( | ||
| "Please provide --zipvoice-pinyin-dict for converting Chinese into " | ||
| "pinyin."); | ||
| exit(-1); | ||
| } | ||
| if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) { | ||
| SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng."); | ||
| exit(-1); | ||
| } | ||
| frontend_ = std::make_unique<OfflineTtsZipvoiceFrontend>( | ||
| config_.model.zipvoice.tokens, config_.model.zipvoice.data_dir, | ||
| config_.model.zipvoice.pinyin_dict, meta_data, config_.model.debug); | ||
| } |
Don’t call exit() in library code; throw instead
Surface configuration errors as exceptions.
  if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
    SHERPA_ONNX_LOGE(
        "Please provide --zipvoice-pinyin-dict for converting Chinese into "
        "pinyin.");
-   exit(-1);
+   throw std::runtime_error("--zipvoice-pinyin-dict is required by the model");
  }
  if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
    SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
-   exit(-1);
+   throw std::runtime_error("--zipvoice-data-dir is required by the model");
  }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
void InitFrontend() {
  const auto &meta_data = model_->GetMetaData();
  if (meta_data.use_pinyin && config_.model.zipvoice.pinyin_dict.empty()) {
    SHERPA_ONNX_LOGE(
        "Please provide --zipvoice-pinyin-dict for converting Chinese into "
        "pinyin.");
    throw std::runtime_error("--zipvoice-pinyin-dict is required by the model");
  }
  if (meta_data.use_espeak && config_.model.zipvoice.data_dir.empty()) {
    SHERPA_ONNX_LOGE("Please provide --zipvoice-data-dir for espeak-ng.");
    throw std::runtime_error("--zipvoice-data-dir is required by the model");
  }
  frontend_ = std::make_unique<OfflineTtsZipvoiceFrontend>(
      config_.model.zipvoice.tokens, config_.model.zipvoice.data_dir,
      config_.model.zipvoice.pinyin_dict, meta_data, config_.model.debug);
}
🤖 Prompt for AI Agents
In sherpa-onnx/csrc/offline-tts-zipvoice-impl.h around lines 101 to 117, replace
the calls to exit(-1) when required config is missing with thrown exceptions
(e.g., throw std::runtime_error or a more specific exception type used in the
project) that include the same descriptive error messages; construct the
exception with the message used in SHERPA_ONNX_LOGE, keep logging the error if
desired, and allow the caller to handle or propagate the exception instead of
terminating the process.
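The fix the bot suggests is a general pattern: log the problem, then raise a recoverable error instead of terminating the process. A minimal Python sketch of the same idea (function and message names are illustrative; the actual code is the C++ above):

```python
import logging

logger = logging.getLogger("zipvoice")


def init_frontend(use_pinyin, pinyin_dict, use_espeak, data_dir):
    """Validate the config; raise instead of exit(-1) so callers can recover."""
    if use_pinyin and not pinyin_dict:
        msg = "--zipvoice-pinyin-dict is required by the model"
        logger.error(msg)
        raise RuntimeError(msg)  # was: exit(-1)
    if use_espeak and not data_dir:
        msg = "--zipvoice-data-dir is required by the model"
        logger.error(msg)
        raise RuntimeError(msg)  # was: exit(-1)
    return "frontend-ok"


# The caller decides what a config error means; the process survives.
try:
    init_frontend(use_pinyin=True, pinyin_dict="", use_espeak=False, data_dir="")
except RuntimeError as e:
    print("recoverable:", e)
```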
|
I tested it on my local machine; the C++/Python binaries run normally. I will merge it first so people can go ahead based on this PR. Potential issues will be fixed in follow-up PRs. |
|
Read the comments in the top of |
|
Amazing work done here! Any chance you have a pure Python script that uses ZipVoice? I tried to vibe-code it without success.

Details

#!/usr/bin/env python3
"""
Minimal ZipVoice ONNX inference example with hardcoded phonemes.
No phonemization - just direct phoneme-to-ID conversion using tokens.txt
"""
import numpy as np
import onnxruntime as ort
import wave
import soundfile as sf
import kaldi_native_fbank as knf
def load_tokens(tokens_file):
"""Load token-to-ID mapping from tokens.txt"""
token2id = {}
with open(tokens_file, 'r', encoding='utf-8') as f:
for line in f:
parts = line.strip().split()
if len(parts) == 2:
token, idx = parts
token2id[token] = int(idx)
elif len(parts) == 1:
# Single number means space token
token2id[' '] = int(parts[0])
return token2id
def phonemes_to_ids(phonemes, token2id):
"""Convert list of phonemes to token IDs"""
ids = []
for phoneme in phonemes:
if phoneme in token2id:
ids.append(token2id[phoneme])
else:
print(f"Warning: Unknown phoneme '{phoneme}', skipping")
return np.array(ids, dtype=np.int64)
def extract_mel_spectrogram_kaldi(audio, sample_rate, n_mels=100):
"""
Extract mel spectrogram using kaldi-native-fbank (same as sherpa-onnx uses)
Args:
audio: Audio samples as numpy array
sample_rate: Sample rate of the audio
n_mels: Number of mel bands
Returns:
mel_spectrogram: [num_frames, n_mels] mel spectrogram
"""
# Configure fbank options
opts = knf.FbankOptions()
opts.frame_opts.samp_freq = sample_rate
opts.frame_opts.frame_shift_ms = 256.0 / sample_rate * 1000 # hop_length in ms
opts.frame_opts.frame_length_ms = 1024.0 / sample_rate * 1000 # n_fft in ms
opts.frame_opts.dither = 0.0
opts.frame_opts.preemph_coeff = 0.0
opts.frame_opts.remove_dc_offset = False
opts.frame_opts.window_type = "hann"
opts.frame_opts.snip_edges = True
opts.mel_opts.num_bins = n_mels
opts.mel_opts.low_freq = 0
opts.mel_opts.high_freq = sample_rate / 2
opts.mel_opts.is_librosa = True # Use librosa-style mel
# Create fbank computer
fbank = knf.OnlineFbank(opts)
# Accept waveform
fbank.accept_waveform(sample_rate, audio.tolist())
fbank.input_finished()
# Get features
num_frames = fbank.num_frames_ready
mel_spec = []
for i in range(num_frames):
frame = fbank.get_frame(i)
mel_spec.append(frame)
mel_spec = np.array(mel_spec)
return mel_spec
def run_zipvoice_inference(
text_model_path,
flow_matching_model_path,
vocoder_path,
tokens_file,
phonemes,
prompt_phonemes,
prompt_audio_path=None,
output_wav="output.wav",
speed=1.0,
num_steps=10,
feat_scale=0.1,
t_shift=0.5,
guidance_scale=1.0,
):
# Load token mappings
print("Loading tokens...")
token2id = load_tokens(tokens_file)
print(f"Loaded {len(token2id)} tokens")
# Convert phonemes to IDs
print(f"Converting phonemes to IDs...")
token_ids = phonemes_to_ids(phonemes, token2id)
prompt_token_ids = phonemes_to_ids(prompt_phonemes, token2id)
print(f"Text tokens: {token_ids.shape} - {token_ids}")
print(f"Prompt tokens: {prompt_token_ids.shape} - {prompt_token_ids}")
# Prepare inputs
tokens_input = token_ids.reshape(1, -1) # [1, seq_len]
prompt_tokens_input = prompt_token_ids.reshape(1, -1) # [1, prompt_len]
# Load or create prompt features (mel spectrogram)
feat_dim = 100
if prompt_audio_path:
print(f"Loading prompt audio from: {prompt_audio_path}")
# Load audio file
prompt_audio, sr = sf.read(prompt_audio_path)
# Resample if needed (ZipVoice uses 24kHz)
target_sr = 24000
if sr != target_sr:
print(f"Resampling from {sr}Hz to {target_sr}Hz")
from scipy import signal as sig
num_samples = int(len(prompt_audio) * target_sr / sr)
prompt_audio = sig.resample(prompt_audio, num_samples)
# Extract mel spectrogram using kaldi-native-fbank
mel = extract_mel_spectrogram_kaldi(prompt_audio, target_sr, n_mels=feat_dim)
# Scale by feat_scale
mel = mel * feat_scale
# Add batch dimension [1, time, mels]
prompt_features = mel[np.newaxis, :, :].astype(np.float32)
prompt_num_frames = mel.shape[0]
print(f"Extracted mel spectrogram: {prompt_features.shape}")
else:
# Use dummy features if no prompt audio provided
print("No prompt audio provided, using random noise")
prompt_num_frames = max(5, len(prompt_phonemes) // 2)
prompt_features = np.random.randn(1, prompt_num_frames, feat_dim).astype(np.float32) * feat_scale
prompt_feat_len = np.array([prompt_num_frames], dtype=np.int64)
speed_input = np.array([speed], dtype=np.float32)
# Load and run text encoder
print("\nRunning text encoder...")
text_sess = ort.InferenceSession(text_model_path)
text_outputs = text_sess.run(
None,
{
"tokens": tokens_input,
"prompt_tokens": prompt_tokens_input,
"prompt_features_len": prompt_feat_len,
"speed": speed_input,
}
)
text_condition = text_outputs[0]
print(f"Text condition shape: {text_condition.shape}")
# Flow matching inference
print(f"\nRunning flow matching model ({num_steps} steps)...")
batch_size = 1
num_frames = text_condition.shape[1]
# Initialize x with random noise
x = np.random.randn(batch_size, num_frames, feat_dim).astype(np.float32)
# Initialize speech condition (zero-padded prompt features)
speech_condition = np.zeros((batch_size, num_frames, feat_dim), dtype=np.float32)
speech_condition[:, :prompt_num_frames, :] = prompt_features
# Compute timesteps
timesteps = []
for i in range(num_steps + 1):
t = i / num_steps
timesteps.append(t_shift * t / (1.0 + (t_shift - 1.0) * t))
guidance_scale_input = np.array([guidance_scale], dtype=np.float32)
# Load flow matching model
fm_sess = ort.InferenceSession(flow_matching_model_path)
# Iterative refinement
for step in range(num_steps):
t = np.array([timesteps[step]], dtype=np.float32)
fm_outputs = fm_sess.run(
None,
{
"t": t,
"x": x,
"text_condition": text_condition,
"speech_condition": speech_condition,
"guidance_scale": guidance_scale_input,
}
)
v = fm_outputs[0]
delta_t = timesteps[step + 1] - timesteps[step]
x = x + v * delta_t
if (step + 1) % 5 == 0:
print(f" Step {step + 1}/{num_steps}")
# Remove prompt frames
mel = x[:, prompt_num_frames:, :]
print(f"Generated mel shape: {mel.shape}")
# Transpose mel: [batch, time, feat] -> [batch, feat, time]
mel = np.transpose(mel, (0, 2, 1))
# Scale back
mel = mel / feat_scale
# Run vocoder
print("\nRunning vocoder...")
vocoder_sess = ort.InferenceSession(vocoder_path)
audio = vocoder_sess.run(None, {"mels": mel})[0]
# Get audio array
audio = audio.flatten()
print(f"Generated audio shape: {audio.shape}")
# Save to WAV file
sample_rate = 24000 # ZipVoice default
audio_int16 = (audio * 32767).astype(np.int16)
with wave.open(output_wav, 'wb') as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(sample_rate)
wf.writeframes(audio_int16.tobytes())
print(f"\nAudio saved to: {output_wav}")
print(f"Duration: {len(audio) / sample_rate:.2f}s")
if __name__ == "__main__":
# Example usage with hardcoded phonemes
# Using simple phonemes from the ZipVoice tokens.txt
# Hardcoded phonemes for "hello" (using simple tokens from tokens.txt)
# These are basic IPA phonemes: h-e-l-o
text_phonemes = ['h', 'e', 'l', 'o', ' ', 'w', 'ɜː', 'l', 'd']
# Prompt phonemes (shorter reference)
prompt_phonemes = ['h', 'a', 'i']
# Model paths
MODEL_DIR = "sherpa-onnx-zipvoice-distill-zh-en-emilia"
TEXT_MODEL = f"{MODEL_DIR}/text_encoder.onnx"
FLOW_MATCHING_MODEL = f"{MODEL_DIR}/fm_decoder.onnx"
VOCODER = f"{MODEL_DIR}/vocos_24khz.onnx"
TOKENS_FILE = f"{MODEL_DIR}/tokens.txt"
print("="*60)
print("ZipVoice Minimal Inference Example")
print("="*60)
print(f"\nText phonemes: {text_phonemes}")
print(f"Prompt phonemes: {prompt_phonemes}")
# Use the actual prompt audio file that came with the model
PROMPT_AUDIO = f"{MODEL_DIR}/prompt.wav"
run_zipvoice_inference(
text_model_path=TEXT_MODEL,
flow_matching_model_path=FLOW_MATCHING_MODEL,
vocoder_path=VOCODER,
tokens_file=TOKENS_FILE,
phonemes=text_phonemes,
prompt_phonemes=prompt_phonemes,
prompt_audio_path=PROMPT_AUDIO,
output_wav="output.wav",
speed=1.0,
num_steps=10,
)
|
Please visit https://github.com/k2-fsa/ZipVoice |
|
sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/vocos_24khz.onnx |
Can you please tell us about the issue you're experiencing? @UncleLLD |
After I updated sherpa-onnx, the problem was gone, thanks. Saved to ./generated.wav |
Summary by CodeRabbit
New Features
Python API
Improvements
Chores
Tests