fix(C9): enable real BM25 (jieba) retrieval in hybrid_search + replace round-robin with RRF #106
Merged
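- Integrate rank_bm25.BM25Okapi with jieba precise-mode tokenization and Chinese stopword filtering
- Add _rrf_merge: the standard RRF formula score = Σ 1/(k + rank) with k=60, deduplicated by node_id
- Rewrite hybrid_search as three paths (dual_level + vector + bm25) fused with RRF
- Remove the placeholder langchain BM25Retriever (the original code initialized it but never queried it)

On a self-built 100-question evaluation set, comparing round-robin vs RRF with other variables held fixed:
- MRR@10 +0.17 (better ranking quality)
- Hit@5 / Recall@5 unchanged, already at the ceiling (the recall paths did not change)
- Latency P50 roughly flat (jieba incurs a one-time cost on first load)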
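As a worked illustration of the RRF formula above, consider two documents with hypothetical ranks (these numbers are invented for the example and are not taken from the evaluation): d_1 is ranked 1st by the vector path and 3rd by the bm25 path, while d_2 is ranked 1st by a single path only.

```latex
\mathrm{score}(d) = \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}, \qquad k = 60

\mathrm{score}(d_1) = \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0323

\mathrm{score}(d_2) = \frac{1}{60 + 1} \approx 0.0164
```

Under these assumed ranks, agreement between two paths outweighs a single first place, which is the behavior RRF is meant to provide.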
Member
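Thanks for the PR! I had originally planned to present the jieba issue as an error case in the Chapter 10 project. As for the RRF in this PR, it appears to deduplicate by recipe across sources, but does it account for multiple chunks of the same dish sharing a node_id and being scored repeatedly within a single source? Would it be convenient to optimize this?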
…chunk
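- The original implementation accumulated scores without deduplicating per source, so multiple chunks of the same recipe within one path
were scored repeatedly, violating the RRF semantics that "each ranker contributes once per doc"
- The canonical doc was previously chosen as "first seen in input order", which made it depend on the order of ranked_lists;
it is now chosen by global minimum rank, with ties broken by ranked_lists order
- The per-source chunk hit counts are also stored in metadata.rrf_chunk_hits
for later analysis
- _rrf_merge no longer mutates the input Document.metadata and returns new Document objects
After the update, on the 100-question evaluation set MRR@10 went from 0.898 to 0.939 and Faithfulness from 0.680 to 0.734.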
Addresses review comment in datawhalechina#106
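The behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the PR's actual _rrf_merge: the Doc dataclass stands in for langchain's Document, ranked_lists is assumed to be a list of (source, docs) pairs, and the names doc_key, rrf_merge and the rrf_score metadata key are hypothetical. Only rrf_chunk_hits, node_id and k=60 come from the PR text.

```python
from dataclasses import dataclass, field
from hashlib import md5


@dataclass
class Doc:
    """Stand-in for a langchain Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def doc_key(doc):
    """Prefer node_id as the dedup key; fall back to a content hash."""
    node_id = doc.metadata.get("node_id")
    if node_id is not None:
        return str(node_id)
    return md5(doc.page_content.encode("utf-8")).hexdigest()


def rrf_merge(ranked_lists, k=60, top_k=5):
    """Fuse (source, docs) lists with RRF, scoring each key once per source."""
    scores = {}  # key -> fused RRF score
    best = {}    # key -> (rank, list index, doc) of the canonical chunk
    hits = {}    # key -> {source: number of chunks that hit}
    for list_idx, (source, docs) in enumerate(ranked_lists):
        seen = set()  # keys already scored for this source
        for rank, doc in enumerate(docs, start=1):
            key = doc_key(doc)
            per_source = hits.setdefault(key, {})
            per_source[source] = per_source.get(source, 0) + 1
            # Canonical doc: global minimum rank, ties broken by list order.
            if key not in best or (rank, list_idx) < best[key][:2]:
                best[key] = (rank, list_idx, doc)
            if key in seen:
                continue  # later chunks of the same doc add no score
            seen.add(key)
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda key: scores[key], reverse=True)
    fused = []
    for key in ordered[:top_k]:
        doc = best[key][2]
        metadata = dict(doc.metadata)  # copy so the input is not mutated
        metadata["rrf_score"] = scores[key]
        metadata["rrf_chunk_hits"] = dict(hits[key])
        fused.append(Doc(doc.page_content, metadata))
    return fused
```

Because each source scores a key only at its best (first) rank, a recipe split into many chunks gains nothing from filling one ranked list, while the hit counts remain available in the metadata for analysis.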
Contributor
Author
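Thanks for pointing this out! It is indeed a bug: multiple chunks with the same node_id within one source were scored repeatedly. Fixed in commit 9e4526a. Main changes:
On the evaluation set, both MRR@10 and faithfulness improved; see the commit for the detailed numbers.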
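Problem
The current hybrid_search implementation in Chapter 9 hybrid_retrieval.py has two problems:
1. BM25 is a placeholder and never actually takes part in retrieval
- __init__/initialize creates self.bm25_retriever = BM25Retriever.from_documents(chunks) (lines 63-65)
- hybrid_search(query, top_k) only calls dual_level_retrieval + vector_search_enhanced and never queries bm25_retriever
- langchain.BM25Retriever tokenizes with text.split() by default, which for Chinese amounts to "the whole sentence as a single token" and has no retrieval value
2. Round-robin merging makes no use of score information
- At the end of hybrid_search, each doc gets metadata["final_score"] (the vector path even applies a 1.0 - vector_score cosine-to-similarity conversion), but the function ultimately returns final_docs = merged_docs[:top_k], slicing by round-robin position and never re-sorting by final_score
- The final_score metadata is therefore unused: a doc's final rank depends entirely on its position in the original lists, not on its actual relevance
Change overview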
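Both problems that motivate these changes can be reproduced in a few lines. The sketch below is illustrative only: round_robin_merge is an assumed reconstruction of position-by-position interleaving, and the candidate ids and final_score values are invented.

```python
from itertools import zip_longest

# Problem 1: str.split() splits on whitespace, and this Chinese query
# contains none, so the whole query becomes a single token.
query = "红烧肉怎么做"
whitespace_tokens = query.split()


def round_robin_merge(*ranked_lists):
    """Interleave ranked lists position by position (assumed merge behavior)."""
    merged = []
    for group in zip_longest(*ranked_lists):
        merged.extend(doc for doc in group if doc is not None)
    return merged


# Problem 2: hypothetical candidates with made-up final_score values.
dual_level = [{"id": "d1", "final_score": 0.20}, {"id": "d2", "final_score": 0.10}]
vector = [{"id": "v1", "final_score": 0.90}, {"id": "v2", "final_score": 0.80}]

top_k = 2
merged_docs = round_robin_merge(dual_level, vector)

# Slicing by position keeps d1 even though it has a low score.
by_position = [doc["id"] for doc in merged_docs[:top_k]]

# Sorting by final_score first would have returned the two vector hits.
by_score = [
    doc["id"]
    for doc in sorted(merged_docs, key=lambda d: d["final_score"], reverse=True)[:top_k]
]
```

With these invented scores the two selections differ, which shows how a positional slice can discard the ranking information that final_score was meant to carry.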
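Enable real Chinese BM25
- jieba precise-mode tokenization plus a Chinese stopword list (about 60 words hand-picked for the cooking domain, covering particles, interrogatives, modal particles, etc.)
- Use rank_bm25.BM25Okapi directly to obtain scores (langchain BM25Retriever does not expose them)
- Built on the same chunks, so the two paths share the same candidate set, which avoids the coverage bias in RRF fusion where "a doc exists in only one path's index"
Replace round-robin with RRF fusion
- score(d) = Σ_i 1 / (k + rank_i(d)), k=60
- Three paths, dual_level / vector / bm25, each contributing max(top_k * 2, 10) candidates for RRF re-ranking
- Deduplicate by node_id, with a hash fallback (different retrieval paths may append "related information" to the page_content of the same recipe, so deduplicating by hash alone would miss merges)
Evaluation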
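For readers unfamiliar with the ranking metrics reported in this PR, MRR@10 and Hit@5 can be computed roughly as below. This is a generic sketch of the standard definitions, not the PR's evaluation script; the function names and the sample ids are hypothetical.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant id within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1.0 if any relevant id appears within the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_reciprocal_rank(runs, k=10):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical single query: the same candidates, reordered by fusion.
relevant = {"gold"}
before = ["x", "y", "gold", "z"]
after = ["gold", "x", "y", "z"]
```

In this made-up case the reciprocal rank rises from 1/3 to 1 while Hit@5 stays at 1, which is the pattern one would expect when only the ordering changes and the recall paths stay the same.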
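Round-robin vs RRF on the self-built 100-question evaluation set (7 question types × 3 difficulty levels):
MRR@10 gain by question type:
Because the change is confined to the hybrid_search function, other retrieval paths are unaffected, and the per-type data confirms this.
Compatibility
- requirements.txt adds jieba>=0.42.1 (rank-bm25 is already present)
- The external signature of HybridRetrievalModule.hybrid_search(query, top_k) is unchanged
- The import from langchain_community.retrievers import BM25Retriever is removed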
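The Chinese BM25 path that these dependencies support might look roughly like the sketch below. It is an illustration under stated assumptions rather than the PR's code: STOPWORDS is a tiny invented subset and not the PR's roughly 60-word list, the helper names are hypothetical, and the cut parameter is an addition that lets a different tokenizer be injected. jieba.lcut and rank_bm25.BM25Okapi are the real library entry points and are imported lazily.

```python
# Tiny illustrative subset; the PR's hand-picked list is not reproduced here.
STOPWORDS = frozenset({"的", "了", "吗", "呢", "怎么", "什么", "如何"})


def tokenize(text, cut=None, stopwords=STOPWORDS):
    """Tokenize text and drop stopwords; defaults to jieba precise mode."""
    if cut is None:
        import jieba  # third-party; imported lazily so other cutters can be used
        cut = jieba.lcut  # precise mode is jieba's default
    tokens = (token.strip() for token in cut(text))
    return [token for token in tokens if token and token not in stopwords]


def build_bm25(texts, cut=None):
    """Build a BM25Okapi index over pre-tokenized chunk texts."""
    from rank_bm25 import BM25Okapi  # third-party

    return BM25Okapi([tokenize(text, cut) for text in texts])


def bm25_search(bm25, texts, query, top_k=5, cut=None):
    """Return the top_k (text, score) pairs ranked by BM25 score."""
    scores = bm25.get_scores(tokenize(query, cut))
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [(texts[i], float(scores[i])) for i in order[:top_k]]
```

A call such as bm25_search(build_bm25(texts), texts, "红烧肉怎么做") would then return scored candidates that can be passed to the fusion step, provided texts holds the same chunk contents used by the other retrieval paths.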