The select_best_length logic in build_input.py appears to be wrong #2

@kknd21988

Description

# Original max_length selection method; the logic is flawed
# for i in len_dict:
#     rate = i[1] / all_sent
#     cover_rate += rate
#     if cover_rate >= limit_ratio:
#         max_length = i[0]
#         break

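Analysis: len_dict is a frequency list of sentence lengths, e.g. [(15,3700),(12,2800),(8,500)...(20,30)], where each element is (sentence length, count).
With the logic above, once 3700+2800+500 exceeds 95% of the total count, max_length is set to 8. That is the error: the list appears to be ordered by count rather than by length, so the accumulated coverage includes sentences of length 15 and 12, which are longer than 8.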
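The failure can be reproduced with a small sketch. The histogram below is hypothetical (it reuses the four values quoted above and ignores the elided entries), and `original_max_length` is an illustrative wrapper around the original loop, not a function from the repository.

```python
# Hypothetical (length, count) histogram, ordered by count as in the example above.
len_dict = [(15, 3700), (12, 2800), (8, 500), (20, 30)]
all_sent = sum(count for _, count in len_dict)
limit_ratio = 0.95  # assumed threshold, matching the 95% mentioned in the analysis


def original_max_length(len_dict, all_sent, limit_ratio):
    """Replicates the original loop: accumulates coverage in list order."""
    cover_rate = 0.0
    for length, count in len_dict:
        cover_rate += count / all_sent
        if cover_rate >= limit_ratio:
            return length
    return None


chosen = original_max_length(len_dict, all_sent, limit_ratio)

# Fraction of sentences that actually fit within the chosen length.
true_coverage = sum(c for length, c in len_dict if length <= chosen) / all_sent

print(chosen, round(true_coverage, 3))
```

With this data the loop stops at length 8, yet only the 500 sentences of length 8 fit within it, so the real coverage is far below the requested ratio.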

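It should be changed as follows: sort len_dict by sentence length in ascending order, then accumulate coverage starting from the shortest length.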

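# Sort by sentence length (ascending) so coverage accumulates from the shortest sentences
temp = sorted(len_dict, key=lambda x: x[0])
for i in temp:
    rate = i[1] / all_sent
    cover_rate += rate
    if cover_rate >= limit_ratio:
        max_length = i[0]  # shortest length that covers limit_ratio of all sentences
        break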
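For reference, here is a self-contained sketch of the same idea. The function name `pick_max_length`, its signature, the 0.95 default, and the sample histogram are assumptions for illustration, not the repository's actual API. It compares integer counts against the threshold instead of summing per-bucket ratios, which sidesteps floating-point accumulation near the limit.

```python
def pick_max_length(len_dict, limit_ratio=0.95):
    """Return the smallest length L such that at least limit_ratio of all
    sentences have length <= L.

    len_dict: iterable of (sentence_length, count) pairs in any order.
    """
    pairs = list(len_dict)
    total = sum(count for _, count in pairs)
    if total == 0:
        raise ValueError("len_dict contains no sentences")

    covered = 0
    # Walk lengths from shortest to longest so `covered` is a true cumulative count.
    for length, count in sorted(pairs, key=lambda item: item[0]):
        covered += count
        if covered >= limit_ratio * total:
            return length
    # Fallback for limit_ratio > 1: use the longest observed length.
    return max(length for length, _ in pairs)


# Hypothetical histogram, deliberately not sorted by length.
hist = [(15, 3700), (12, 2800), (8, 500), (20, 30)]
print(pick_max_length(hist))
```

On this sample the result is 15, since lengths 8, 12, and 15 together account for 7000 of the 7030 sentences.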
