Open
Description
# Original max_length selection method; the logic is flawed
# for i in len_dict:
#     rate = i[1] / all_sent
#     cover_rate += rate
#     if cover_rate >= limit_ratio:
#         max_length = i[0]
#         break
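Analysis: len_dict is a list of sentence-length frequency counts, e.g. [(15, 3700), (12, 2800), (8, 500), ..., (20, 30)], where each element is (sentence length, frequency). Note that the list is not ordered by sentence length.
With the logic above, as soon as 3700 + 2800 + 500 reaches 95% of the total count, max_length is set to 8, even though the sentences of length 12 and 15 counted before it are longer than 8. This is the error.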
Suggested fix: sort len_dict by sentence length in ascending order, then accumulate coverage starting from the shortest length:
temp = sorted(len_dict, key=lambda x: x[0], reverse=False)
for i in temp:
    rate = i[1] / all_sent
    cover_rate += rate
    if cover_rate >= limit_ratio:
        max_length = i[0]
        break
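
To make the difference concrete, here is a minimal self-contained sketch comparing the two approaches. The function names `pick_max_length_original` and `pick_max_length_sorted`, the 0.95 default, and the four-entry sample list are assumptions for illustration only (the sample reuses the counts quoted above and drops the elided entries); they are not taken from the repository.

```python
def pick_max_length_original(len_dict, limit_ratio=0.95):
    """Accumulate coverage in the order the list is given (the reported behavior)."""
    all_sent = sum(count for _, count in len_dict)
    cover_rate = 0.0
    for length, count in len_dict:
        cover_rate += count / all_sent
        if cover_rate >= limit_ratio:
            return length
    # Fallback in case floating-point rounding keeps the sum just below the limit.
    return len_dict[-1][0]


def pick_max_length_sorted(len_dict, limit_ratio=0.95):
    """Accumulate coverage from the shortest sentence length upward (the proposed fix)."""
    all_sent = sum(count for _, count in len_dict)
    cover_rate = 0.0
    for length, count in sorted(len_dict, key=lambda x: x[0]):
        cover_rate += count / all_sent
        if cover_rate >= limit_ratio:
            return length
    # Fallback: the longest length covers every sentence.
    return max(length for length, _ in len_dict)


# Hypothetical sample: (sentence length, frequency), not ordered by length.
sample = [(15, 3700), (12, 2800), (8, 500), (20, 30)]

print(pick_max_length_original(sample))  # stops at the third entry of the unsorted list
print(pick_max_length_sorted(sample))    # stops at the length that really covers >= 95%
```

With this sample, the unsorted version stops at length 8 while the sorted version stops at length 15, which is the smallest length that actually covers at least 95% of the sentences. Neither function handles an empty list; a real implementation would need to guard against that.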