With a custom dictionary, the model outputs an unexpected segmentation when the entire sentence is a dictionary word. #466

wangyuxinwhy · 2020-12-28T06:53:33Z

ltp version: 4.1.1

from ltp import LTP

ltp = LTP("base")
ltp.add_words("跟谁学")
print(ltp.seg(['跟谁学']))

Output:
([['跟', '谁', '学']], {'word_cls': tensor([[[-4.0777e-01, 4.0268e-01, 1.2457e-01, -1.5838e-01, 6.1400e-03 ...

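I expected the whole sentence to be returned as a single word. In the source, if the line in the maximum_forward_matching method of ltp.algorithms.Trie is changed from
while end <= text_len and curr_len < max_len: to while end <= text_len and curr_len <= max_len, the segmentation comes out correctly.
In other words, change < to <=.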
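
To make the boundary condition concrete, here is a minimal, self-contained sketch of greedy forward maximum matching. It is not LTP's actual Trie code: the function name `forward_max_match`, the `inclusive` flag, and the starting values of `curr_len` and `end` are assumptions made for illustration. Only the two loop conditions are taken from the lines quoted above. It shows how a strict `<` bound can stop a toy matcher from ever testing a candidate as long as the longest dictionary word; whether that is what happens inside LTP is not established by this sketch.

```python
def forward_max_match(text, words, inclusive=True):
    """Return (start, end) spans of dictionary words found greedily in text.

    Hypothetical illustration only, not LTP's implementation.
    inclusive=True  mimics `curr_len <= max_len`
    inclusive=False mimics `curr_len <  max_len`
    """
    max_len = max((len(w) for w in words), default=0)
    # Longest candidate length the inner loop is allowed to test.
    limit = max_len if inclusive else max_len - 1
    text_len = len(text)
    spans = []
    start = 0
    while start < text_len:
        matched_end = None
        curr_len = 1          # assumed starting value
        end = start + 1
        # Grow the candidate one character at a time, keeping the longest hit.
        while end <= text_len and curr_len <= limit:
            if text[start:end] in words:
                matched_end = end
            curr_len += 1
            end += 1
        if matched_end is None:
            start += 1        # no dictionary word starts here
        else:
            spans.append((start, matched_end))
            start = matched_end
    return spans


if __name__ == "__main__":
    words = {"跟谁学"}
    print(forward_max_match("跟谁学", words, inclusive=False))  # []
    print(forward_max_match("跟谁学", words, inclusive=True))   # [(0, 3)]
```

With the strict bound, the three-character candidate is never tested, so the only dictionary word is missed. With the inclusive bound, the whole sentence is matched as one span.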

I'm not sure whether the author designed it this way on purpose or whether it is a small bug.

Also, LTP4 is really great to use. An all-time favorite.


wangyuxinwhy · 2020-12-28T06:54:51Z

After applying the change above, the same code produces the following output:

[['跟谁学']], {'word_cls': tensor([[[-4.0777e-01, 4.0268e-01, 1.2457e-01, -1.5838e- ...

This feels more intuitive.

AlongWY · 2020-12-28T08:12:06Z

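That comparison was probably written incorrectly, but it is not the cause here. The problem comes from the word segmentation tags.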

AlongWY · 2020-12-28T08:16:14Z

You can try version 4.1.3.post1 shortly. I tested it, and this issue is fixed there.

AlongWY added a commit that referenced this issue Dec 28, 2020

Fix inconsistent segmentation caused by the word segmentation vocabulary #466

0f3c956

AlongWY closed this as completed Dec 29, 2020
