With a custom dictionary, the model produces an unexpected segmentation when the whole sentence is a dictionary word. #466

Closed
wangyuxinwhy opened this issue Dec 28, 2020 · 3 comments

wangyuxinwhy commented Dec 28, 2020

ltp version: 4.1.1

from ltp import LTP

ltp = LTP("base")
ltp.add_words("跟谁学")
print(ltp.seg(['跟谁学']))

Output:
([['跟', '谁', '学']], {'word_cls': tensor([[[-4.0777e-01, 4.0268e-01, 1.2457e-01, -1.5838e-01, 6.1400e-03 ...

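I expected the whole sentence to be treated as a single word. In the source, in the maximum_forward_matching method of ltp.algorithms.Trie, changing
while end <= text_len and curr_len < max_len: to while end <= text_len and curr_len <= max_len: gives the correct segmentation.
That is, change < to <=.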
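To make the off-by-one easier to see, here is a minimal, self-contained sketch of greedy forward maximum matching. It is not LTP's actual code: the function name `forward_max_match`, the `inclusive` flag, and the use of a plain set instead of a trie are all assumptions made for illustration. Only the loop condition mirrors the line quoted above.

```python
def forward_max_match(text, vocab, inclusive=True):
    """Greedy forward maximum matching over a set of dictionary words.

    Hypothetical helper for illustration only, not LTP's implementation.
    Returns (start, end) spans of the dictionary words that were matched.
    """
    max_len = max((len(word) for word in vocab), default=0)
    text_len = len(text)
    spans = []
    start = 0
    while start < text_len:
        best_end = None
        end = start + 1
        curr_len = 1
        # inclusive=True  -> `curr_len <= max_len` (the proposed change)
        # inclusive=False -> `curr_len <  max_len` (the reported behaviour)
        while end <= text_len and (
            curr_len <= max_len if inclusive else curr_len < max_len
        ):
            if text[start:end] in vocab:
                best_end = end  # remember the longest match so far
            end += 1
            curr_len += 1
        if best_end is None:
            start += 1  # no dictionary word starts here
        else:
            spans.append((start, best_end))
            start = best_end
    return spans


vocab = {"跟谁学"}
print(forward_max_match("跟谁学", vocab, inclusive=True))   # [(0, 3)]
print(forward_max_match("跟谁学", vocab, inclusive=False))  # []
```

In this sketch the strict comparison never tries a candidate whose length equals `max_len`, so the longest dictionary words are never matched. Whether that is the whole story in LTP itself is a separate question.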

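I'm not sure whether the author designed it this way on purpose or whether it is a small bug.

Also, LTP4 is really great to use. The GOAT, forever.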

wangyuxinwhy (Author) commented

After making the change described above, the same code produces the following output:

[['跟谁学']], {'word_cls': tensor([[[-4.0777e-01, 4.0268e-01, 1.2457e-01, -1.5838e- ...

This feels more intuitive.

AlongWY (Contributor) commented Dec 28, 2020

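That comparison was probably written incorrectly, but it is not the cause here. The problem comes from the word segmentation tags.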

AlongWY (Contributor) commented Dec 28, 2020

You can try version 4.1.3.post1 a little later. I tested it and this problem is already fixed there.

AlongWY closed this as completed Dec 29, 2020