A fast concatenated-word segmentation library written in Rust, inspired by wordninja and wordsegment. The Python binding uses PyO3 to interface with the Rust implementation.
pip3 install pywordsegment
import pywordsegment
# The internal UNIGRAMS and BIGRAMS corpora are lazily initialized
# once for the whole module. Creating multiple WordSegmenter instances
# does not create new dictionaries.
word_segmenter = pywordsegment.WordSegmenter()
# Segment a concatenated string into its component words
word_segmenter.segment(
text="theusashops",
)
# ["the", "usa", "shops"]
# Check whether substring is one of the whole segments of text,
# not merely a prefix or fragment of a longer segment.
word_segmenter.exist_as_segment(
substring="inter",
text="internationalairport",
)
# False
word_segmenter.exist_as_segment(
substring="inter",
text="intermilan",
)
# True
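The difference between the two results above can be illustrated with a small, self-contained sketch. This is not the library's implementation: it assumes a made-up unigram table (`TOY_UNIGRAMS`) and uses only unigram scores with dynamic programming, whereas the library ships its own unigram and bigram corpora and runs in Rust. The names `toy_segment` and `toy_exist_as_segment` are hypothetical and exist only for this example.

```python
import math

# Hypothetical toy counts, invented for this sketch only.
TOY_UNIGRAMS = {
    "the": 500, "usa": 50, "shops": 20, "shop": 30, "us": 100,
    "inter": 10, "milan": 10, "international": 40, "airport": 40,
}
TOTAL = sum(TOY_UNIGRAMS.values())


def word_score(word):
    """Log-probability of a word; unknown words are penalized by length."""
    count = TOY_UNIGRAMS.get(word)
    if count is None:
        return math.log(10.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def toy_segment(text, max_word_len=20):
    """Return the highest-scoring split of text into words."""
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            prev_score, prev_words = best[start]
            word = text[start:end]
            candidates.append((prev_score + word_score(word), prev_words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


def toy_exist_as_segment(substring, text):
    """True only if substring is a whole word in the segmentation of text."""
    return substring in toy_segment(text)


print(toy_segment("theusashops"))
print(toy_exist_as_segment("inter", "internationalairport"))
print(toy_exist_as_segment("inter", "intermilan"))
```

With this toy table, "internationalairport" splits into "international" and "airport", so "inter" is only a prefix of a longer segment, while "intermilan" splits into "inter" and "milan", where "inter" is a segment in its own right.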
Distributed under the MIT License. See LICENSE for more information.
Gal Ben David
Project Link: https://github.com/intsights/pywordsegment