Concatenated-word segmentation Python library written in Rust

Intsights/PyWordSegment

About The Project

A fast concatenated-word segmentation library written in Rust, inspired by wordninja and wordsegment. The Python binding uses pyo3 to interact with the Rust package.

Installation

pip3 install pywordsegment

Usage

import pywordsegment

# The internal UNIGRAMS and BIGRAMS corpora are lazily initialized
# once for the whole module, so creating multiple WordSegmenter
# instances does not build new dictionaries.
word_segmenter = pywordsegment.WordSegmenter()

# Segment a concatenated string into its component words
word_segmenter.segment(
    text="theusashops",
)
# ["the", "usa", "shops"]


# Check whether the substring appears as a whole segment of text,
# not merely as part of a longer word.
word_segmenter.exist_as_segment(
    substring="inter",
    text="internationalairport",
)
# False

word_segmenter.exist_as_segment(
    substring="inter",
    text="intermilan",
)
# True
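
To make the results above easier to reason about, here is a minimal pure-Python sketch of unigram-based segmentation in the spirit of wordninja. It is not the library's Rust implementation and does not use its corpora. TOY_UNIGRAM_COUNTS, word_cost, toy_segment, toy_exist_as_segment, and the unknown-word penalty are all hypothetical names and values chosen for illustration. The sketch also assumes that exist_as_segment is equivalent to checking membership in the segmented output, which matches the two examples above but is not confirmed by the project.

```python
import math

# Hypothetical toy word counts; the real library ships its own
# UNIGRAMS and BIGRAMS corpora, which are not reproduced here.
TOY_UNIGRAM_COUNTS = {
    "the": 500, "usa": 50, "shops": 20, "shop": 30, "us": 100, "a": 400,
    "inter": 10, "milan": 10, "international": 40, "airport": 30,
}
TOTAL = sum(TOY_UNIGRAM_COUNTS.values())
MAX_WORD_LEN = max(map(len, TOY_UNIGRAM_COUNTS))


def word_cost(word):
    """Negative log-probability of a word; unknown words are penalized."""
    count = TOY_UNIGRAM_COUNTS.get(word)
    if count is None:
        # Assumed penalty: a flat cost per character for unknown chunks.
        return 10.0 * len(word)
    return -math.log(count / TOTAL)


def toy_segment(text):
    """Split text into the lowest-cost word sequence (dynamic programming)."""
    # best[i] holds (total cost, start index of the last word) for text[:i].
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        candidates = (
            (best[start][0] + word_cost(text[start:end]), start)
            for start in range(max(0, end - MAX_WORD_LEN), end)
        )
        best.append(min(candidates))

    # Walk the stored start indices backwards to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def toy_exist_as_segment(substring, text):
    """Assumed semantics: True only if substring is one of the segments."""
    return substring in toy_segment(text)
```

With this toy vocabulary, "internationalairport" splits into "international" and "airport", so "inter" is not a segment, while "intermilan" splits into "inter" and "milan", so it is.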

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - [email protected]

Project Link: https://github.com/intsights/pywordsegment
