GMT

Golang port of Moses tokenizer and normalizer

The original implementations are available in the following repositories:

  1. Sacremoses
  2. mosesdecoder

Features & Limitations

Currently, the port covers only the tokenizer and normalizer, for English and other non-Chinese languages. The original sacremoses also provides a detokenizer and truecaser, which are not yet implemented here.

Install

go get github.com/akurniawan/GMT

Usage

Tokenizer

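tokenizer := NewTokenizer("en")
// \u escapes give the runes » (U+00BB), … (U+2026) and ¿ (U+00BF);
// in a Go string literal, "\xbb" would be a single raw byte, not valid UTF-8.
text := "This, weird\u00bb symbols\u2026 appearing everywhere\u00bf"
expected := "This , weird \u00bb symbols \u2026 appearing everywhere \u00bf"
tokenized := tokenizer.Tokenize(text, false, true)
println(tokenized == expected)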
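
To illustrate the kind of transformation shown in the tokenizer example, here is a minimal, self-contained sketch that pads punctuation and symbol characters with spaces. It is not GMT's implementation: the function name `splitPunct` and both regular expressions are assumptions made for illustration, and the real Moses rules handle many more cases (abbreviations, apostrophes, escaping, language-specific prefixes).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonWord matches any single character that is not a letter, a digit,
// or ASCII whitespace. (Hypothetical pattern, not taken from GMT.)
var nonWord = regexp.MustCompile(`[^\p{L}\p{N}\s]`)

// multiSpace matches runs of ASCII whitespace.
var multiSpace = regexp.MustCompile(`\s+`)

// splitPunct surrounds every punctuation or symbol character with spaces,
// then collapses repeated whitespace and trims the ends.
func splitPunct(text string) string {
	padded := nonWord.ReplaceAllString(text, " ${0} ")
	return strings.TrimSpace(multiSpace.ReplaceAllString(padded, " "))
}

func main() {
	text := "This, weird\u00bb symbols\u2026 appearing everywhere\u00bf"
	fmt.Println(splitPunct(text))
}
```

On the example sentence this sketch yields the same string as the expected value above, but only because that sentence contains no cases where the Moses rules differ from naive padding.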

Normalizer

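normalizer := NewNormalizer("en", true, true, true, false, false)
text := "12\u00A0123"
expected := "12.123"
// Assumed call: the method name and arguments on this line are not
// recoverable from the original README.
normalized := normalizer.Normalize(text)
println(normalized == expected)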
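
The normalizer example turns a non-breaking space between digits into a period. A minimal sketch of that single rule, assuming it behaves like the corresponding substitution in the Moses punctuation normalizer for English, might look like the following. The name `normalizeNumber` and the pattern are hypothetical and not part of GMT's API.

```go
package main

import (
	"fmt"
	"regexp"
)

// digitNBSPDigit matches a non-breaking space (U+00A0) that sits
// between two ASCII digits. (Hypothetical pattern, not taken from GMT.)
var digitNBSPDigit = regexp.MustCompile(`(\d)\x{00A0}(\d)`)

// normalizeNumber replaces each non-overlapping "digit, NBSP, digit"
// match with "digit.digit" and leaves all other text unchanged.
func normalizeNumber(text string) string {
	return digitNBSPDigit.ReplaceAllString(text, "${1}.${2}")
}

func main() {
	fmt.Println(normalizeNumber("12\u00A0123")) // prints "12.123"
}
```

A full normalizer applies many such substitutions in sequence (quotes, dashes, spacing around punctuation); this sketch covers only the number rule exercised by the example.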
