Tokenizer

Segment text and create Doc objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy’s tokenizer works.
The tokenizer is typically created automatically when a Language subclass is initialized, and it reads its settings, such as punctuation and special-case rules, from the Language.Defaults provided by the language subclass.
Tokenizer.__init__ method

Create a Tokenizer that produces Doc objects from unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.
Name | Description |
---|---|
vocab | A storage container for lexical types. Vocab |
rules | Exceptions and special-cases for the tokenizer. Optional[Dict[str, List[Dict[int, str]]]] |
prefix_search | A function matching the signature of re.compile(string).search to match prefixes. Optional[Callable[[str], Optional[Match]]] |
suffix_search | A function matching the signature of re.compile(string).search to match suffixes. Optional[Callable[[str], Optional[Match]]] |
infix_finditer | A function matching the signature of re.compile(string).finditer to find infixes. Optional[Callable[[str], Iterator[Match]]] |
token_match | A function matching the signature of re.compile(string).match to find token matches. Optional[Callable[[str], Optional[Match]]] |
url_match | A function matching the signature of re.compile(string).match to find token matches after considering prefixes and suffixes. Optional[Callable[[str], Optional[Match]]] |
faster_heuristics v3.3.0 | Whether to restrict the final Matcher-based pass for rules to those containing affixes or space. Defaults to True. bool |
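As a minimal sketch, a tokenizer with custom affix rules might be constructed as follows. The regular expressions below are illustrative placeholders, not spaCy's default patterns:

```python
import re

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

# Illustrative rules: strip a leading "(" or "[", strip a trailing ")" or "]",
# and split on a hyphen between word characters.
prefix_re = re.compile(r"^[\[\(]")
suffix_re = re.compile(r"[\]\)]$")
infix_re = re.compile(r"(?<=\w)-(?=\w)")

tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)
doc = tokenizer("(well-known)")
print([t.text for t in doc])  # -> ['(', 'well', '-', 'known', ')']
```

Any argument left out falls back to None, disabling that stage of the affix handling, so a custom tokenizer usually combines its own patterns with the defaults from nlp.Defaults when full coverage is needed.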
Tokenizer.__call__ method
Tokenize a string.
Name | Description |
---|---|
string | The string to tokenize. str |
RETURNS | A container for linguistic annotations. Doc |
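For example, calling the tokenizer of a blank English pipeline directly on a string:

```python
import spacy

nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# Calling the tokenizer returns a Doc with punctuation split off.
doc = tokenizer("Hello, world!")
print([t.text for t in doc])  # -> ['Hello', ',', 'world', '!']
```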
Tokenizer.pipe method
Tokenize a stream of texts.
Name | Description |
---|---|
texts | A sequence of unicode texts. Iterable[str] |
batch_size | The number of texts to accumulate in an internal buffer. Defaults to 1000 . int |
YIELDS | The tokenized Doc objects, in order. Doc |
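A short sketch of streaming tokenization over a list of texts:

```python
import spacy

nlp = spacy.blank("en")

texts = ["First document.", "Second document."]
# pipe yields one Doc per input text, in the original order.
docs = list(nlp.tokenizer.pipe(texts, batch_size=50))
print([t.text for t in docs[0]])  # -> ['First', 'document', '.']
```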
Tokenizer.find_infix method
Find internal split points of the string.
Name | Description |
---|---|
string | The string to split. str |
RETURNS | A list of re.MatchObject objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens. List[Match] |
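For instance, with the default English rules, which treat a hyphen between letters as an infix:

```python
import spacy

nlp = spacy.blank("en")

# Each match marks an internal split point, here the hyphen at index 4.
matches = nlp.tokenizer.find_infix("well-known")
print([(m.start(), m.end()) for m in matches])  # -> [(4, 5)]
```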
Tokenizer.find_prefix method
Find the length of a prefix that should be segmented from the string, or None
if no prefix rules match.
Name | Description |
---|---|
string | The string to segment. str |
RETURNS | The length of the prefix if present, otherwise None . Optional[int] |
Tokenizer.find_suffix method
Find the length of a suffix that should be segmented from the string, or None
if no suffix rules match.
Name | Description |
---|---|
string | The string to segment. str |
RETURNS | The length of the suffix if present, otherwise None . Optional[int] |
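Both affix lookups can be sketched with the default English rules, which treat brackets as prefixes and suffixes:

```python
import spacy

nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

# Length of the leading "(" to split off the front of the string.
print(tokenizer.find_prefix("(Hello"))  # -> 1
# Length of the trailing ")" to split off the end of the string.
print(tokenizer.find_suffix("Hello)"))  # -> 1
```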
Tokenizer.add_special_case method
Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on the languages data and tokenizer special cases for more details and examples.
Name | Description |
---|---|
string | The string to specially tokenize. str |
token_attrs | A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated. Iterable[Dict[int, str]] |
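For example, a special case that tokenizes "gimme" as two tokens:

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("en")

# The ORTH values must concatenate back to the original string exactly.
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

print([t.text for t in nlp("gimme that")])  # -> ['gim', 'me', 'that']
```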
Tokenizer.explain method
Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The tokens produced are identical to Tokenizer.__call__ except for whitespace tokens.
Name | Description |
---|---|
string | The string to tokenize with the debugging tokenizer. str |
RETURNS | A list of (pattern_string, token_string) tuples. List[Tuple[str, str]] |
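For example, debugging how a contraction in parentheses is split:

```python
import spacy

nlp = spacy.blank("en")

# Each tuple pairs the rule that fired (e.g. PREFIX, SUFFIX, SPECIAL-n)
# with the token it produced.
for pattern, token in nlp.tokenizer.explain("(don't)"):
    print(pattern, "\t", token)
```

With the default English rules, this reports a PREFIX match for "(", special-case matches for "do" and "n't", and a SUFFIX match for ")".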
Tokenizer.to_disk method
Serialize the tokenizer to disk.
Name | Description |
---|---|
path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. Union[str, Path] |
keyword-only | |
exclude | String names of serialization fields to exclude. Iterable[str] |
Tokenizer.from_disk method
Load the tokenizer from disk. Modifies the object in place and returns it.
Name | Description |
---|---|
path | A path to a directory. Paths may be either strings or Path-like objects. Union[str, Path] |
keyword-only | |
exclude | String names of serialization fields to exclude. Iterable[str] |
RETURNS | The modified Tokenizer object. Tokenizer |
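A round-trip sketch of saving a tokenizer and loading it back into a fresh object that shares the same vocab (the file name is arbitrary):

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"
    nlp.tokenizer.to_disk(path)

    # from_disk modifies the tokenizer in place and returns it.
    tokenizer = Tokenizer(nlp.vocab).from_disk(path)

tokens = [t.text for t in tokenizer("Hello, world!")]
print(tokens)  # -> ['Hello', ',', 'world', '!']
```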
Tokenizer.to_bytes method
Serialize the tokenizer to a bytestring.
Name | Description |
---|---|
keyword-only | |
exclude | String names of serialization fields to exclude. Iterable[str] |
RETURNS | The serialized form of the Tokenizer object. bytes |
Tokenizer.from_bytes method
Load the tokenizer from a bytestring. Modifies the object in place and returns it.
Name | Description |
---|---|
bytes_data | The data to load from. bytes |
keyword-only | |
exclude | String names of serialization fields to exclude. Iterable[str] |
RETURNS | The Tokenizer object. Tokenizer |
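The same round trip can be done in memory, for example to copy a tokenizer's settings into a new instance:

```python
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

# Serialize to a bytestring, then restore into a fresh tokenizer
# that shares the same vocab.
tokenizer_bytes = nlp.tokenizer.to_bytes()
new_tokenizer = Tokenizer(nlp.vocab).from_bytes(tokenizer_bytes)

print([t.text for t in new_tokenizer("Hello, world!")])
```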
Attributes
Name | Description |
---|---|
vocab | The vocab object of the parent Doc . Vocab |
prefix_search | A function to find segment boundaries from the start of a string. Returns the length of the segment, or None . Optional[Callable[[str], Optional[Match]]] |
suffix_search | A function to find segment boundaries from the end of a string. Returns the length of the segment, or None . Optional[Callable[[str], Optional[Match]]] |
infix_finditer | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of re.MatchObject objects. Optional[Callable[[str], Iterator[Match]]] |
token_match | A function matching the signature of re.compile(string).match to find token matches. Returns an re.MatchObject or None . Optional[Callable[[str], Optional[Match]]] |
rules | A dictionary of tokenizer exceptions and special cases. Optional[Dict[str, List[Dict[int, str]]]] |
Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.
Name | Description |
---|---|
vocab | The shared Vocab . |
prefix_search | The prefix rules. |
suffix_search | The suffix rules. |
infix_finditer | The infix rules. |
token_match | The token match expression. |
exceptions | The tokenizer exception rules. |