#tokenizer #parser #symbols #words #numbers #operator #customizable

tinytoken

A library for tokenizing text into words, numbers, symbols, and more, with customizable parsing options.

5 releases

0.1.4 Nov 12, 2024
0.1.3 Nov 12, 2024
0.1.2 Nov 9, 2024
0.1.1 Nov 9, 2024
0.1.0 Nov 9, 2024

#1391 in Text processing

28 downloads per month

MIT license

27KB
616 lines

tinytoken

This library provides a tokenizer that splits text input into categorized tokens: words, numbers, strings, characters, symbols, and operators. Configurable options control the tokenization rules and accepted formats, giving fine-grained control over how the input is parsed.

Example

use tinytoken::{Tokenizer, TokenizerBuilder, Choice};

fn main() {
    let tokenizer = TokenizerBuilder::new()
        .parse_char_as_string(true)
        .allow_digit_separator(Choice::Yes('_'))
        .add_symbol('$')
        .add_operators(&['+', '-'])
        .build("let x = 123_456 + 0xFF");

    match tokenizer.tokenize() {
        Ok(tokens) => {
            for token in tokens {
                println!("{:?}", token);
            }
        }
        Err(err) => {
            eprintln!("Tokenization error: {err}");
        }
    }
}
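
To illustrate what options such as a digit separator or extra operators mean in practice, here is a minimal, self-contained sketch of a categorizing tokenizer. It is not tinytoken's implementation and does not use its API: the `Token` enum, the `tokenize` function, and its parameters are hypothetical names chosen for this illustration, and the real crate may categorize input differently.

```rust
/// Token categories used by this sketch (hypothetical, not tinytoken's types).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Operator(char),
    Symbol(char),
}

/// Splits `input` into tokens.
/// `digit_separator`, when set, may appear inside a number and is dropped
/// from the number's text. `operators` lists the characters reported as
/// operators; any other punctuation is reported as a symbol.
fn tokenize(input: &str, digit_separator: Option<char>, operators: &[char]) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();

    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Whitespace only separates tokens.
            chars.next();
        } else if c.is_ascii_digit() {
            // A number starts with a digit; letters are kept so that
            // prefixed literals such as 0xFF stay in one token.
            let mut text = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_alphanumeric() {
                    text.push(d);
                } else if Some(d) != digit_separator {
                    break;
                }
                // Consumes the character; a separator is skipped, not stored.
                chars.next();
            }
            tokens.push(Token::Number(text));
        } else if c.is_alphabetic() || c == '_' {
            // A word is a run of letters, digits, and underscores.
            let mut text = String::new();
            while let Some(&w) = chars.peek() {
                if w.is_alphanumeric() || w == '_' {
                    text.push(w);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Word(text));
        } else {
            chars.next();
            if operators.contains(&c) {
                tokens.push(Token::Operator(c));
            } else {
                tokens.push(Token::Symbol(c));
            }
        }
    }
    tokens
}
```

In this sketch, calling `tokenize("let x = 123_456 + 0xFF", Some('_'), &['=', '+', '-'])` yields two words, the `=` operator, the number `123456`, the `+` operator, and the number `0xFF`. Passing `None` as the separator makes `1_000` split into the number `1` followed by the word `_000`, which is the kind of difference a configurable digit separator is meant to control.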

Contributions

Feel free to send a PR to improve or extend the tool's capabilities.

No runtime deps