- Open-source system-level translation framework
- Provides fluent and natural translations utilizing LLMs
- Ensures privacy and security with local translation processes
- Capable of zero-shot in-task translations
- Utilizes QLoRA fine-tuned models for enhanced accuracy
- Employs both general and in-task specific translation memories and glossaries
- Incorporates preceding text in document-level translations for improved context understanding
- Combining QLoRA with in-task translation memory and glossary yielded a ~45% increase in aggregated WMT23 translation scores, benchmarked against the baseline Mistral 7B Instruct model
- Demonstrated high recall for valid translation memories and glossaries, including previous translations and character names
- Surpassed the performance of the native TowerInstruct model in three (Ja<->En, Zh->En) of the four WMT23 language directions tested
- Outperformed DeepL in translating the Japanese web novel "That Time I Got Reincarnated as a Slime" into Chinese using in-task RAG
- Japanese to Chinese translation improvements over DeepL:
  - +29% SacreBLEU
  - +0.4% COMET22
See the write-up for more details!
Simply run:

```bash
pip install t-ragx
```
or if you are feeling lucky:

```bash
pip install git+https://github.com/rayliuca/T-Ragx.git
```
See the wiki page for instructions
Note: you can access the preview read-only T-Ragx Elasticsearch services at https://t-ragx-fossil.rayliu.ca and https://t-ragx-fossil2.rayliu.ca (but you will need a personal Elasticsearch service to add your own in-task memories)
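For example, adding an in-task memory entry to your own Elasticsearch service might look like the sketch below. The index name and field names are illustrative assumptions, not the actual schema T-Ragx expects; see the wiki for the real setup:

```python
from elasticsearch import Elasticsearch

# Connect to your personal Elasticsearch service
es = Elasticsearch("http://localhost:9200")

# Index a single translation-memory entry; "my-in-task-memory" and the
# field names below are illustrative assumptions, NOT the schema T-Ragx uses
es.index(
    index="my-in-task-memory",
    document={
        "source_text": "転生したらスライムだった件",
        "target_text": "That Time I Got Reincarnated as a Slime",
        "source_lang": "ja",
        "target_lang": "en",
    },
)
```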
Download the conda environment.yml file and run:

```bash
conda env create -f environment.yml
# or with mamba
# mamba env create -f environment.yml
```
This will create a `t_ragx` environment that's compatible with this project
Download the requirements.txt file, use your favourite virtual environment, and run:

```bash
pip install -r requirements.txt
```
Initialize the input processor:

```python
import t_ragx

# Initialize the input processor, which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()

# Load/point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])
```
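The processor also exposes `search_memory`, which `TRagx` calls under the hood (see `memory_search_args` below). A rough sketch of calling it directly; other than `top_k`, the argument names here are assumptions, so check the source for the actual signature:

```python
# Rough sketch of a direct memory lookup; apart from top_k (forwarded via
# memory_search_args below) the argument names are assumptions
results = input_processor.search_memory(
    "転生したらスライムだった件",  # query text
    source_lang_code='ja',
    target_lang_code='en',
    top_k=3,
)
print(results)
```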
Using the llama-cpp-python backend:

```python
import t_ragx

# T-Ragx currently supports:
#   Hugging Face transformers: MistralModel, InternLM2Model
#   Ollama API: OllamaModel
#   OpenAI API: OpenAIModel
#   llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    # see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
    # for the other quantization files
    chat_format="mistral-instruct",
    model_config={'n_ctx': 2048},  # increase the context window
)

t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)
```
Translate!

```python
t_ragx_translator.batch_translate(
    source_text_list,  # the list of input text to translate
    # optional: preceding context for document-level translation, which can be
    # generated via:
    # pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
    pre_text_list=pre_text_list,
    source_lang_code='ja',
    target_lang_code='en',
    memory_search_args={'top_k': 3},  # optional: extra arguments passed to input_processor.search_memory
)
```
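Putting it together, a minimal end-to-end run might look like this (the sample sentences are illustrative, and it assumes `batch_translate` returns the translated strings):

```python
source_text_list = [
    "転生したらスライムだった件",
    "大賢者のスキルを獲得しました。",
]

# Build document-level context from the preceding sentences
pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)

translations = t_ragx_translator.batch_translate(
    source_text_list,
    pre_text_list=pre_text_list,
    source_lang_code='ja',
    target_lang_code='en',
)
print(translations)
```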
Note: you can use any LLM through the API models (e.g. `OllamaModel` or `OpenAIModel`) or by extending the `t_ragx.models.BaseModel` class
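A minimal sketch of such an extension is below. The method name `generate` and its signature are assumptions, not the documented interface; check `t_ragx.models.BaseModel` for the methods you actually need to override:

```python
import t_ragx

class EchoModel(t_ragx.models.BaseModel):
    """Toy custom backend sketch -- the overridden method name and signature
    are assumptions; see t_ragx.models.BaseModel for the real interface."""

    def generate(self, prompt: str, **kwargs) -> str:
        # Replace this with a call to your own LLM endpoint or local model;
        # here we just echo the prompt back as a placeholder
        return prompt
```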
The following models were fine-tuned using the T-Ragx prompts, so they might work a bit better with T-Ragx than some off-the-shelf models
| Source Model | Model Type | Quantization | Fine-tuned Model |
|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | LoRA | | rayliuca/TRagx-Mistral-7B-Instruct-v0.2 |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-Mistral-7B-Instruct-v0.2 |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2 |
| mlabonne/NeuralOmniBeagle-7B | LoRA | | rayliuca/TRagx-NeuralOmniBeagle-7B |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-NeuralOmniBeagle-7B |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B |
| internlm/internlm2-7b | LoRA | | rayliuca/TRagx-internlm2-7b |
| | merged GPTQ | GPTQ | rayliuca/TRagx-GPTQ-internlm2-7b |
| Unbabel/TowerInstruct-7B-v0.2 | LoRA | | rayliuca/TRagx-TowerInstruct-7B-v0.2 |
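To try one of the LoRA checkpoints outside of T-Ragx, a minimal sketch using the peft library (assuming the repos are published as standard PEFT adapter checkpoints):

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained("rayliuca/TRagx-Mistral-7B-Instruct-v0.2")
# The tokenizer comes from the base model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```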
All of the datasets used in the project:
| Dataset | Translation Memory | Glossary | Training | Testing | License |
|---|---|---|---|---|---|
| OpenMantra | ✅ | | | ✅ | CC BY-NC 4.0 |
| WMT < 2023 | ✅ | | ✅ | | for research |
| ParaMed | ✅ | | ✅ | | cc-by-4.0 |
| ted_talks_iwslt | ✅ | | ✅ | | cc-by-nc-nd-4.0 |
| JESC | ✅ | | ✅ | | CC BY-SA 4.0 |
| MTNT | ✅ | | | | Custom/ Reddit API |
| WCC-JC | ✅ | | ✅ | | for research |
| ASPEC | ✅ | | | | custom, for research |
| All other ja-en/zh-en OPUS data | ✅ | | | | mix of open licenses: check https://opus.nlpl.eu/ |
| Wikidata | | ✅ | | | CC0 |
| Tensei Shitara Slime Datta Ken Wiki | | ✔️ in-task | | | CC BY-SA |
| WMT 2023 | | | | ✅ | for research |
| Tensei Shitara Slime Datta Ken Web Novel & web translations | ✔️ in-task | | | ✅ | Not used for training or redistribution |