ParsBench

ParsBench provides toolkits for benchmarking Large Language Models (LLMs) on the Persian language. It includes a range of tasks for evaluating LLMs on different topics, benchmarking tools for comparing and ranking multiple models, and an easy, fully customizable API for creating custom models, tasks, scores, and benchmarks.

Key Features

Variety of Tasks: Evaluate LLMs across various topics.
Benchmarking Tools: Compare and rank multiple models.
Customizable API: Create custom models, tasks, scores, and benchmarks with ease.

Motivation

I was trying to fine-tune an open-source LLM for the Persian language and needed a way to evaluate its performance and utility. That led me to this paper. Its authors did great work preparing datasets and evaluation methods to test ChatGPT, and they even shared their code in this repository.

So I decided to build a handy framework that includes various tasks and datasets for evaluating LLMs on the Persian language. This library reuses parts of their work (datasets, metrics, and basic prompt templates).

Installation

Install the Math Equivalence package manually:

pip install git+https://github.com/hendrycks/math.git

Install ParsBench using pip:

pip install parsbench
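
To confirm that both installs succeeded before running an evaluation, a small check like the one below can help. It is only a sketch: `missing_modules` is a hypothetical helper, and `math_equivalence` is an assumed import name for the Math Equivalence package, so adjust it if your environment differs.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# "parsbench" is the import name used in this README; "math_equivalence" is an
# assumed import name for the Math Equivalence package and may differ.
missing = missing_modules(["parsbench", "math_equivalence"])
print("Missing modules:", missing or "none")
```

An empty result means both modules were found; any name listed still needs to be installed.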

Usage

Evaluating a PreTrained Model

Load the pre-trained model and tokenizer from HuggingFace, then evaluate the model using the PersianMath task:

from transformers import AutoModelForCausalLM, AutoTokenizer

from parsbench.models import PreTrainedTransformerModel
from parsbench.tasks import PersianMath

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

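# Wrap the model and tokenizer in ParsBench's model interface.
tf_model = PreTrainedTransformerModel(model=model, tokenizer=tokenizer)

# Evaluate the wrapped model on the PersianMath task.
with PersianMath() as task:
    results = task.evaluate(tf_model)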

Benchmarking Multiple Models with Multiple Tasks

For example, we run our local models using Ollama:

ollama run qwen2
ollama run aya

Then we benchmark those models using ParsBench:

from parsbench.benchmarks import CustomBenchmark
from parsbench.models import OpenAIModel
from parsbench.tasks import ParsiNLUMultipleChoice, PersianMath, ParsiNLUReadingComprehension

# Ollama serves an OpenAI-compatible API on this local address;
# the secret key is only a placeholder value.
qwen2_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="qwen2:latest",
)
aya_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="aya:latest",
)

benchmark = CustomBenchmark(
    models=[qwen2_model, aya_model],
    tasks=[
        ParsiNLUMultipleChoice,
        ParsiNLUReadingComprehension,
        PersianMath,
    ],
)
result = benchmark.run(
    prompt_lang="fa",
    prompt_shots=[0, 3],
    n_first=100,
    sort_by_score=True,
)
result.show_radar_plot()
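
To get a feel for how much work the run above involves, the sketch below enumerates the combinations of models, tasks, and shot settings. It assumes the benchmark evaluates every model on every task for every entry in prompt_shots, and that n_first caps the number of samples per evaluation; both are assumptions about the behavior, and `plan_runs` is a hypothetical helper, not part of ParsBench.

```python
from itertools import product


def plan_runs(models, tasks, prompt_shots):
    """List every (model, task, shots) combination as a dict."""
    return [
        {"model": model, "task": task, "shots": shots}
        for model, task, shots in product(models, tasks, prompt_shots)
    ]


# Names mirror the benchmark example above.
models = ["qwen2:latest", "aya:latest"]
tasks = ["ParsiNLUMultipleChoice", "ParsiNLUReadingComprehension", "PersianMath"]
prompt_shots = [0, 3]
n_first = 100  # assumed to cap the samples used per evaluation

runs = plan_runs(models, tasks, prompt_shots)
print(len(runs))            # number of evaluations: 2 * 3 * 2 = 12
print(len(runs) * n_first)  # upper bound on prompts under the assumptions above
```

Under these assumptions, adding a model or a shot setting multiplies the total, which is worth keeping in mind when the models run on local hardware.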

Available Tasks

Task Name	Score Name	Dataset
ParsiNLU Sentiment Analysis	Exact Match (F1)	ParsiNLU
ParsiNLU Entailment	Exact Match (F1)	ParsiNLU
ParsiNLU Machine Translation En -> Fa	Bleu	ParsiNLU
ParsiNLU Machine Translation Fa -> En	Bleu	ParsiNLU
ParsiNLU Multiple Choice	Exact Match (Accuracy)	ParsiNLU
ParsiNLU Reading Comprehension	Common Tokens (F1)	ParsiNLU
Persian NER	NER Exact Match (F1)	PersianNER
Persian Math	Math Equivalence (Accuracy)	Source
ConjNLI Entailment	Exact Match (F1)	Source
Persian MMLU (Khayyam Challenge)	Exact Match (Accuracy)	Khayyam Challenge
FarsTail Entailment	Exact Match (F1)	FarsTail
Persian News Summary	Rouge	PNSummary
XL-Sum	Rouge	XLSum

You can import any of the task classes above from parsbench.tasks and use it to evaluate your model.
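
For readers unfamiliar with the score names in the table, here is a minimal sketch of a token-overlap F1 in the spirit of "Common Tokens (F1)". It is not ParsBench's implementation: the function name `common_tokens_f1` and the plain whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def common_tokens_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A longer answer that contains the one-word reference gets partial credit.
score = common_tokens_f1("تهران پایتخت ایران است", "تهران")
print(round(score, 2))  # 0.4
```

Unlike an exact-match score, this rewards partially correct answers, which suits reading comprehension, where a model may wrap the right span in extra words.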

Example Notebooks

Benchmark Aya models:
Benchmark Ava models:
Benchmark Dorna models:
Benchmark MaralGPT models:

Contributing

Contributions are welcome! Please refer to the contribution guidelines for more information on how to contribute.

License

ParsBench is distributed under the Apache-2.0 license.

Contact Information

For support or questions, please contact: [email protected]. Feel free to let me know if there are any additional details or changes you'd like to make!
