JMTEB is a benchmark for evaluating Japanese text embedding models. It consists of six tasks.
This repository provides an easy-to-use evaluation script for JMTEB.
```bash
git clone [email protected]:sbintuitions/JMTEB
cd JMTEB
poetry install
poetry run pytest tests
```
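As a quick sanity check after installation, you can print the CLI's help message. This sketch assumes the standard `--help` flag of a jsonargparse-based CLI; it is not documented in this README:

```bash
# Print the available CLI options (embedder choices, evaluator settings, etc.).
# Assumes the CLI exposes the standard --help flag.
poetry run python -m jmteb --help
```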
The following command evaluates the specified model on all the tasks in JMTEB.
```bash
poetry run python -m jmteb \
  --embedder SentenceBertEmbedder \
  --embedder.model_name_or_path "<model_name_or_path>" \
  --save_dir "output/<model_name_or_path>"
```
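For instance, a concrete invocation might look like the following. The model id here is just an illustrative Japanese sentence embedding model from Hugging Face, not a recommendation, and the output directory is a placeholder:

```bash
# Example run with a concrete Sentence-BERT style model; the model id and
# output directory are illustrative, not part of JMTEB itself.
poetry run python -m jmteb \
  --embedder SentenceBertEmbedder \
  --embedder.model_name_or_path "cl-nagoya/sup-simcse-ja-base" \
  --save_dir "output/sup-simcse-ja-base"
```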
> [!NOTE]
> To guarantee the robustness of the evaluation, a validation dataset is required for hyperparameter tuning. For datasets that do not provide a validation set, we use the test set as the validation set.
By default, the evaluation tasks are read from `src/jmteb/configs/jmteb.jsonnet`.
If you want to evaluate the model on a specific task, you can specify the task via the `--evaluators` option with the task config.
```bash
poetry run python -m jmteb \
  --evaluators "src/configs/tasks/jsts.jsonnet" \
  --embedder SentenceBertEmbedder \
  --embedder.model_name_or_path "<model_name_or_path>" \
  --save_dir "output/<model_name_or_path>"
```
> [!NOTE]
> Some tasks (e.g., AmazonReviewClassification in classification, JAQKET and Mr.TyDi-ja in retrieval, esci in reranking) are time- and memory-consuming: the heavy retrieval tasks take hours to encode their large corpora and require a lot of memory to store the resulting vectors. If you want to exclude them, add `--eval_exclude "['amazon_review_classification', 'mrtydi', 'jaqket', 'esci']"` (see the sketch below).
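Putting it together, a full run that skips the heavy tasks would look like this sketch; the model name and output path are placeholders:

```bash
# Evaluate on all tasks except the time- and memory-consuming ones listed above.
poetry run python -m jmteb \
  --embedder SentenceBertEmbedder \
  --embedder.model_name_or_path "<model_name_or_path>" \
  --save_dir "output/<model_name_or_path>" \
  --eval_exclude "['amazon_review_classification', 'mrtydi', 'jaqket', 'esci']"
```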