Evaluating the capabilities of large language models is difficult. There are already many public leaderboards that attempt this, but public leaderboards are prone to manipulation, and some evaluation benchmarks do not reflect real application scenarios. So I decided to create my own benchmark and use it to evaluate my favourite models.
Model | Total | Knowledge | Coding | Censorship | Instruction | Math | Extraction | Reasoning | Summarizing | Writing |
---|---|---|---|---|---|---|---|---|---|---|
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 55 | 6 | 8 | 5 | 6 | 5 | 7 | 8 | 6 | 4 |
miqu-1-70b-iq2_xs.gguf | 54 | 7 | 8 | 6 | 6 | 3 | 6 | 8 | 6 | 4 |
Smaug-34B-v0.1_Q4_K_M.gguf | 53 | 7 | 8 | 6 | 5 | 4 | 6 | 7 | 6 | 4 |
Starling-LM-7B-beta-Q8_0.gguf | 52 | 6 | 8 | 6 | 6 | 5 | 7 | 5 | 6 | 3 |
openchat-3.5-0106.Q8_0.gguf | 52 | 7 | 8 | 6 | 6 | 5 | 7 | 4 | 6 | 3 |
senku-70b-iq2_xxs.gguf | 51 | 6 | 8 | 6 | 7 | 5 | 6 | 4 | 6 | 3 |
Hermes-2-Pro-Mistral-7B.Q8_0.gguf | 48 | 6 | 8 | 4 | 5 | 5 | 6 | 6 | 6 | 2 |
Nous-Hermes-2-Mistral-7B-DPO.Q8_0.gguf | 46 | 6 | 8 | 5 | 4 | 4 | 6 | 4 | 6 | 3 |
nous-capybara-34b.Q4_K_M.gguf | 46 | 6 | 6 | 3 | 6 | 3 | 7 | 5 | 6 | 4 |
gemma-7b-it.Q8_0.gguf | 44 | 6 | 7 | 6 | 5 | 4 | 5 | 2 | 6 | 3 |
gemma-2b-it.Q8_0.gguf | 36 | 3 | 7 | 6 | 3 | 2 | 2 | 4 | 6 | 3 |
phi-2.Q8_0.gguf | 26 | 6 | 5 | 5 | 3 | 3 | 1 | 2 | 1 | 0 |
qwen1_5-1_8b-chat-q8_0.gguf | 25 | 3 | 5 | 3 | 2 | 1 | 5 | 2 | 2 | 2 |
Note:
- Due to the limitations of my GPU (24G VRAM), I can only run quantized models, so their performance should be lower than that of the original full-precision models.
- Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf
- miqu-1-70b-iq2_xs.gguf
- Smaug-34B-v0.1_Q4_K_M.gguf
- senku-70b-iq2_xxs.gguf
- openchat-3.5-0106.Q8_0.gguf
- Starling-LM-7B-beta-Q8_0.gguf
- nous-capybara-34b.Q4_K_M.gguf
- gemma-7b-it.Q8_0.gguf
- gemma-2b-it.Q8_0.gguf
- phi-2.Q8_0.gguf
- qwen1_5-1_8b-chat-q8_0.gguf
I collected 61 test questions from the Internet, covering the following categories (question counts in parentheses; a short scoring sketch follows the list):
- Knowledge (7)
- Coding (8)
- Censorship (6)
- Instruction (6)
- Math (7)
- Extraction (7)
- Reasoning (10)
- Summarizing (6)
- Writing (4)
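To make the leaderboard easier to read, here is a minimal sketch (not the actual evaluate.py code) showing that the category question counts add up to 61 and that each model's Total is simply the sum of its nine category scores:

```python
# Minimal illustration only; evaluate.py's real bookkeeping may differ.
category_question_counts = {
    "Knowledge": 7, "Coding": 8, "Censorship": 6, "Instruction": 6,
    "Math": 7, "Extraction": 7, "Reasoning": 10, "Summarizing": 6, "Writing": 4,
}
assert sum(category_question_counts.values()) == 61  # 61 questions in total

# Per-category scores for one leaderboard row (Nous-Hermes-2-Mixtral-8x7B-DPO).
scores = {
    "Knowledge": 6, "Coding": 8, "Censorship": 5, "Instruction": 6,
    "Math": 5, "Extraction": 7, "Reasoning": 8, "Summarizing": 6, "Writing": 4,
}
print("Total:", sum(scores.values()))  # -> Total: 55, matching the table above
```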
Model | Size | Required VRAM | Required GPUs |
---|---|---|---|
miqu-1-70b-iq2_xs.gguf | 19G | 23.8G | >= RTX-3090 |
senku-70b-iq2_xxs.gguf | 20G | 22G | >= RTX-3090 |
Smaug-34B-v0.1_Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
nous-capybara-34b.Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 28.4G | >24G | >= RTX-3090 |
openchat-3.5-0106.Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
Starling-LM-7B-beta-Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
gemma-7b-it.Q8_0.gguf | 9.1G | 15G | >= RTX-3080 |
gemma-2b-it.Q8_0.gguf | | | >= RTX-3070 |
phi-2.Q8_0.gguf | | | >= RTX-3070 |
qwen1_5-1_8b-chat-q8_0.gguf | | | >= RTX-3070 |
My test environment:
- GeForce RTX 4090 (24G VRAM)
- Intel I9-14900K
- 64G RAM
- Ubuntu 22.04
- Python 3.10
- llama-cpp-python
Install the dependencies with CUDA (cuBLAS) support enabled:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt
```
Download the models from Hugging Face and put the GGUF files into the `models` folder. Then create a config file for each model in the `models` folder; here is an example:
```json
{
    "name": "gemma-2b",
    "chatFormat": "gemma",
    "modelPath": "gemma-2b-it.Q8_0.gguf",
    "context": 8192
}
```
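For reference, here is a minimal sketch of how such a config could be used to load a model with llama-cpp-python. This is only an assumption about how the fields map onto the library's parameters, not the actual evaluate.py implementation; the `load_model` helper and file paths are hypothetical.

```python
import json
from llama_cpp import Llama

# Hypothetical loader: maps the example config fields onto llama-cpp-python
# parameters. evaluate.py's real implementation may differ.
def load_model(config_path: str) -> Llama:
    with open(config_path) as f:
        cfg = json.load(f)
    return Llama(
        model_path=f"models/{cfg['modelPath']}",  # GGUF file placed in the models folder
        n_ctx=cfg["context"],                     # context window size
        chat_format=cfg["chatFormat"],            # prompt template, e.g. "gemma"
        n_gpu_layers=-1,                          # offload all layers to the GPU
    )

llm = load_model("models/gemma-2b.json")
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(reply["choices"][0]["message"]["content"])
```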
Run the evaluation against a model config:

```bash
python evaluate.py -m models/gemma-7b-it.json
```