Evaluating the capabilities of large language models is difficult. There are already many public leaderboards that attempt this, but public leaderboards are prone to manipulation, and some evaluation benchmarks do not reflect real application scenarios. So I decided to create my own benchmark and use it to evaluate my favourite models.
Model | Total | Knowledge | Coding | Censorship | Instruction | Math | Extraction | Reasoning | Summarizing | Writing |
---|---|---|---|---|---|---|---|---|---|---|
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 55 | 6 | 8 | 5 | 6 | 5 | 7 | 8 | 6 | 4 |
miqu-1-70b-iq2_xs.gguf | 54 | 7 | 8 | 6 | 6 | 3 | 6 | 8 | 6 | 4 |
Smaug-34B-v0.1_Q4_K_M.gguf | 53 | 7 | 8 | 6 | 5 | 4 | 6 | 7 | 6 | 4 |
Starling-LM-7B-beta-Q8_0.gguf | 52 | 6 | 8 | 6 | 6 | 5 | 7 | 5 | 6 | 3 |
openchat-3.5-0106.Q8_0.gguf | 52 | 7 | 8 | 6 | 6 | 5 | 7 | 4 | 6 | 3 |
senku-70b-iq2_xxs.gguf | 51 | 6 | 8 | 6 | 7 | 5 | 6 | 4 | 6 | 3 |
Hermes-2-Pro-Mistral-7B.Q8_0.gguf | 48 | 6 | 8 | 4 | 5 | 5 | 6 | 6 | 6 | 2 |
Nous-Hermes-2-Mistral-7B-DPO.Q8_0.gguf | 46 | 6 | 8 | 5 | 4 | 4 | 6 | 4 | 6 | 3 |
nous-capybara-34b.Q4_K_M.gguf | 46 | 6 | 6 | 3 | 6 | 3 | 7 | 5 | 6 | 4 |
gemma-7b-it.Q8_0.gguf | 44 | 6 | 7 | 6 | 5 | 4 | 5 | 2 | 6 | 3 |
gemma-2b-it.Q8_0.gguf | 36 | 3 | 7 | 6 | 3 | 2 | 2 | 4 | 6 | 3 |
phi-2.Q8_0.gguf | 26 | 6 | 5 | 5 | 3 | 3 | 1 | 2 | 1 | 0 |
qwen1_5-1_8b-chat-q8_0.gguf | 25 | 3 | 5 | 3 | 2 | 1 | 5 | 2 | 2 | 2 |
Note:
- Due to the limitations of my GPU (24G VRAM), I can only run quantized models, so their performance should be lower than that of the original full-precision models.
- Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf
- miqu-1-70b-iq2_xs.gguf
- Smaug-34B-v0.1_Q4_K_M.gguf
- senku-70b-iq2_xxs.gguf
- openchat-3.5-0106.Q8_0.gguf
- Starling-LM-7B-beta-Q8_0.gguf
- nous-capybara-34b.Q4_K_M.gguf
- gemma-7b-it.Q8_0.gguf
- gemma-2b-it.Q8_0.gguf
- phi-2.Q8_0.gguf
- qwen1_5-1_8b-chat-q8_0.gguf
I collected 61 test questions from the Internet, covering the following categories (question counts in parentheses; a short scoring sketch follows the list):
- Knowledge (7)
- Coding (8)
- Censorship (6)
- Instruction (6)
- Math (7)
- Extraction (7)
- Reasoning (10)
- Summarizing (6)
- Writing (4)
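To make the leaderboard easier to read, here is a minimal sketch (not the actual evaluate.py code) showing that the category question counts add up to 61 and that each model's Total is simply the sum of its nine category scores:

```python
# Minimal illustration only; evaluate.py's real bookkeeping may differ.
category_question_counts = {
    "Knowledge": 7, "Coding": 8, "Censorship": 6, "Instruction": 6,
    "Math": 7, "Extraction": 7, "Reasoning": 10, "Summarizing": 6, "Writing": 4,
}
assert sum(category_question_counts.values()) == 61  # 61 questions in total

# Per-category scores for one leaderboard row (Nous-Hermes-2-Mixtral-8x7B-DPO).
scores = {
    "Knowledge": 6, "Coding": 8, "Censorship": 5, "Instruction": 6,
    "Math": 5, "Extraction": 7, "Reasoning": 8, "Summarizing": 6, "Writing": 4,
}
print("Total:", sum(scores.values()))  # -> Total: 55, matching the table above
```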
Model | Size | Required VRAM | Required GPUs |
---|---|---|---|
miqu-1-70b-iq2_xs.gguf | 19G | 23.8G | >= RTX-3090 |
senku-70b-iq2_xxs.gguf | 20G | 22G | >= RTX-3090 |
Smaug-34B-v0.1_Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
nous-capybara-34b.Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 28.4G | >24G | >= RTX-3090 |
openchat-3.5-0106.Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
Starling-LM-7B-beta-Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
gemma-7b-it.Q8_0.gguf | 9.1G | 15G | >= RTX-3080 |
gemma-2b-it.Q8_0.gguf | | | >= RTX-3070 |
phi-2.Q8_0.gguf | | | >= RTX-3070 |
qwen1_5-1_8b-chat-q8_0.gguf | | | >= RTX-3070 |
My test environment:
- GeForce RTX 4090 (24G VRAM)
- Intel I9-14900K
- 64G RAM
- Ubuntu 22.04
- Python 3.10
- llama-cpp-python
Install the dependencies with CUDA (cuBLAS) support enabled:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt
```
Download the models from Hugging Face and put the GGUF files into the `models` folder. Then create a config file for each model in the `models` folder; here is an example:
```json
{
    "name": "gemma-2b",
    "chatFormat": "gemma",
    "modelPath": "gemma-2b-it.Q8_0.gguf",
    "context": 8192
}
```
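For reference, here is a minimal sketch of how such a config could be used to load a model with llama-cpp-python. This is only an assumption about how the fields map onto the library's parameters, not the actual evaluate.py implementation; the `load_model` helper and file paths are hypothetical.

```python
import json
from llama_cpp import Llama

# Hypothetical loader: maps the example config fields onto llama-cpp-python
# parameters. evaluate.py's real implementation may differ.
def load_model(config_path: str) -> Llama:
    with open(config_path) as f:
        cfg = json.load(f)
    return Llama(
        model_path=f"models/{cfg['modelPath']}",  # GGUF file placed in the models folder
        n_ctx=cfg["context"],                     # context window size
        chat_format=cfg["chatFormat"],            # prompt template, e.g. "gemma"
        n_gpu_layers=-1,                          # offload all layers to the GPU
    )

llm = load_model("models/gemma-2b.json")
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(reply["choices"][0]["message"]["content"])
```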
Run the evaluation against a model config:

```bash
python evaluate.py -m models/gemma-7b-it.json
```