Peter's LLM Leaderboard

Evaluating the capabilities of large language models is very difficult. There are already many public leaderboards that do this work, but they are often prone to malicious manipulation, and some of their evaluation benchmarks are not suitable for real application scenarios. So I decided to create my own assessment benchmark and evaluate my favourite models with it.

Leaderboard

| Model | Total | Knowledge | Coding | Censorship | Instruction | Math | Extraction | Reasoning | Summarizing | Writing |
|---|---|---|---|---|---|---|---|---|---|---|
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 55 | 6 | 8 | 5 | 6 | 5 | 7 | 8 | 6 | 4 |
| miqu-1-70b-iq2_xs.gguf | 54 | 7 | 8 | 6 | 6 | 3 | 6 | 8 | 6 | 4 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 53 | 7 | 8 | 6 | 5 | 4 | 6 | 7 | 6 | 4 |
| Starling-LM-7B-beta-Q8_0.gguf | 52 | 6 | 8 | 6 | 6 | 5 | 7 | 5 | 6 | 3 |
| openchat-3.5-0106.Q8_0.gguf | 52 | 7 | 8 | 6 | 6 | 5 | 7 | 4 | 6 | 3 |
| senku-70b-iq2_xxs.gguf | 51 | 6 | 8 | 6 | 7 | 5 | 6 | 4 | 6 | 3 |
| Hermes-2-Pro-Mistral-7B.Q8_0.gguf | 48 | 6 | 8 | 4 | 5 | 5 | 6 | 6 | 6 | 2 |
| Nous-Hermes-2-Mistral-7B-DPO.Q8_0.gguf | 46 | 6 | 8 | 5 | 4 | 4 | 6 | 4 | 6 | 3 |
| nous-capybara-34b.Q4_K_M.gguf | 46 | 6 | 6 | 3 | 6 | 3 | 7 | 5 | 6 | 4 |
| gemma-7b-it.Q8_0.gguf | 44 | 6 | 7 | 6 | 5 | 4 | 5 | 2 | 6 | 3 |
| gemma-2b-it.Q8_0.gguf | 36 | 3 | 7 | 6 | 3 | 2 | 2 | 4 | 6 | 3 |
| phi-2.Q8_0.gguf | 26 | 6 | 5 | 5 | 3 | 3 | 1 | 2 | 1 | 0 |
| qwen1_5-1_8b-chat-q8_0.gguf | 25 | 3 | 5 | 3 | 2 | 1 | 5 | 2 | 2 | 2 |

Note:

  • Due to the limitations of my GPU (24G VRAM), I can only run quantized models, so performance should be lower than that of the original models.
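  • Each model's Total is simply the sum of its nine per-category scores. A minimal sketch of that arithmetic, using the Nous-Hermes-2-Mixtral row from the table above (the dictionary below is only an illustration, not code from this repository):

# Per-category scores for Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf,
# copied from its leaderboard row above.
scores = {
    "Knowledge": 6, "Coding": 8, "Censorship": 5, "Instruction": 6,
    "Math": 5, "Extraction": 7, "Reasoning": 8, "Summarizing": 6, "Writing": 4,
}

# The Total column is the sum of the nine category scores.
total = sum(scores.values())
print(total)  # 55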

Detailed Results

Evaluation Questions

I collected 61 test questions from the Internet. They cover the nine categories shown in the leaderboard above: knowledge, coding, censorship, instruction following, math, extraction, reasoning, summarizing, and writing.

Download Models

Model Info

| Model | Size | Required VRAM | Required GPUs |
|---|---|---|---|
| miqu-1-70b-iq2_xs.gguf | 19G | 23.8G | >= RTX-3090 |
| senku-70b-iq2_xxs.gguf | 20G | 22G | >= RTX-3090 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| nous-capybara-34b.Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 28.4G | >24G | >= RTX-3090 |
| openchat-3.5-0106.Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| Starling-LM-7B-beta-Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| gemma-7b-it.Q8_0.gguf | 9.1G | 15G | >= RTX-3080 |
| gemma-2b-it.Q8_0.gguf | | | >= RTX-3070 |
| phi-2.Q8_0.gguf | | | >= RTX-3070 |
| qwen1_5-1_8b-chat-q8_0.gguf | | | >= RTX-3070 |

Evaluation Platform

  • GeForce RTX 4090 (24G VRAM)
  • Intel I9-14900K
  • 64G RAM
  • Ubuntu 22.04
  • Python 3.10
  • llama-cpp-python

Run in local env

1. Install Dependencies

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt
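
After installing, you can quickly confirm that llama-cpp-python imported correctly. The check below is only a suggestion; the CMAKE_ARGS flag comes from the command above:

import llama_cpp

# Print the installed binding version. If the import fails, or model loading
# later reports a CPU-only build, re-run the install command above with
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" set so the CUDA backend is compiled in.
print(llama_cpp.__version__)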

2. Download Models

Download models from Hugging Face and put the GGUF files in the models folder.

3. Create model config file

Create a model config file in the models folder. Here is an example:

{
  "name": "gemma-2b",
  "chatFormat": "gemma",
  "modelPath": "gemma-2b-it.Q8_0.gguf",
  "context": 8192
}
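
The config fields map naturally onto llama-cpp-python's Llama constructor. The following is a minimal sketch of how such a config could be loaded and used to answer one test question; it illustrates what the fields mean and is not necessarily how evaluate.py implements it (the config path is just an example):

import json
from llama_cpp import Llama

# Load the model config shown above.
with open("models/gemma-2b.json") as f:
    cfg = json.load(f)

# Map the config fields onto llama-cpp-python's Llama constructor.
llm = Llama(
    model_path="models/" + cfg["modelPath"],  # GGUF file in the models folder
    n_ctx=cfg["context"],                     # context window size
    chat_format=cfg["chatFormat"],            # e.g. "gemma"
    n_gpu_layers=-1,                          # offload all layers to the GPU
)

# Ask a single test question through the chat API.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(out["choices"][0]["message"]["content"])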

4. Evaluation

python evaluate.py -m models/gemma-7b-it.json
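
To evaluate every model rather than a single one, a small wrapper can loop over all config files in the models folder and run the command above for each. This wrapper is only a convenience sketch, not part of the repository:

import glob
import subprocess

# Run evaluate.py once per model config found in the models folder,
# mirroring the single-model command shown above.
for config in sorted(glob.glob("models/*.json")):
    subprocess.run(["python", "evaluate.py", "-m", config], check=True)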
