Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

This work (GaC) allows multiple heterogeneous LLMs to form an ensemble at each generation step and collectively decide the next token. We take the union of the vocabularies of all LLMs participating in the ensemble and, at each generation step, map the probability vector produced by each LLM onto the union vocabulary. We then compute the (weighted) average of these vectors to determine the next token. Experiments show that this simple method can break the ceiling of the open-source LLM community: the ensemble outperforms any single state-of-the-art LLM. [Paper]
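
The core operation can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the repository's implementation: it uses token strings in place of real tokenizer IDs, and the function name gac_step is ours.

import numpy as np

def gac_step(probs_list, vocabs, weights=None):
    """Average per-model next-token distributions over the union vocabulary.

    probs_list[i] is model i's probability vector over vocabs[i] (token strings).
    """
    union = sorted(set().union(*vocabs))        # union of all vocabularies
    index = {tok: j for j, tok in enumerate(union)}
    weights = weights or [1.0 / len(probs_list)] * len(probs_list)
    avg = np.zeros(len(union))
    for probs, vocab, w in zip(probs_list, vocabs, weights):
        for p, tok in zip(probs, vocab):
            avg[index[tok]] += w * p            # project onto the union vocab and accumulate
    return union[int(np.argmax(avg))]           # the ensemble's next token

# Toy example with two overlapping 3-token vocabularies:
print(gac_step(
    [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],
    [["a", "b", "c"], ["b", "c", "d"]],
))  # -> 'b': its averaged mass 0.5*0.3 + 0.5*0.5 = 0.40 beats 'a' at 0.30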

We provide two types of ensemble methods:

  1. Every Step Ensemble: All LLMs ensemble at each generation step.
  2. Thresholded Ensemble: A primary (gate) LLM is specified, and ensembling is performed at a step only if the primary LLM's maximum token probability falls below a threshold, thereby saving computational resources (a sketch of the gating rule follows this list).
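
The gating rule of the thresholded ensemble can be sketched as follows. This only illustrates the decision logic; the arguments primary_probs and ensemble_step are hypothetical stand-ins, not the repository's API.

def thresholded_step(primary_probs, ensemble_step, threshold):
    # primary_probs: dict mapping token -> probability from the primary (gate) model
    # ensemble_step: callable that runs the full GaC ensemble for this step (hypothetical)
    best_token, best_prob = max(primary_probs.items(), key=lambda kv: kv[1])
    if best_prob >= threshold:
        return best_token      # the gate model is confident: skip the other models
    return ensemble_step()     # otherwise, ensemble all models for this step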

We support running the LLMs involved in the ensemble in parallel to save time. If each LLM is allocated to a different GPU, the latency of ensembling is almost the same as using a single LLM; the parallel execution is managed by Ray, as illustrated below.
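
To see why per-GPU placement keeps latency close to a single model's, here is an illustrative Ray pattern. The class and method names (ModelWorker, step) are ours, not the repository's; a real worker would load an LLM and return its next-token distribution.

import ray

ray.init()

@ray.remote(num_gpus=1)                    # pin each worker to its own GPU
class ModelWorker:
    def __init__(self, name):
        self.name = name                   # a real worker would load the LLM here

    def step(self, prompt):
        return {"model": self.name}        # stub for a next-token distribution

workers = [ModelWorker.remote(n) for n in ("model-a", "model-b")]
futures = [w.step.remote("Hello") for w in workers]   # both steps run concurrently
results = ray.get(futures)                 # wall time ~ the slowest model, not the sum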

Method Diagram

GaC Ensemble Results

| Id | Models | MMLU | GSM8K | BBH | TriviaQA | NQ | Avg. | Date | Latency |
|----|--------|------|-------|-----|----------|-----|------|------|---------|
| 1 | Yi-34B-Chat | 72.75 | 68.76 | 50.88 | 70.01 | 29.81 | 58.44 | 2023/11/08 | 67.96 ms/token |
| 2 | Mixtral-8x7B-Instruct-v0.1 | 70.89 | 66.82 | 49.84 | 76.54 | 34.35 | 59.69 | 2023/12/11 | 96.64 ms/token |
| 3 | Qwen1.5-72B-Chat | 77.79 | 83.33 | 48.94 | 65.69 | 27.02 | 60.55 | 2024/02/04 | 102.11 ms/token |
| 4 | Llama-3-70B-Instruct | 79.68 | 90.00 | 57.13 | 79.12 | 35.57 | 68.30 | 2024/04/18 | 150.32 ms/token |
| 5 | Qwen2-72B-Instruct | 82.30 | 89.70 | 62.57 | 73.58 | 33.11 | 68.25 | 2024/06/07 | 113.91 ms/token |
| 6 | GaC(Yi+Mixtral) | 74.83 | 71.21 | 52.64 | 75.60 | 33.52 | 61.56 (↑3.13%) | ~2023/12/11 | 98.13 ms/token |
| 7 | GaC(Qwen1.5-72B+Yi) | 79.83 | 77.27 | 52.05 | 70.88 | 33.80 | 62.77 (↑3.65%) | ~2024/02/04 | 103.69 ms/token |
| 8 | GaC(Qwen1.5-72B+Mixtral) | 79.55 | 75.76 | 54.19 | 75.71 | 31.09 | 63.26 (↑4.47%) | ~2024/02/04 | 112.83 ms/token |
| 9 | GaC(Llama-3+Qwen1.5-72B) | 81.49 | 87.06 | 56.73 | 78.60 | 36.01 | 67.98 (↓0.47%) | ~2024/04/18 | 153.96 ms/token |
| 10 | GaC(Qwen2-72B+Llama-3) | 83.54 | 90.91 | 63.99 | 79.29 | 37.65 | 71.08 (↑4.06%) | ~2024/06/07 | 151.56 ms/token |

Note: Ensemble of available SOTA LLMs from different periods. The top part lists the individual models, while the bottom part shows the ensemble results (model names abbreviated). ↑/↓ indicates the percentage change relative to the individual models.

Table of Contents

  • System Requirements
  • Get Started

System Requirements

  • Operating System: Ubuntu 20.04. We have not tested on Windows.
  • Python Version: 3.11
  • GPU: Ensure that your GPU has enough RAM to load all the models you want to ensemble.
  • Environment Management Tool: Anaconda or any other suitable tool

Get Started

Create a New Environment

  1. Open your terminal.
  2. Create and activate a new conda environment:
conda create -n gac_env python=3.11
conda activate gac_env

Install GaC Required Packages

cd [root-of-this-repo]/GaC
pip install -r requirements.txt

Launch GaC Server

We have integrated our work into an API server, which can be configured with a YAML file at startup to determine which LLMs to use for ensembling. An example is shown below:

NORM_TYPE_API_SERVER: 'average' # 'average' or 'score'
THRESHOLD_API_SERVER: 1.0
CONFIG_API_SERVER:
  - weight: '[Please replace with the path to the local model weight]' # or 'upstage/SOLAR-10.7B-Instruct-v1.0'
    max_memory:
      0: '24GiB'
    num_gpus: 0.5
    name: 'SOLAR-10.7B-Instruct-v1.0'
    score: 100
    priority: 'supportive' # 'primary' or 'supportive'
  
  - weight: '[Please replace with the path to the local model weight]' # or 'openchat/openchat-3.5-0106'
    max_memory:
      0: '24GiB'
    num_gpus: 0.5
    name: 'openchat-3.5-0106'
    score: 100
    priority: 'supportive' # 'primary' or 'supportive'

Note: Please ensure that the number of GPUs on your machine is at least the sum of all num_gpus values, and that the max_memory indices for each model always start from 0 (you can assume each model runs on an independent machine managed by Ray).

Explanation of Parameters
  • CONFIG_API_SERVER: List of models to be used in the ensemble. Each model configuration includes:
    • weight: Local path to the model weight. You can also choose to use the Hugging Face model card name to download automatically.
    • max_memory: Controls how much memory each GPU uses. Since each model is managed independently by Ray, the GPU IDs always start from 0. For example, if you set num_gpus to 2, you should allocate the maximum memory for each GPU, such as {0: 'xxGiB', 1: 'xxGiB'}.
    • num_gpus: Number of GPUs allocated to this model, controlled by Ray. To load two models on one GPU, set num_gpus to 0.5 for both models, so a total of 0.5 + 0.5 = 1 GPU is used in this case.
    • priority: If all models are 'supportive', the ensemble is performed at every generation step. For threshold-based ensembling, set the gate model's priority to 'primary'.
  • NORM_TYPE_API_SERVER: Ensemble weight type, 'average' or 'score'. 'score' means each model's output vector in the GaC ensemble is weighted by its score divided by the total score (a worked example follows this list).
  • THRESHOLD_API_SERVER: Threshold for ensemble. This parameter is ineffective if all models are supportive.
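
As a quick worked example of the 'score' weighting (model names and scores taken from the example config above):

# With NORM_TYPE_API_SERVER: 'score', each model's output vector is scaled
# by score_i / sum(scores) before the vectors are combined.
scores = {"SOLAR-10.7B-Instruct-v1.0": 100, "openchat-3.5-0106": 100}
total = sum(scores.values())                         # 200
weights = {m: s / total for m, s in scores.items()}
print(weights)  # equal scores -> each model contributes with weight 0.5
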
Examples and Tested Models

We have listed examples of ensembling SOLAR-10.7B-Instruct-v1.0 and openchat-3.5-0106 under example_configs/:

  • example_ensemble_every_step.yaml: Ensembles at every generation step; every model's priority is 'supportive', and THRESHOLD_API_SERVER is ignored.
  • example_thresholded_ensemble.yaml: Ensembles at a generation step only if the primary model's highest token probability is below THRESHOLD_API_SERVER.

Additionally, we have listed the models that have been tested in tested_models.yaml. This does not mean that newer models absent from the list won't work; it only means we have not verified them.

To start the GaC server, use the following command in your terminal:

python gac_api_server.py --config-path [path-to-your-config-file.yaml] --host 0.0.0.0 --port 8000

Run GaC Ensemble

After setting up the API, you can directly execute the GaC ensemble by making calls as demonstrated in call.py. Here’s an explanation of the key parameters:

  • messages_list: A list of conversations, where each conversation is itself a list of messages in the format {"role": "..", "content": ".."}. The value of "role" can be system, user, or assistant. A list containing n conversations indicates a batch size of n. However, thresholded ensembles currently support only a batch size of 1.
  • max_new_tokens: An integer that determines the maximum number of tokens that can be generated.
  • apply_chat_template: A boolean. If True, each model will assemble the messages into a specific template according to the "chat_template" Jinja template defined in its tokenizer_config.json.

Here’s an example code:

import requests

url = "http://0.0.0.0:8000/api/generate/"

data = {
    "messages_list": [
        # Conversation 1
        [{"role": "user", "content": "9.11 and 9.9, which is bigger?"},
         {"role": "assistant", "content": "..."},
         {"role": "user", "content": "..."}],
        # Conversation 2
        [{"role": "user", "content": "How are you?"}]                     
    ],
    "max_new_tokens": 1024,
    "apply_chat_template": True,
}

response = requests.post(url, json=data)
print(response.json())

Citation

@misc{yu2024breaking,
  title={Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling},
  author={Yu, Yao-Ching and Kuo, Chun-Chih and Ye, Ziqi and Chang, Yu-Cheng and Li, Yueh-Se},
  year={2024},
  eprint={2406.12585},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
