
Reproduce the commonsense results on BoolQ #64

Open
Zhenyu001225 opened this issue Apr 9, 2024 · 23 comments


Zhenyu001225 commented Apr 9, 2024

When doing the evaluation, should I use --load_8bit? I'm trying to reproduce the results of LLaMA-7B-LoRA.

Finetune:
CUDA_VISIBLE_DEVICES=8 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path './ft-training_set/commonsense_170k.json' --output_dir './trained_models/llama-7b-lora-commonsense/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

Evaluate:
CUDA_VISIBLE_DEVICES=3 python commonsense_evaluate.py \
    --model LLaMA-7B \
    --adapter LoRA \
    --dataset boolq \
    --batch_size 1 \
    --base_model 'yahma/llama-7b-hf' \
    --lora_weights './trained_models/llama-7b-lora-commonsense/'

But the result is only 57.5, compared with 68.9 in the table.
Could you give me some insight here?
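For reference, my understanding is that the LoRA flags in the finetune command map roughly to a PEFT config like this (just a sketch on my side, assuming the repo builds on HuggingFace PEFT; lora_dropout, bias, and task_type are my guesses, not values taken from the repo):

    from peft import LoraConfig

    # Mirrors --lora_r 32 --lora_alpha 64 and the --target_modules list above;
    # lora_dropout, bias, and task_type are assumed, not taken from finetune.py.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )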

@Zhenyu001225

And for PIQA the result is 74.6, compared with 80.7 in the table.
For SIQA the result is 60.8, compared with 77.4 in the table.
Should I fine-tune again, or adjust any of the hyperparameters?

@lucasliunju

Hi, may I ask whether you have solved this issue?

@wutaiqiang

btw, I find that a larger batch size leads to some bad outputs, while bsz=1 does not.

@lucasliunju

@wutaiqiang Yes, I also found this problem. bsz=1 solves most cases, but it can still output bad results in some cases.

@wutaiqiang

In my case, the results are even better than reported. You should use a single GPU for finetuning.

@wutaiqiang

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

@wutaiqiang

For LLaMA-7B LoRA.

@lucasliunju

Hi @wutaiqiang, thanks for your data point. I tried changing the base model from "float16" to "float32" or "bfloat16", and I find the output results are not very stable.
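For context, the dtype change I mean is roughly this when loading the base model with transformers (a sketch, not the exact code in this repo):

    import torch
    from transformers import AutoModelForCausalLM

    # Load the base model in an explicit precision; swap the dtype to compare
    # float16 / bfloat16 / float32 behaviour.
    model = AutoModelForCausalLM.from_pretrained(
        "yahma/llama-7b-hf",
        torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
    )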

@Zhenyu001225

Hi, may I ask whether you have solved this issue?

Hi, I changed the version of transformers to 4.35.0 and set batch_size=1 when doing evaluation.

Now the results are:

Model | GSM8K | SVAMP | AQuA | MultiArith | SingleEq | AddSub
LLaMA-7B-LoRA-math | 37.9 | 47.0 | 19.68 | 97.5 | 85.83 | 83.54

Model | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-c | ARC-e | OpenBookQA | Average
LLaMA-7B-LoRA-Commonsense | 64.01 | 80.25 | 77.28 | 76.50 | 79.79 | 62.54 | 77.31 | 77.4 | 74.39

@Zhenyu001225

For LLaMA-7B LoRA.

Hi, what is the version of transformers in your case?

@wutaiqiang

4.32.1

@Zhenyu001225

4.32.1

Thank you so much~ I'll try again
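For reference, pinning the version should just be (assuming a pip-based environment):

    pip install transformers==4.32.1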

@clarenceluo78

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask if you used 8-bit quantization when reproducing?

@Zhenyu001225

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask if you used 8-bit quantization when reproducing?

I didn't enable 8-bit quantization.

@wutaiqiang

After rerunning, the results are:

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
68.13 | 80.3 | 78.45 | 83.11 | 80.66 | 77.23 | 65.78 | 79.4

@AaronZLT

@wutaiqiang have you tried the math finetuning?

@wutaiqiang

Not yet @AaronZLT

@wutaiqiang

btw, I find the results quite unstable; run it several times and you will get quite different results.


AaronZLT commented Jun 19, 2024

Hi, @lucasliunju @Zhenyu001225

@wutaiqiang Yes, I also found this problem. bsz=1 solves most cases, but it can still output bad results in some cases.

Does 'bsz=1' here mean the batch size in finetuning or in evaluation? In general I would expect the evaluation batch size not to affect the eval result; otherwise it would be very weird. The evaluation results from this repo are quite different from lm-eval-harness (the official eval repo used by the HuggingFace Open LLM Leaderboard), and the lm-eval results are poor.

Btw, after finetuning on the commonsense-170k dataset, the performance on 5-shot MMLU drops and is worse than the base model.
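For anyone who wants to cross-check, an lm-eval-harness run would look roughly like this (a sketch; it assumes a recent lm-eval version where the hf model type accepts a peft= argument for LoRA adapters, and it reuses the adapter path from the commands above):

    lm_eval --model hf \
        --model_args pretrained=yahma/llama-7b-hf,peft=./trained_models/llama-7b-lora-commonsense \
        --tasks boolq \
        --batch_size 1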


wutaiqiang commented Jun 19, 2024

Yes, bsz means the batch size.

You can refer to: https://openreview.net/pdf?id=9MDjKb9lGi

@AaronZLT


Leopold1423 commented Oct 12, 2024

When bsz>1, the attention_mask should also be passed to generation:

        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            **kwargs,
        )
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                attention_mask=inputs['attention_mask'].to(device),   # add attention mask here 
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
        s = generation_output.sequences
        outputs = tokenizer.batch_decode(s, skip_special_tokens=True)
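A related point (my own assumption, not something taken from this repo): for a decoder-only model like LLaMA, batched generation usually also needs left padding, so that prompts of different lengths stay aligned with the generated continuation:

    # Configure the tokenizer for batched generation with a decoder-only model,
    # before calling tokenizer(prompts, ...) as in the snippet above.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id  # LLaMA has no pad token by default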

@Leopold1423

By the way, in evaluate.py, when using bsz>1, use_cache=False should be removed or changed to True; otherwise the output will be a mess. Can anyone explain why?
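For concreteness, the change I mean is roughly this in the generate call (a sketch based on the snippet above; I am assuming evaluate.py's call looks similar, and True is the default, so simply dropping the flag also works):

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"].to(device),
            generation_config=generation_config,
            use_cache=True,  # instead of use_cache=False
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )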
