Reproduce the commonsense results on BoolQ #64
Comments
And for PIQA the result is 74.6, compared with 80.7 in the table.
Hi, may I ask whether you have solved this issue now?
btw, I find that a larger batch size leads to some bad outputs, while bsz=1 does not.
@wutaiqiang Yes, I also find this problem; bsz=1 fixes most cases, but it can still produce bad results in some cases.
In my case, the results are even better than reported. You should use one GPU for finetuning.
For LLaMA-7B LoRA:

| boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8 |
Hi @wutaiqiang, thanks for your data point. I tried changing the base model from "float16" to "float32" or "bfloat16", and I find the output is still not very stable.
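For reference, here is a minimal sketch of switching the base-model dtype at load time with the standard transformers API; the model name and dtype choice below are illustrative, not the repo's exact loading code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "yahma/llama-7b-hf"  # base model discussed in this thread

tokenizer = AutoTokenizer.from_pretrained(base_model)

# torch_dtype controls the precision the weights are loaded in:
# float16/bfloat16 halve memory use, float32 is the most numerically stable.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,  # try torch.float16 or torch.float32 as well
    device_map="auto",
)
```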
Hi, I changed the version of transformers to 4.35.0 and set batch_size=1 when doing evaluation. Now the results are:
Hi, what is the version of transformers in your case?
4.32.1
Thank you so much~ I'll try again.
Hi there, I want to ask whether you used 8-bit quantization when reproducing?
I didn't enable 8-bit quantization.
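For context, enabling or disabling 8-bit loading in plain transformers looks roughly like the sketch below (assuming the bitsandbytes integration; this is not necessarily what the repo's --load_8bit flag does internally):

```python
from transformers import AutoModelForCausalLM

# load_in_8bit=True quantizes the weights to int8 via bitsandbytes at load
# time; omitting it (the default) keeps the weights unquantized, which is
# what "didn't enable 8-bit quantization" corresponds to here.
model = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",
    load_in_8bit=True,  # drop this line for the unquantized run
    device_map="auto",
)
```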
After rerunning, the results are:
@wutaiqiang have you tried the math finetuning?
Not yet @AaronZLT
btw, I find the results quite unstable; try a few more runs and you will get quite different results.
Hi, @lucasliunju @Zhenyu001225
Does 'bsz=1' here mean the batch size in finetuning or in evaluation? In general I think the batch size in evaluation should not affect the eval result; otherwise it would be very strange. The evaluation result from this repo is quite different from lm-eval-harness (the official eval repo used by the huggingface open-llm-leaderboard), and the lm-eval result is poor. Btw, after finetuning on the commonsense-170k dataset, the performance on 5-shot MMLU drops and is worse than the base model.
Yes, bsz means the batch size. You can refer to:
When bsz>1, the attention_mask should also be passed in generation:
By the way, in evaluate.py, when using bsz>1, use_cache=False should be removed or changed to True; otherwise the output will be a mess. Can anyone explain why?
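To make the two points above concrete, here is a minimal sketch of batched generation with padding; it follows the standard transformers pattern rather than the exact code in evaluate.py, and the prompts are made up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "yahma/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)

# Decoder-only models should be left-padded for batched generation, and
# LLaMA has no pad token by default, so reuse the EOS token for padding.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Question: is the sky blue? Answer:",
    "Question: can fish walk on land? Answer:",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# attention_mask tells the model which positions are real tokens and which
# are padding; leaving it out is a common cause of bad outputs at bsz > 1.
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=32,
    use_cache=True,  # the thread above reports garbled output with use_cache=False at bsz > 1
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```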
When I'm doing the evaluation, should I use --load_8bit? I'm trying to reproduce the results of LLaMA-7B-LoRA.
Finetune:
CUDA_VISIBLE_DEVICES=8 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path './ft-training_set/commonsense_170k.json' --output_dir './trained_models/llama-7b-lora-commonsense/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64
Evaluate:
CUDA_VISIBLE_DEVICES=3 python commonsense_evaluate.py --model LLaMA-7B --adapter LoRA --dataset boolq --batch_size 1 --base_model 'yahma/llama-7b-hf' --lora_weights './trained_models/llama-7b-lora-commonsense/'
But the result is only 57.5, compared with 68.9 in the table.
Could you provide me with some insights here?