
Reproduce the commonsense results on BoolQ #64

Open
Zhenyu001225 opened this issue Apr 9, 2024 · 23 comments


Zhenyu001225 commented Apr 9, 2024

When doing the evaluation, should I use --load_8bit? I'm trying to reproduce the results of LLaMA-7B-LoRA.

Finetune:
CUDA_VISIBLE_DEVICES=8 python finetune.py --base_model 'yahma/llama-7b-hf' --data_path './ft-training_set/commonsense_170k.json' --output_dir './trained_models/llama-7b-lora-commonsense/' --batch_size 16 --micro_batch_size 4 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 120 --eval_step 80 --save_step 80 --adapter_name lora --target_modules '["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"]' --lora_r 32 --lora_alpha 64

Evaluate:
CUDA_VISIBLE_DEVICES=3 python commonsense_evaluate.py \
    --model LLaMA-7B \
    --adapter LoRA \
    --dataset boolq \
    --batch_size 1 \
    --base_model 'yahma/llama-7b-hf' \
    --lora_weights './trained_models/llama-7b-lora-commonsense/'

But the result is only 57.5, compared with 68.9 in the table.
Could you give me some insight here?
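For reference, my understanding is that the LoRA flags in the finetune command map roughly to a PEFT config like this (just a sketch on my side, assuming the repo builds on HuggingFace PEFT; lora_dropout, bias, and task_type are my guesses, not values taken from the repo):

    from peft import LoraConfig

    # Mirrors --lora_r 32 --lora_alpha 64 and the --target_modules list above;
    # lora_dropout, bias, and task_type are assumed, not taken from finetune.py.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )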

@Zhenyu001225

And for PIQA the result is 74.6, compared with 80.7 in the table.
For SIQA the result is 60.8, compared with 77.4 in the table.
Should I fine-tune again, or adjust any of the hyperparameters?

@lucasliunju

Hi, may I ask whether you have solved this issue?

@wutaiqiang

btw, I find that a larger batch size leads to some bad outputs, while bsz=1 does not.

@lucasliunju

@wutaiqiang Yes, I also found this problem. bsz=1 solves most cases, but it can still output bad results in some cases.

@wutaiqiang

In my case, the results are even better than reported. You should use a single GPU for finetuning.

@wutaiqiang

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

@wutaiqiang

For LLaMA-7B LoRA.

@lucasliunju

Hi @wutaiqiang, thanks for your data point. I tried changing the base model from "float16" to "float32" or "bfloat16", and I find the output results are not very stable.
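For context, the dtype change I mean is roughly this when loading the base model with transformers (a sketch, not the exact code in this repo):

    import torch
    from transformers import AutoModelForCausalLM

    # Load the base model in an explicit precision; swap the dtype to compare
    # float16 / bfloat16 / float32 behaviour.
    model = AutoModelForCausalLM.from_pretrained(
        "yahma/llama-7b-hf",
        torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
    )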

@Zhenyu001225

Hi, may I ask whether you have solved this issue?

Hi, I changed the version of transformers to 4.35.0 and set batch_size=1 when doing evaluation.

Now the results are:

Model | GSM8K | SVAMP | AQuA | MultiArith | SingleEq | AddSub
LLaMA-7B-LoRA-math | 37.9 | 47.0 | 19.68 | 97.5 | 85.83 | 83.54

Model | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-c | ARC-e | OpenBookQA | Average
LLaMA-7B-LoRA-Commonsense | 64.01 | 80.25 | 77.28 | 76.50 | 79.79 | 62.54 | 77.31 | 77.4 | 74.39

@Zhenyu001225

For LLaMA-7B LoRA.

Hi, what is the version of transformers in your case?

@wutaiqiang

4.32.1

@Zhenyu001225

4.32.1

Thank you so much~ I'll try again
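For reference, pinning the version should just be (assuming a pip-based environment):

    pip install transformers==4.32.1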

@clarenceluo78

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask if you used 8-bit quantization when reproducing?

@Zhenyu001225

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
69.44 | 80.79 | 79.32 | 84.2 | 81.61 | 80.34 | 64.93 | 76.8

Hi there, I want to ask if you used 8-bit quantization when reproducing?

I didn't enable 8-bit quantization.

@wutaiqiang

After rerunning, the results are:

boolq | piqa | social_i_qa | hellaswag | winogrande | ARC-Easy | ARC-Challenge | openbookqa
68.13 | 80.3 | 78.45 | 83.11 | 80.66 | 77.23 | 65.78 | 79.4

@AaronZLT

@wutaiqiang have you tried the math finetuning?

@wutaiqiang

Not yet @AaronZLT

@wutaiqiang

btw, I find the results quite unstable; run it several times and you will get quite different results.


AaronZLT commented Jun 19, 2024

Hi, @lucasliunju @Zhenyu001225

@wutaiqiang Yes, I also found this problem. bsz=1 solves most cases, but it can still output bad results in some cases.

Does 'bsz=1' here mean the batch size in finetuning or in evaluation? In general I would expect the evaluation batch size not to affect the eval result; otherwise it would be very weird. The evaluation results from this repo are quite different from lm-eval-harness (the official eval repo used by the HuggingFace Open LLM Leaderboard), and the lm-eval results are poor.

Btw, after finetuning on the commonsense-170k dataset, the performance on 5-shot MMLU drops and is worse than the base model.
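For anyone who wants to cross-check, an lm-eval-harness run would look roughly like this (a sketch; it assumes a recent lm-eval version where the hf model type accepts a peft= argument for LoRA adapters, and it reuses the adapter path from the commands above):

    lm_eval --model hf \
        --model_args pretrained=yahma/llama-7b-hf,peft=./trained_models/llama-7b-lora-commonsense \
        --tasks boolq \
        --batch_size 1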


wutaiqiang commented Jun 19, 2024

Yes, bsz means the batch size.

You can refer to: https://openreview.net/pdf?id=9MDjKb9lGi

@AaronZLT


Leopold1423 commented Oct 12, 2024

When bsz>1, the attention_mask should also be passed to generation:

        inputs = tokenizer(prompts, return_tensors="pt", padding=True)
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            **kwargs,
        )
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                attention_mask=inputs['attention_mask'].to(device),   # add attention mask here 
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
        s = generation_output.sequences
        outputs = tokenizer.batch_decode(s, skip_special_tokens=True)
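A related point (my own assumption, not something taken from this repo): for a decoder-only model like LLaMA, batched generation usually also needs left padding, so that prompts of different lengths stay aligned with the generated continuation:

    # Configure the tokenizer for batched generation with a decoder-only model,
    # before calling tokenizer(prompts, ...) as in the snippet above.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id  # LLaMA has no pad token by default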

@Leopold1423

By the way, in evaluate.py, when using bsz>1, use_cache=False should be removed or changed to True; otherwise the output will be a mess. Can anyone explain why?
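For concreteness, the change I mean is roughly this in the generate call (a sketch based on the snippet above; I am assuming evaluate.py's call looks similar, and True is the default, so simply dropping the flag also works):

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"].to(device),
            generation_config=generation_config,
            use_cache=True,  # instead of use_cache=False
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )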
