Launching Qwen2-VL-2B-Instruct on multiple GPUs: inference uses only one GPU's memory and hits OOM #561

Closed
xieyongshuai opened this issue Nov 27, 2024 · 8 comments

Comments

@xieyongshuai

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

device = 'cuda'
model_dir = "/root/projects/models/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)

model.to(device)
model.eval()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

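GPUs: 24 GB × 2.
At startup the model is loaded across both cards, but during inference all the memory use lands on the first card and it runs out of memory (OOM).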

"name": "OutOfMemoryError",
"message": "CUDA out of memory. Tried to allocate 12.20 GiB. GPU 0 has a total capacity of 23.87 GiB of which 3.00 GiB is free. Process 2132557 has 208.00 MiB memory in use. Including non-PyTorch memory, this process has 18.17 GiB memory in use. Process 3246895 has 2.51 GiB memory in use. Of the allocated memory 17.52 GiB is allocated by PyTorch, and 490.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",

How can inference be spread evenly across multiple cards?
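
For reference, the figures in the error message above can be read off directly: the failing request is a single 12.20 GiB allocation, while GPU 0 reports only 3.00 GiB free. The sketch below extracts those two numbers and computes the gap. The helper name `oom_shortfall_gib` is made up for illustration, and the regular expressions assume the exact message wording quoted in this issue.

```python
import re

# Excerpt of the error message quoted in this issue.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 12.20 GiB. "
    "GPU 0 has a total capacity of 23.87 GiB of which 3.00 GiB is free."
)


def oom_shortfall_gib(message: str) -> float:
    """Return how many GiB the failed request exceeds the free memory by.

    Hypothetical helper: it assumes the message reports both figures in GiB
    using the wording shown above.
    """
    requested = float(re.search(r"Tried to allocate ([\d.]+) GiB", message).group(1))
    free = float(re.search(r"of which ([\d.]+) GiB is free", message).group(1))
    return round(requested - free, 2)


print(oom_shortfall_gib(OOM_MESSAGE))
```

A single tensor allocation has to fit on one device, so a request of this size can still fail on GPU 0 even when the second card has free memory.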

@Zara7hus7ra

same problem

@czydfj

czydfj commented Dec 13, 2024

same problem

@Stephen-K1

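With flash attention, 24 GB × 2 is enough:
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)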

@sjghh

sjghh commented Dec 19, 2024

same problem

@sjghh

sjghh commented Dec 19, 2024

Replying to @Stephen-K1 ("With flash attention, 24 GB × 2 is enough", with the flash_attention_2 snippet above):

Can it use both cards?

@xieyongshuai
Author

Replying to @Stephen-K1 ("With flash attention, 24 GB × 2 is enough", with the flash_attention_2 snippet above):

Isn't that only at startup? During inference the second card is not doing any work.
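
On the second card sitting idle during inference: one detail in the original script that may be worth ruling out is the `model.to(device)` call made after loading with `device_map="auto"`. As far as I know, Accelerate warns against moving a model that it has already dispatched across devices, so that call could undo the placement. The sketch below is the original script with that call removed and the inputs moved to `model.device` instead of a hard-coded `"cuda"`. It is an untested suggestion, not a confirmed fix, and the function name `run_sharded_inference` is made up for illustration.

```python
def run_sharded_inference(model_dir: str, messages: list, max_new_tokens: int = 128) -> list:
    """Sketch: load with device_map="auto" and never call model.to() afterwards."""
    # Imports live inside the function so this file can be imported
    # on a machine without the GPU stack installed.
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_dir, torch_dtype="auto", device_map="auto"
    )
    model.eval()  # note: no model.to(device) here
    processor = AutoProcessor.from_pretrained(model_dir)

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Assumption: model.device is the device where the first layers live,
    # which is where the inputs need to start out.
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
```

Keep in mind that a `device_map` split places different layers on different cards. It does not split the activations of a single layer, so one very large intermediate tensor can still exhaust the card that holds that layer.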

@Stephen-K1

Replying to @sjghh ("Can it use both cards?"):

Yes.

@Stephen-K1

Replying to @xieyongshuai ("Isn't that only at startup? During inference the second card is not doing any work."):

It is working. With two cards, the model is split into two parts and one part is loaded onto each card, so per-card memory use drops by roughly half. On a single card it needs about 21 GB.
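
One way to check how the model was actually split is to look at `model.hf_device_map`, which Transformers sets when a model is loaded with a `device_map`. The helper below only counts how many entries sit on each device. The name `modules_per_device` and the sample map are hypothetical and not taken from this issue.

```python
from collections import Counter


def modules_per_device(device_map: dict) -> dict:
    """Count how many entries of a device map are placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# Hypothetical map for illustration only; with a real model,
# pass model.hf_device_map instead.
example_map = {
    "visual": 0,
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": 1,
}
print(modules_per_device(example_map))
```

If every entry reports the same device, the second card holds no weights and cannot take part in inference.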
