Launching Qwen2-VL-2B-Instruct on multiple GPUs: inference uses only one GPU's memory and hits OOM #561
Comments
same problem
same problem
With flash attention, 2 × 24 GB is enough.
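A minimal sketch of how FlashAttention-2 could be requested at load time, assuming the `flash-attn` package is installed and the GPUs support it. `build_load_kwargs` and `load_model` are hypothetical helper names that do not appear in this thread; `attn_implementation="flash_attention_2"` is the `from_pretrained` argument in `transformers`, and it needs half-precision weights, hence `torch.bfloat16` here.

```python
def build_load_kwargs(use_flash_attention=True):
    """Keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {"device_map": "auto"}
    if use_flash_attention:
        # Assumes the flash-attn package is installed.
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_dir, use_flash_attention=True):
    """Load Qwen2-VL in bfloat16; heavy imports stay inside the function."""
    import torch
    from transformers import Qwen2VLForConditionalGeneration

    return Qwen2VLForConditionalGeneration.from_pretrained(
        model_dir,
        torch_dtype=torch.bfloat16,  # FlashAttention-2 supports fp16/bf16 only
        **build_load_kwargs(use_flash_attention),
    )
```

With the path from the report, this would be called as `load_model("/root/projects/models/Qwen2-VL-2B-Instruct")`.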
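same problem
Can two GPUs be used?
That is only at load time, isn't it? During inference the second GPU does no work.
Yes, they can.
It does work. With two GPUs the model is split into two parts and loaded across both cards, so per-card memory drops by roughly half. On a single card it needs about 21 GB.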
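One way to check whether the weights really are split across both cards is to look at `model.hf_device_map`, which `transformers` sets when a model is loaded with a `device_map`. The sketch below counts modules per device. `modules_per_device` is a hypothetical helper, and the example map uses made-up module names purely for illustration.

```python
from collections import Counter


def modules_per_device(device_map):
    """Count how many modules were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# Illustrative map only; a real one comes from `model.hf_device_map`.
example_map = {
    "visual": 0,
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
print(modules_per_device(example_map))
```

With a loaded model, `modules_per_device(model.hf_device_map)` listing only one device would mean that no sharding took place.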
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import os

# Expose only physical GPUs 2 and 3 to this process.
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
device = 'cuda'
model_dir = "/root/projects/models/Qwen2-VL-2B-Instruct"

# Load the model and let device_map="auto" place the weights.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
model.to(device)
model.eval()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and preprocess the image.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
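GPUs: 2 × 24 GB
At startup the model is loaded onto both GPUs, but during inference all of the memory usage lands on the first GPU and it runs out of memory: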
"name": "OutOfMemoryError",
"message": "CUDA out of memory. Tried to allocate 12.20 GiB. GPU 0 has a total capacity of 23.87 GiB of which 3.00 GiB is free. Process 2132557 has 208.00 MiB memory in use. Including non-PyTorch memory, this process has 18.17 GiB memory in use. Process 3246895 has 2.51 GiB memory in use. Of the allocated memory 17.52 GiB is allocated by PyTorch, and 490.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",
How can inference be spread evenly across multiple GPUs?
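Before looking at how work is spread across cards, it may help to sanity-check why a single 12.20 GiB allocation is requested at all. Without a fused kernel such as flash attention, attention materializes a score tensor of `num_heads × num_tokens × num_tokens` elements, and that tensor lives on whichever single GPU holds the layer. The sketch below computes its size. The function name and every input value (a 2044×1372 image, 14×14 patches, 16 heads, 4-byte scores) are assumptions chosen for illustration; none of them is stated in this thread.

```python
def attention_scores_gib(num_tokens, num_heads, bytes_per_element):
    """Size in GiB of one full (heads x tokens x tokens) attention score tensor."""
    return num_heads * num_tokens ** 2 * bytes_per_element / 2 ** 30


# Assumed values, for illustration only (not taken from the issue).
image_width, image_height, patch_size = 2044, 1372, 14
num_tokens = (image_width // patch_size) * (image_height // patch_size)

estimate = attention_scores_gib(num_tokens, num_heads=16, bytes_per_element=4)
print(f"{num_tokens} tokens -> {estimate:.2f} GiB")
```

Under these assumptions the estimate comes out at roughly 12.2 GiB, which is in the same range as the failed allocation in the error above. If that is indeed the cause, adding a second GPU would not help on its own, because `device_map="auto"` places whole layers on different cards and does not split one activation tensor between them.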