Launching Qwen2-VL-2B-Instruct on multiple GPUs: inference uses only one GPU's memory and OOMs #561

xieyongshuai · 2024-11-27T08:01:27Z

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import os

# Expose only physical GPUs 2 and 3 to this process (seen as cuda:0 and cuda:1).
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

device = 'cuda'
model_dir = "/root/projects/models/Qwen2-VL-2B-Instruct"

# device_map="auto" lets accelerate place the weights across the visible GPUs.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)

# Note: this moves the model to 'cuda' after device_map="auto" has already placed it.
model.to(device)
model.eval()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

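GPUs: 24 GB × 2
At startup the model is loaded across both cards, but during inference everything is allocated on the first card and it runs out of memory.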

"name": "OutOfMemoryError",
"message": "CUDA out of memory. Tried to allocate 12.20 GiB. GPU 0 has a total capacity of 23.87 GiB of which 3.00 GiB is free. Process 2132557 has 208.00 MiB memory in use. Including non-PyTorch memory, this process has 18.17 GiB memory in use. Process 3246895 has 2.51 GiB memory in use. Of the allocated memory 17.52 GiB is allocated by PyTorch, and 490.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)",

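One thing that may be worth trying, offered as an assumption rather than something confirmed in this thread: the failed request is a single 12.20 GiB allocation, which could come from activations for a large image rather than from the weights. The Qwen2-VL processor accepts `min_pixels` and `max_pixels` to bound how many visual tokens an image produces. The sketch below converts a token range into those pixel limits; the helper names `pixel_budget` and `load_processor` and the default range of 256 to 1280 tokens are illustrative choices, not values taken from this issue.

```python
PATCH_SIDE = 28  # each visual token covers a 28x28 pixel area in Qwen2-VL


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    area = PATCH_SIDE * PATCH_SIDE
    return min_tokens * area, max_tokens * area


def load_processor(model_dir: str, min_tokens: int = 256, max_tokens: int = 1280):
    """Hypothetical helper: load the processor with a capped image resolution."""
    # Imported lazily so that pixel_budget() can be used without transformers.
    from transformers import AutoProcessor

    min_pixels, max_pixels = pixel_budget(min_tokens, max_tokens)
    return AutoProcessor.from_pretrained(
        model_dir, min_pixels=min_pixels, max_pixels=max_pixels
    )
```

Lowering `max_tokens` trades image detail for memory, so the right value depends on the images being processed.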
How can I get inference to spread evenly across multiple GPUs?
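A minimal sketch of one way to load the model so that placement is left entirely to `device_map="auto"`, assuming the explicit `model.to(device)` call in the script may conflict with that placement (this is not confirmed in the thread). It also passes a per-GPU `max_memory` cap so that some headroom is left for activations. The helper names `build_max_memory` and `load_sharded` and the 20 GiB cap are hypothetical.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int) -> dict:
    """Build a max_memory mapping such as {0: "20GiB", 1: "20GiB"}."""
    if num_gpus < 1 or per_gpu_gib < 1:
        raise ValueError("num_gpus and per_gpu_gib must be positive")
    return {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}


def load_sharded(model_dir: str, num_gpus: int = 2, per_gpu_gib: int = 20):
    """Hypothetical helper: load Qwen2-VL sharded across the visible GPUs."""
    # Imported lazily so that build_max_memory() works without transformers.
    from transformers import Qwen2VLForConditionalGeneration

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_dir,
        torch_dtype="auto",
        device_map="auto",
        max_memory=build_max_memory(num_gpus, per_gpu_gib),
    )
    # Deliberately no model.to(...): the weights stay where device_map put them.
    return model.eval()
```

With a model loaded this way, the inputs can be moved with `inputs = inputs.to(model.device)` instead of a hard-coded `"cuda"`.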

Zara7hus7ra · 2024-12-10T07:22:28Z

same problem

czydfj · 2024-12-13T03:56:34Z

same problem

Stephen-K1 · 2024-12-18T07:05:00Z

With flash attention, 24 GB × 2 is enough.
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
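`attn_implementation="flash_attention_2"` needs the separate `flash_attn` package. As a rough sketch, the value can be chosen at runtime and fall back to `"sdpa"` when the package is missing, assuming the installed transformers version supports `"sdpa"` for this model. The helper name `pick_attn_implementation` is hypothetical.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

The result can then be passed as `attn_implementation=pick_attn_implementation()` in the `from_pretrained` call.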

sjghh · 2024-12-19T13:30:25Z

same problem

sjghh · 2024-12-19T13:55:34Z

> With flash attention, 24 GB × 2 is enough.
> model = Qwen2VLForConditionalGeneration.from_pretrained(
>     "Qwen/Qwen2-VL-7B-Instruct",
>     torch_dtype=torch.bfloat16,
>     attn_implementation="flash_attention_2",
>     device_map="auto",
> )

Can it use both GPUs?

xieyongshuai · 2024-12-20T01:38:02Z

> With flash attention, 24 GB × 2 is enough.
> model = Qwen2VLForConditionalGeneration.from_pretrained(
>     "Qwen/Qwen2-VL-7B-Instruct",
>     torch_dtype=torch.bfloat16,
>     attn_implementation="flash_attention_2",
>     device_map="auto",
> )

That only covers loading, doesn't it? During inference the second card does no work.

Stephen-K1 · 2024-12-20T05:27:04Z

> With flash attention, 24 GB × 2 is enough.
> model = Qwen2VLForConditionalGeneration.from_pretrained(
>     "Qwen/Qwen2-VL-7B-Instruct",
>     torch_dtype=torch.bfloat16,
>     attn_implementation="flash_attention_2",
>     device_map="auto",
> )
>
> Can it use both GPUs?

Yes.

Stephen-K1 · 2024-12-20T05:33:08Z

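> With flash attention, 24 GB × 2 is enough.
> model = Qwen2VLForConditionalGeneration.from_pretrained(
>     "Qwen/Qwen2-VL-7B-Instruct",
>     torch_dtype=torch.bfloat16,
>     attn_implementation="flash_attention_2",
>     device_map="auto",
> )
>
> That only covers loading, doesn't it? During inference the second card does no work.

It does work. With two cards, the model is split into two parts and each part is loaded onto one card, so per-card memory use drops by roughly half. On a single card it needs about 21 GB.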
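To check how the model was actually split, a model loaded with `device_map` exposes an `hf_device_map` attribute that maps module names to devices. The sketch below counts modules per device; the helper name `modules_per_device` is hypothetical, and the map in the usage note is made-up example data, not output from this issue.

```python
from collections import Counter


def modules_per_device(device_map: dict) -> dict:
    """Count how many entries of a device map landed on each device."""
    counts = Counter(str(device) for device in device_map.values())
    return dict(sorted(counts.items()))
```

For example, `modules_per_device({"visual": 0, "model.layers.0": 0, "model.layers.1": 1, "lm_head": 1})` reports two modules on each device. On a real model, call it as `modules_per_device(model.hf_device_map)`. If every module reports the same device, the second card holds no weights. `torch.cuda.memory_allocated(i)` can then be compared for each GPU index `i` during generation.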

xieyongshuai closed this as completed Dec 30, 2024
