
How to use bbox in SFT / bounding boxes in fine-tuning #105

Open
mokby opened this issue Sep 4, 2024 · 8 comments

@mokby

mokby commented Sep 4, 2024

While looking at the Qwen2-VL SFT data format, I noticed it differs quite a bit from Qwen-VL's format; in particular, no bounding-box annotation example is given. Also, is the 0-1000 coordinate normalization still required in Qwen2-VL? An answer would be appreciated, thanks.

@ShuaiBai623
Collaborator

#9

@mokby
Author

mokby commented Sep 6, 2024

#9

That issue doesn't answer my question; it doesn't mention how to create bounding-box annotations for an SFT dataset.

@kq-chen
Collaborator

kq-chen commented Sep 6, 2024

The grounding SFT data can be organized in the following format, where special tokens such as <|object_ref_start|>, <|object_ref_end|>, <|box_start|>, and <|box_end|> correspond to the <ref>, </ref>, <box>, and </box> tokens used in the Qwen-VL paper. Apart from the renaming of these special tokens, all other details are consistent with the implementation in Qwen-VL, including the normalization of coordinates.

# Caption with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Generate the caption with grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "In the image,<|object_ref_start|>a woman<|object_ref_end|><|box_start|>(462,371),(709,785)<|box_end|> is sitting on the beach with<|object_ref_start|> her dog<|object_ref_end|><|box_start|>(246,418),(586,875)<|box_end|>. The dog is wearing<|object_ref_start|> a harness<|object_ref_end|><|box_start|>(369,531),(496,704)<|box_end|> and appears to be giving<|object_ref_start|> its paw<|object_ref_end|><|box_start|>(508,512),(588,587)<|box_end|> to the woman in a high-five gesture. The scene is set during sunset, with the sun casting a warm glow over the ocean and the sand."},
],
# Visual Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Output the bounding box of the person in the image.",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>person<|object_ref_end|><|box_start|>(465,372),(710,789)<|box_end|><|im_end|>"},
],
# Grounded Captioning
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Describe the region <|box_start|>(462,371),(709,785)<|box_end|>.",
            },
        ],
    },
    {"role": "assistant", "content": "The region in the image shows a person sitting on the beach with their dog. The person is wearing a plaid shirt and dark pants, and they appear to be enjoying a moment with their dog. The dog is wearing a harness and is giving its paw to the person. The background features the ocean with waves gently crashing onto the shore. The lighting suggests it is either sunrise or sunset, creating a warm and serene atmosphere."},
],
# OCR with Grounding
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/QwenLM/Qwen-VL/blob/master/assets/demo_highfive.jpg?raw=true",
            },
            {
                "type": "text",
                "text": "Output OCR Results with Grounding:",
            },
        ],
    },
    {"role": "assistant", "content": "<|object_ref_start|>击掌<|object_ref_end|><|quad_start|>(502,500),(584,500),(584,558),(502,558)<|quad_end|>"},
]
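
For concreteness, a minimal sketch of that normalization (pixel coordinates scaled to the 0-1000 range relative to the original image width and height, as in Qwen-VL); the helper name format_box is illustrative, not part of any official API:

def format_box(label, box, width, height):
    """Format a pixel-space box (x1, y1, x2, y2) as a grounding span,
    normalizing coordinates to the 0-1000 range used in the examples above."""
    x1, y1, x2, y2 = box
    nx1, nx2 = round(x1 / width * 1000), round(x2 / width * 1000)
    ny1, ny2 = round(y1 / height * 1000), round(y2 / height * 1000)
    return (f"<|object_ref_start|>{label}<|object_ref_end|>"
            f"<|box_start|>({nx1},{ny1}),({nx2},{ny2})<|box_end|>")

# e.g. for a 640x480 image with a person at pixels (298, 179)-(454, 379):
# format_box("person", (298, 179, 454, 379), 640, 480)
# -> "<|object_ref_start|>person<|object_ref_end|><|box_start|>(466,373),(709,790)<|box_end|>"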

@mokby
Author

mokby commented Sep 9, 2024

It seems the format you provided isn't a sharegpt-format dataset and differs somewhat from the official one. Is this a custom dataset format, and can LLaMA-Factory support it? Do I need to modify the dataset_info.json in LLaMA-Factory?
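
For context, here is roughly what I imagine the conversion would look like, assuming LLaMA-Factory's sharegpt conventions (a "conversations" list of "from"/"value" turns plus an "images" list, registered through dataset_info.json); the dataset name and field mapping below are my guesses, so please correct me if they differ in the current version:

# Hedged sketch, not an official converter: recast one grounding sample from
# the message format above into a sharegpt-style record.
import json

sample = {
    "conversations": [
        {"from": "human", "value": "<image>Output the bounding box of the person in the image."},
        {"from": "gpt", "value": "<|object_ref_start|>person<|object_ref_end|><|box_start|>(465,372),(710,789)<|box_end|>"},
    ],
    # Local path or URL of the image referenced by the <image> placeholder.
    "images": ["demo.jpeg"],
}

with open("qwen2vl_grounding.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)

# A matching dataset_info.json entry would presumably look like:
# "qwen2vl_grounding": {
#     "file_name": "qwen2vl_grounding.json",
#     "formatting": "sharegpt",
#     "columns": {"messages": "conversations", "images": "images"}
# }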

@Guangming92

Could you share a contact method so we can discuss? WeChat/QQ: 695914059

@ShuaiBai623
Collaborator

Thank you for your support of Qwen2-VL. The main reason for the poor grounding performance after fine-tuning is that the 1D RoPE position embedding, rather than M-RoPE, was used during training. We will provide a solution as soon as possible, and we apologize for the inconvenience caused.
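
In the meantime, one rough way to check whether a training setup is affected, assuming the transformers implementation of Qwen2-VL (where Qwen2VLForConditionalGeneration exposes a get_rope_index helper; please verify the exact signature against your installed version), is to compute the 3D M-RoPE position ids explicitly rather than letting a generic trainer fall back to 1D positions:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Output the bounding box of the person in the image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")

# position_ids has shape (3, batch, seq_len): separate temporal, height, and
# width indices. A trainer that supplies its own 1D position_ids (or packs
# sequences without recomputing these) can silently degrade grounding quality.
position_ids, _ = model.get_rope_index(
    inputs["input_ids"],
    image_grid_thw=inputs.get("image_grid_thw"),
    attention_mask=inputs.get("attention_mask"),
)
outputs = model(**inputs, position_ids=position_ids)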

@mokby
Author

mokby commented Sep 14, 2024

Thanks for your work; we're looking forward to the update.

@zxjzzz

zxjzzz commented Dec 26, 2024

Thank you for your support of Qwen2-VL. The main reason for the poor grounding performance after fine-tuning is that the 1D RoPE position embedding, rather than M-RoPE, was used during training. We will provide a solution as soon as possible, and we apologize for the inconvenience caused.

Hello, does fine-tuning qwen-vl currently give good results on object detection tasks?
