Phantom of Latent for Large Language and Vision Models

πŸ“° News

  • Phantom-0.5B|1.8B|3.8B/7B has been released in πŸ€— Huggingface Models.
  • Preprint of Phantom has been uploaded in ArXiv.
  • Phantom Triples Dataset for DPO-like concept has been released in πŸ€— Huggingface Datasets.
  • The demo code of Phantom-0.5B|1.8B|3.8B|7B has been updated in this repository.
  • Online demo for Phantom-0.5B|1.8B|3.8B|7B has been released in πŸ€— Huggingface Spaces.
  • The code of fintuning Phantom-0.5B|1.8B|3.8B|7B will be soon updated in this repository.

Official PyTorch implementation code for realizing the technical part of Phantom of Latent, which improves numerous vision-language performances with an efficient model size. This code is developed from scratch, while the model architecture and all configurations are inspired by InternVL. I have been trying to improve the readability and simplicity of the code compared with LLaVA, which has a relatively complex code structure.


πŸ’‘ Highlights, Preview of the Paper

Figure1.

Figure2.

Figure3.

πŸšͺ How to run the local demo?

import torch
from config import *
from PIL import Image
from utils.utils import *
from model.load_model import load_model
from torchvision.transforms.functional import pil_to_tensor


# model selection
size = '7b' # [Select one] '0.5b' (needs a more recent transformers version) | '1.8b' | '3.8b' (needs transformers==4.37.2) | '7b'

# User prompt
prompt_type = "with_image" # [Select one] "text_only" | "with_image"
img_path='figures/demo.png'
question="Describe the image in detail"

# loading model
model, tokenizer = load_model(size=size)

# prompt type -> input prompt
if prompt_type == 'with_image':
    # Image Load
    image = pil_to_tensor(Image.open(img_path).convert("RGB"))
    inputs = [{'image': image, 'question': question}]
elif prompt_type == 'text_only':
    inputs = [{'question': question}]

# move model parameters from CPU to GPU
for param in model.parameters():
    if not param.is_cuda:
        param.data = param.cuda()

# Generate
with torch.inference_mode():

    # Model
    _inputs = model.eval_process(inputs=inputs,
                                data='demo',
                                tokenizer=tokenizer,
                                device='cuda:0')
    generate_ids = model.generate(**_inputs, do_sample=False, max_new_tokens=256)
answer = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(answer)

πŸ“‹ Gathered Dataset Description

Dataset Description (Total: 2852771, 2.8M)

------------------------------
* Real-World Image: 1218630, 1.2M
* Real-World Text: 143000, 143K
* Document & Chart & Diagram & Sign & Symbol: 743850, 744k
* Math: 747291, 747k
    - Math with Vision: 180497, 180k
    - Math with Text only: 566794, 566k
------------------------------

- ShareGPT4O-Images (57289, 57k)
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (405617, 405k)
- ALLAVA4V-Text (143000, 143k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- SMR [ArXivQA, TextbookQA] (116035, 116K)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)

πŸ”₯ Curated Phantom Dataset Description

Dataset Description (Total: 2040186, 2.0M)

--------------------------------------------
* Real-World Image: 871160, 871k
* Real-World Text: 102389, 102k
* Document & Chart & Diagram & Sign & Symbol: 529709, 529k
* Math: 536928, 536k
    - Math with Vision: 129694, 129k
    - Math with Text only: 407234, 407k
--------------------------------------------

- ShareGPT4O-Images (40106, 40k)
- ShareGPT4V-Caption [without SAM] (64925, 64k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (475669, 475k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (290460, 290k)
- ALLAVA4V-Text (102389, 102k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (19363, 19k)
- SMR [ArXivQA, TextbookQA] (82843, 82K)
- DocDownstream (409140, 409k)
- DocReason (18363, 18k)
- GLLaVA (127484, 127k)
- MathVision (2210, 2k)
- MathInstruct [TextOnlyDataset] (188288, 188k)
- MathPlus [TextOnlyDataset] (218946, 218k)
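
The category totals above are simply the sums of the per-dataset counts. Below is a minimal sanity-check sketch; the numbers are hard-coded from the two lists above and the snippet is not part of the repository code.

# Verify that the per-dataset counts add up to the reported totals.
# Counts are copied from the gathered and curated lists above.
gathered = {
    'ShareGPT4O-Images': 57289, 'ShareGPT4V-Caption': 91021,
    'ShareGPT4V-Instruction': 664703, 'ALLAVA4V-VFLAN': 405617,
    'ALLAVA4V-Text': 143000, 'MiniGemini-Instruction': 27670,
    'SMR': 116035, 'DocDownstream': 574268, 'DocReason': 25877,
    'GLLaVA-Align': 60252, 'GLLaVA-QA': 117205, 'MathVision': 3040,
    'MathInstruct': 262040, 'MathPlus': 304754,
}
curated = {
    'ShareGPT4O-Images': 40106, 'ShareGPT4V-Caption': 64925,
    'ShareGPT4V-Instruction': 475669, 'ALLAVA4V-VFLAN': 290460,
    'ALLAVA4V-Text': 102389, 'MiniGemini-Instruction': 19363,
    'SMR': 82843, 'DocDownstream': 409140, 'DocReason': 18363,
    'GLLaVA': 127484, 'MathVision': 2210,
    'MathInstruct': 188288, 'MathPlus': 218946,
}
assert sum(gathered.values()) == 2852771  # gathered total (2.8M)
assert sum(curated.values()) == 2040186   # curated total (2.0M)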

πŸš€ Download Training Datasets

We collect the datasets listed below. For MiniGemini, we selectively use data samples only from DocVQA, ChartQA, DVQA, and AI2D, so you do not need to download all of the MiniGemini data samples.

Gathered Dataset Layout

Phantom_Dataset_Path
β”œβ”€β”€ llava                                                       # ShareGPT4V
β”‚   └── llava_pretrain                  
β”‚       └── images                  
β”œβ”€β”€ coco                                                        # ShareGPT4V
β”‚   └── train2017                   
β”œβ”€β”€ sam                                                         # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ gqa                                                         # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ ocr_vqa                                                     # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ textvqa                                                     # ShareGPT4V
β”‚   └── train_images                    
β”œβ”€β”€ vg                                                          # ShareGPT4V
β”‚   β”œβ”€β”€ VG_100K                 
β”‚   └── VG_100K_2                   
β”œβ”€β”€ share_textvqa                                               # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ web-celebrity                                               # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ web-landmark                                                # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ wikiart                                                     # ShareGPT4V
β”‚   └── images                  
β”œβ”€β”€ docvqa                                                      # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ chartqa                                                     # MiniGemini
β”‚   └── train                   
β”‚       └── images                  
β”œβ”€β”€ dvqa                                                        # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ ai2d                                                        # MiniGemini
β”‚   └── images                  
β”œβ”€β”€ ALLaVA-4V                                                   # MiniGemini (ALLAVA-VFLAN)
β”‚   └── allava_vflan
β”‚       └── images
β”œβ”€β”€ arxivqa                                                     # SMR (ArXivQA)
β”‚   └── images
β”œβ”€β”€ TextbookQA                                                  # SMR (TextbookQA)
β”‚   β”œβ”€β”€ train
β”‚   └── val
β”œβ”€β”€ imgs                                                        # DocDownstream & DocReason
β”‚   β”œβ”€β”€ ChartQA
β”‚   β”œβ”€β”€ DUE_Benchmark
β”‚   β”‚   β”œβ”€β”€ DeepForm
β”‚   β”‚   β”œβ”€β”€ DocVQA
β”‚   β”‚   β”œβ”€β”€ InfographicsVQA
β”‚   β”‚   β”œβ”€β”€ KleisterCharity
β”‚   β”‚   β”œβ”€β”€ TabFact
β”‚   β”‚   └── WikiTableQuestions
β”‚   β”œβ”€β”€ TextCaps
β”‚   β”œβ”€β”€ TextVQA
β”‚   └── VisualMRC
β”œβ”€β”€ geo3k                                                       # GLLaVA
β”‚   └── train
β”œβ”€β”€ geoqa_plus                                                  # GLLaVA
β”œβ”€β”€ images                                                      # MathVision
β”‚
β”œβ”€β”€ sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
β”œβ”€β”€ sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
β”œβ”€β”€ Evol-Instruct-GPT4-Turbo-143K.json                          # ALLAVA4V-Text
β”œβ”€β”€ SMR.json                                                    # SMR
β”œβ”€β”€ train.jsonl                                                 # DocDownstream
β”œβ”€β”€ detailed_explanation.jsonl                                  # DocReason
β”œβ”€β”€ minigemini_pretrain.json                                    # MiniGemini-Pretrain
β”œβ”€β”€ minigemini_instruction.json                                 # MiniGemini-Instruction
β”œβ”€β”€ gllava_align.parquet                                        # GLLaVA-Align
β”œβ”€β”€ gllava_qa.parquet                                           # GLLaVA-QA
β”œβ”€β”€ mathvision.parquet                                          # MathVision
β”œβ”€β”€ MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
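
Before launching training, it can help to confirm that the layout above is in place. Below is a minimal sanity-check sketch, not part of the repository; the dataset_root path is a placeholder you should adjust, and the expected names are taken directly from the layout above.

import os

# Check that the expected top-level folders and annotation files exist
# under the dataset root before starting training.
dataset_root = '/path/to/Phantom_Dataset_Path'  # adjust to your environment

expected_dirs = [
    'llava', 'coco', 'sam', 'gqa', 'ocr_vqa', 'textvqa', 'vg',
    'share_textvqa', 'web-celebrity', 'web-landmark', 'wikiart',
    'docvqa', 'chartqa', 'dvqa', 'ai2d', 'ALLaVA-4V', 'arxivqa',
    'TextbookQA', 'imgs', 'geo3k', 'geoqa_plus', 'images',
]
expected_files = [
    'sharegpt4v_instruct_gpt4-vision_cap100k.json',
    'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json',
    'Evol-Instruct-GPT4-Turbo-143K.json', 'SMR.json', 'train.jsonl',
    'detailed_explanation.jsonl', 'minigemini_pretrain.json',
    'minigemini_instruction.json', 'gllava_align.parquet',
    'gllava_qa.parquet', 'mathvision.parquet', 'MathInstruct.json',
    'mathplus.parquet',
]

missing = [name for name in expected_dirs + expected_files
           if not os.path.exists(os.path.join(dataset_root, name))]
print('All expected entries found.' if not missing else f'Missing: {missing}')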

πŸ“‚ Evaluation Benchmarks

The evaluation datasets are listed below. After downloading them all, place each dataset according to the following directory layout.

Evaluation Dataset Directory Layout

Evaluation_Dataset_Path
β”œβ”€β”€ ScienceQA                       # SQA-IMG
β”œβ”€β”€ ai2d                            # AI2D
β”œβ”€β”€ chartqa                         # ChartQA
β”œβ”€β”€ SEED-Bench                      # SEED-IMG
β”œβ”€β”€ SEED-Bench-2-plus               # SEED-Bench-2-Plus
β”œβ”€β”€ POPE                            # POPE
β”œβ”€β”€ HallusionBench                  # HallusionBench
β”œβ”€β”€ MME_Benchmark_release_version   # MME
β”œβ”€β”€ MathVista                       # MathVista
β”œβ”€β”€ MMBench                         # MMB
β”œβ”€β”€ mm-vet                          # MM-Vet
β”œβ”€β”€ mm-vet-v2                       # MM-Vet-v2
β”œβ”€β”€ llava-bench-in-the-wild         # LLaVA Bench in the Wild
β”œβ”€β”€ LLaVA-Bench-Wilder              # LLaVA Wilder
β”œβ”€β”€ BLINK                           # BLINK
β”œβ”€β”€ CV-Bench                        # CV-Bench
β”œβ”€β”€ VisualWebBench                  # VisualWebBench
β”œβ”€β”€ MMStar                          # MMStar
└── MathVerse                       # MathVerse
