
CLIPCAM: Zero-shot Text-guided Object and Action Localization

Official implementation of CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization (ICASSP 2022)

Link to paper

[figure: CLIPCAM overview]

Table of Contents

  • Environment Setup
  • Quick Demo
  • Supported Models for CLIPCAM
  • CAM Variations
  • Dataset Preparation
  • Evaluation
  • Other features
  • Citing
  • Contact Us

Environment Setup

  1. create a conda environment with Python 3.7
    conda create -n clipcam python=3.7
    conda activate clipcam
  2. install PyTorch 1.9.0 and torchvision 0.10.0 built for a compatible CUDA version (or any compatible torch version)
    pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  3. install the required packages
    pip install -r requirements.txt

Quick Demo

Website Demo

Please go to this link for a quick demo.
[figure: demo screenshot]
P.S. First-time users: please follow the instructions at the top of the demo website to allow your browser to connect to our server.

Code Demo

python clipcam.py \
    --image_path "{single image path or grid image directory (4 images)}" \
    --sentence "{input sentence}" \
    --gpu_id 0 \
    --clip_model_name "ViT-B/16" \
    --cam_model_name "GradCAM"
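
In grid mode, --image_path points to a directory of four images that are tiled into a single 2x2 grid before localization. If you want to assemble such a grid yourself, here is a minimal PIL sketch (the make_grid helper and the 224-pixel cell size are our illustrative assumptions, not part of this repo):

    # Tile four images into the kind of 2x2 grid that grid mode consumes.
    from PIL import Image

    def make_grid(paths, cell=224):
        # Resize each image to a square cell and paste it into a 2x2 canvas.
        grid = Image.new("RGB", (2 * cell, 2 * cell))
        for i, p in enumerate(paths[:4]):
            img = Image.open(p).convert("RGB").resize((cell, cell))
            grid.paste(img, ((i % 2) * cell, (i // 2) * cell))
        return grid

    # e.g. make_grid(["a.jpg", "b.jpg", "c.jpg", "d.jpg"]).save("grid.jpg")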

Supported Models for CLIPCAM

CLIP Models (from OpenAI):

example: --clip_model_name ViT-B/16

  • ViT-B/16
  • ViT-B/32
  • RN50
  • RN101
  • RN50x4
  • RN50x16

ImageNet Pre-trained Models

example: --clip_model_name ViT-B/16-pretrained
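
The CLIP checkpoints above are loaded through OpenAI's clip package. As a rough sketch of the quantity CLIPCAM derives its heatmaps from, namely the image-text cosine similarity, consider the following (the image path and prompt are placeholders, and this is our illustration, not the repo's code):

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    text = clip.tokenize(["a photo of a dog"]).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)

    # Normalize and take the dot product; CAM methods backpropagate
    # from this similarity score into the visual backbone.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    print((image_feat @ text_feat.T).item())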

CAM Variations

CAMs for CLIP (CLIPCAMs) (from pytorch-grad-cam)

example: --cam_model_name GradCAM

  • GradCAM
  • GradCAMPlusPlus
  • XGradCAM
  • ScoreCAM
  • EigenCAM
  • EigenGradCAM
  • GuidedBackpropReLUModel
  • LayerCAM

CAMs for other models (from pytorch-grad-cam)

example: --cam_model_name GradCAM_original

  • GradCAM_original
  • GradCAMPlusPlus_original
  • XGradCAM_original
  • ScoreCAM_original
  • EigenGradCAM_original
  • EigenCAM_original
  • GuidedBackpropReLUModel_original
  • LayerCAM_original
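
Both lists wrap the pytorch-grad-cam library. As a rough sketch of that library's plain usage on an ordinary torchvision classifier (illustrative only; API details vary slightly across pytorch-grad-cam versions, e.g. older releases also take a use_cuda argument):

    import torch
    from torchvision.models import resnet50
    from pytorch_grad_cam import GradCAM
    from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

    model = resnet50(pretrained=True).eval()
    target_layers = [model.layer4[-1]]          # last conv block of ResNet-50
    input_tensor = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

    cam = GradCAM(model=model, target_layers=target_layers)
    heatmap = cam(input_tensor=input_tensor,
                  targets=[ClassifierOutputTarget(281)])  # 281 = ImageNet "tabby cat"
    print(heatmap.shape)  # (1, 224, 224), values in [0, 1]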

Dataset Preparation

  1. OpenImage V6
    Download the OpenImage V6 validation set with data_prep/openimage.py.
  2. HICO-DET
    Download HICO-DET from this link.
  3. ILSVRC (optional)
    Download the ILSVRC validation set.
  4. COSMOS (optional)
    Download the COSMOS validation set.

Evaluation

[figure: grid view vs. single view]

Grid-view Zero-shot Object Localization

[figures: grid localization and object localization examples]

  1. Dataset structure (OpenImage)
    |--OpenImage
        |--validation
            |--data
                |--{image_path_1}
                |--{image_path_2}
                |-- ...
            |--labels
                |--detections.csv
            |--metadata
                |--classes.csv
    
  2. Run evaluate_grid_openimage.py with any model selection
    python evaluate_grid_openimage.py \
        --data_dir Dataset/OpenImage/validation \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/grid/openimage/vitb32-grad' \
        --mask_threshold 0.2 \
        --sentence_prefix 'a photo of ' \
        --attack 'None' \
        --save_result 1
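
With --mask_threshold 0.2, the heatmap is binarized at the given threshold and, plausibly, the prediction is judged by which of the four grid cells the activated region falls into. A toy sketch of such a decision rule (the best_quadrant helper is our illustration and may differ from the repo's exact metric; it assumes a heatmap already normalized to [0, 1]):

    import numpy as np

    def best_quadrant(cam, threshold=0.2):
        # cam: (H, W) heatmap in [0, 1] computed over the 2x2 grid image.
        mask = cam >= threshold
        h, w = cam.shape
        quads = [(slice(0, h // 2), slice(0, w // 2)),   # 0: top-left
                 (slice(0, h // 2), slice(w // 2, w)),   # 1: top-right
                 (slice(h // 2, h), slice(0, w // 2)),   # 2: bottom-left
                 (slice(h // 2, h), slice(w // 2, w))]   # 3: bottom-right
        return int(np.argmax([mask[r, c].sum() for r, c in quads]))

    cam = np.zeros((224, 224)); cam[:112, 112:] = 0.9
    print(best_quadrant(cam))  # 1, i.e. the top-right cell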
    

Grid-view Zero-shot Action Localization

[figure: grid-view action localization examples]

  1. Dataset structure (HICO-DET)

    |--HICO-DET
        |--images
            |--test
                |--{image_path_1}
                |--{image_path_2}
                |-- ...
            |--train
                |--{image_path_1}
                |--{image_path_2}
                |-- ...
        |--anno.mat
        |--anno_bbox.mat
    
  2. Run verb_grid.py with an ImageNet pre-trained model
    Train the model on half of the classes in HICO-DET,
    or download the fine-tuned checkpoints from this OneDrive.
    --train_mode ('full', 'few' or 'half') specifies which classes of the HICO-DET dataset are loaded.

    python verb_grid.py \
        --data_dir datasets/hico-det \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32-pretrained' \
        --cam_model_name 'GradCAM_original' \
        --save_dir 'eval_result/grid/hicodet/vitb32-pretrained-grad' \
        --mask_threshold 0.2 \
        --train_mode 'half' \
        --model_name checkpoints/models/vitb32-pretrained-half-1e-6.pth \
        --save_result 1
    
  3. Run verb_grid.py for CLIPCAM

    python verb_grid.py \
        --data_dir dataset/hico-det \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/grid/hicodet/vitb32-grad' \
        --mask_threshold 0.2 \
        --save_result 1
    

Single-image Zero-shot Object Localization

  1. OpenImage
    a. Run evaluate_openimage.py

    python evaluate_openimage.py \
        --data_dir datasets/OpenImage/validation \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/single/openimage/vitb32-grad' \
        --save_result 1 \
        --sentence_prefix 'a photo of ' \
        --distill_num 0 \
        --attack 'None'
    
  2. ILSVRC
    [figure: ILSVRC localization example]
    a. Dataset Structure

    |--ImageNet
        |--validation
            |--{label_1}
                |--{image_path_1}
                |--{image_path_2}
                |-- ...
            |--{label_2}
            |-- ...
        |--bbox
            |--val
                |--{image_path_1}.xml
                |--{image_path_2}.xml
                |-- ...
    

    b. Run evaluate_imagenet.py

    python evaluate_imagenet.py \
        --data_dir dataset/ImageNet/validation \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/single/imagenet/vitb32-grad' \
        --batch 128 \
        --save_result 1 \
        --sentence_prefix 'sentence' \
        --attack 'None'
    
  3. COSMOS, OpenImage and custom images
    [figure: COSMOS localization example]
    a. Run evaluate.py with --dataset cosmos or --dataset openimage

    python evaluate.py \
        --data_dir datasets/COSMOS/val \
        --gpu_id 0 \
        --dataset cosmos \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/single/cosmos/vitb32-grad' \
        --distill_num 0 \
        --attack 'None'
    
  4. Test on images with custom guiding text
    a. Put the images in a folder.
    b. Run evaluate.py without specifying --dataset.

    python evaluate.py \
        --data_dir {path to folder} \
        --gpu_id 0 \
        --clip_model_name 'ViT-B/32' \
        --cam_model_name 'GradCAM' \
        --save_dir 'eval_result/custom-input-vitb32-grad'  \
        --distill_num 0
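
Across these single-image evaluations, localization quality comes down to turning a heatmap into a box that can be compared against ground truth. A minimal sketch of one such conversion (a tight box over thresholded pixels is our assumption; the repo's exact rule may differ):

    import numpy as np

    def cam_to_bbox(cam, threshold=0.2):
        # cam: (H, W) heatmap in [0, 1]; returns (x_min, y_min, x_max, y_max).
        ys, xs = np.where(cam >= threshold)
        if len(xs) == 0:
            return None  # nothing activated above the threshold
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())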
    

Other features

Iterative Mask

We propose an iterative refinement method that masks out areas of high neural importance in order to expand the attention to, or enhance, weak response regions.
[figure: iterative mask examples]
Set --distill_num {n} to iteratively mask out the image {n} times.
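
A toy sketch of that loop, assuming a hypothetical compute_cam(image, text) helper that returns an (H, W) heatmap in [0, 1] (the 0.8 cutoff and the max-combination below are our illustrative choices, not the repo's exact procedure):

    import numpy as np

    def iterative_cam(image, text, compute_cam, n=2, cutoff=0.8):
        # Accumulate heatmaps while masking out the most salient pixels each round.
        combined = np.zeros(image.shape[:2])
        masked = image.copy()
        for _ in range(n + 1):  # one plain pass plus n masked re-runs
            cam = compute_cam(masked, text)
            combined = np.maximum(combined, cam)
            masked = masked.copy()
            masked[cam >= cutoff] = 0  # black out high-importance pixels
        return combined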

Weather Attacks

We evaluated the ability of CLIPCAM to handle attacked (weather-corrupted) images.
[figure: weather attack examples]
Set --attack fog or --attack snow to apply a fog or snow attack.
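
As a rough illustration of what such a corruption can look like, here is a simplistic fog-style white blend (our own toy example, not the repo's actual fog/snow implementation):

    import numpy as np

    def add_fog(image, strength=0.5):
        # image: float array in [0, 1] of shape (H, W, 3).
        h = image.shape[0]
        # Vertical fog gradient, denser toward the top of the frame.
        alpha = strength * np.linspace(1.0, 0.3, h)[:, None, None]
        return (1 - alpha) * image + alpha  # blend toward white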

Citing

If you find the paper or the code useful for your study, please consider citing the CLIPCAM paper:

@inproceedings{clipcam_hsia_icassp2022,
    author = {Hsia, Hsuan-An and Lin, Che-Hsien and Kung, Bo-Han and Chen, Jhao-Ting and Tan, Daniel Stanley and Chen, Jun-Cheng and Hua, Kai-Lung},
    title = "{CLIPCAM: A Simple Baseline for Zero-shot Text-guided Object and Action Localization}",
    booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year = {2022}
}

Contact Us

If you have questions regarding the paper or code, please open an issue or email us: Jhao-Ting Chen or Che-Hsien Lin. We will get back to you as soon as possible.
