Substantially more parameters than OpenCLIP, SWAG and Timm's ViT models. #7

Closed
yash0307 opened this issue May 14, 2023 · 5 comments

Comments

@yash0307

Hello,

Many thanks for sharing your interesting work. I noticed that the projection head of your models is substantially bigger than those of SWAG (Singh et al., CVPR 2022), the OpenCLIP models, and Timm's implementation of ViT used in the Recall@k Surrogate (Patel et al., CVPR 2022). I ran a quick parameter count for these models following the RS@k implementation, that is, with a layer norm and a linear projection. Here are the counts:

ViT-B/32 Timm: 87850496
ViT-B/32 CLIP: 87849728
ViT-B/32 UNICOM: 117118464
ViT-B/16 Timm: 86193920
ViT-B/16 CLIP: 86193152
ViT-B/16 UNICOM: 202363136
ViT-B/16 SWAG: 86193920

It is clear that the UNICOM models have a substantially higher number of parameters than the baselines used for comparison. With this in mind, are the comparisons fair at all?
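
To put a number on the gap, the counts listed in this issue can be turned into ratios. A small sketch using only the figures reported above; the helper name `unicom_ratio` is made up for illustration.

```python
# Parameter counts exactly as reported in this issue.
counts = {
    "ViT-B/32": {"Timm": 87850496, "CLIP": 87849728, "UNICOM": 117118464},
    "ViT-B/16": {"Timm": 86193920, "CLIP": 86193152, "SWAG": 86193920, "UNICOM": 202363136},
}


def unicom_ratio(arch, baseline):
    """Ratio of UNICOM parameters to a baseline's parameters for one architecture."""
    return counts[arch]["UNICOM"] / counts[arch][baseline]


for arch, models in counts.items():
    for name in models:
        if name != "UNICOM":
            print(f"{arch}: UNICOM / {name} = {unicom_ratio(arch, name):.2f}")
```

With these numbers the UNICOM models come out at roughly 1.33 times the baselines for ViT-B/32 and roughly 2.35 times for ViT-B/16.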

@anxiangsir
Collaborator

Greetings, thank you for showing interest in our research work.

  1. The projection head structure used in our ViT models follows the ArcFace implementation:
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/fresnet.py#L1101
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/symbol_utils.py#L78
  2. We will shortly update the experiment results on GitHub with a projection head structure similar to CLIP's.

@yash0307
Author

Thank you for the prompt reply. Looking forward to new results.

@anxiangsir
Collaborator

anxiangsir commented Jul 3, 2023

This is the performance of the model we trained using the same ViT architecture as CLIP.

Results

        cub   car   sop   inshop  inat
unicom  83.7  95.9  70.0  72.8    64.6
new     83.4  95.5  71.0  75.0    64.9

Model

This is the model file:
https://drive.google.com/file/d/1dSrWAmoPqr8d9oB1wggnZfgHnBdre2wa/view?usp=sharing

Usage

You can use it like this:

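import clip
import torch

# Build the CLIP ViT-B/32 architecture on CPU and keep only the image encoder.
model, transform = clip.load("ViT-B/32", device="cpu")
model = model.visual
# Load the released weights; strict=True raises an error on any key mismatch.
state_dict = torch.load("ViT-B-32.pt", map_location="cpu")
model.load_state_dict(state_dict, strict=True)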

@yaojunr

yaojunr commented Jul 11, 2023

Hi, the model file cannot be downloaded because of its access permissions. Could you open up access? Thank you very much.

@anxiangsir
Collaborator

We have updated it.
