Substantially more parameters than OpenCLIP, SWAG and Timm's ViT models. #7
Greetings, thank you for showing interest in our research work.
Thank you for the prompt reply. Looking forward to new results.
This is the performance of the model we trained using the same ViT architecture as CLIP.

Results

Model
This is the model file:

Usage
You can use it like this:

import torch
import clip

# Load the standard CLIP ViT-B/32 architecture and keep only the visual encoder.
model, transform = clip.load("ViT-B/32", "cpu")
model = model.visual

# Load the released ViT-B/32 weights into the visual encoder.
state_dict = torch.load("ViT-B-32.pt", "cpu")
model.load_state_dict(state_dict, strict=True)
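For illustration, a minimal sketch of extracting an embedding with the visual encoder loaded above, assuming the Pillow package; "example.jpg" is a placeholder path, not a file from this repository:

import torch
from PIL import Image

# "example.jpg" is a placeholder; substitute any RGB image.
image = transform(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    embedding = model(image)  # shape (1, 512) for ViT-B/32
    # L2-normalize so cosine similarity can be used for retrieval.
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)

print(embedding.shape)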
Hi, the model file has no permission to download; could you open up the permissions? Thank you very much.
We have updated it.
Hello,
Many thanks for sharing your interesting work. I noticed that the projection head of your models is substantially bigger than those in SWAG (Singh et al., CVPR 2022), the OpenCLIP models, and Timm's implementation of ViT used in Recall@k Surrogate (Patel et al., CVPR 2022). I ran a quick parameter counter for these models following the RS@k implementation, that is, with a layer norm and linear projection. Here are the counts:
It is clear that the UNICOM model has a substantially higher number of parameters than the baselines used for the comparison. With this in mind, are the comparisons fair at all?
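For reference, a minimal sketch of the kind of count I mean, assuming the clip and timm packages; the timm model name vit_base_patch32_224 and the 512-dimensional output are my own assumptions rather than values taken from the RS@k code:

import torch
import clip
import timm

# CLIP-style visual encoder: the projection head is part of model.visual.
clip_model, _ = clip.load("ViT-B/32", "cpu")
n_clip = sum(p.numel() for p in clip_model.visual.parameters())

# Timm ViT backbone plus an RS@k-style head: LayerNorm followed by a Linear projection.
backbone = timm.create_model("vit_base_patch32_224", pretrained=False, num_classes=0)
head = torch.nn.Sequential(
    torch.nn.LayerNorm(backbone.num_features),
    torch.nn.Linear(backbone.num_features, 512),  # 512 is an assumed embedding size
)
n_timm = sum(p.numel() for p in backbone.parameters()) + sum(p.numel() for p in head.parameters())

print(f"CLIP ViT-B/32 visual encoder: {n_clip:,} parameters")
print(f"timm ViT-B/32 + LayerNorm + Linear: {n_timm:,} parameters")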