The NeMo Vision Collection supports the multimodal collection, particularly models such as LLaVA that require a vision encoder. The vision collection currently supports ViT, implemented as a customized version of the transformer model from Megatron Core.
The documentation describes each supported model and explains how to integrate and use it in your projects.
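
To make the role of a ViT-style vision encoder more concrete, the sketch below shows in plain Python how an image is cut into fixed-size patches, each of which becomes one input token for the transformer. It is only an illustration of the general ViT idea and does not use NeMo or Megatron Core. The function names `num_patch_tokens` and `patchify` are hypothetical, and the image and patch sizes are example values, not NeMo defaults.

```python
# Illustrative sketch only: plain Python, not NeMo or Megatron Core code.
# Function names and sizes are assumptions chosen for the example.

def num_patch_tokens(image_size: int, patch_size: int, class_token: bool = False) -> int:
    """Return the number of tokens a ViT-style encoder sees for a square image."""
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token.
    return patches_per_side ** 2 + (1 if class_token else 0)


def patchify(image: list[list[int]], patch_size: int) -> list[list[int]]:
    """Split a 2D image (list of rows) into flattened patches, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image split into 2x2 patches yields 4 patches of 4 values each.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
```

In a multimodal model, the encoder's output for these patch tokens is what gets passed on to the language model, so the patch size and image resolution directly determine how many visual tokens the language model has to process.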