Multimodal
Resources: Multimodal RAG: Chat with Videos
- You can check out the code for Multimodal Embeddings, Multimodal Preprocessing, Multimodal Retrieval from Vector Stores, and Large Vision-Language Models (LVLMs).
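As a rough illustration of the retrieval step, here is a minimal sketch of similarity search over an in-memory store. It assumes each video frame and caption has already been embedded into a shared vector space by a cross-modal encoder. The three-dimensional vectors, the `ToyVectorStore` class, and the frame names are all hypothetical stand-ins, not the course's code or a real vector database API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (embedding, metadata) pairs."""

    def __init__(self):
        self._items = []

    def add(self, embedding, metadata):
        self._items.append((list(embedding), metadata))

    def search(self, query_embedding, k=1):
        # Score every stored item against the query, then keep the top k.
        scored = [
            (cosine_similarity(query_embedding, emb), meta)
            for emb, meta in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up embeddings standing in for the output of a multimodal encoder.
store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], {"frame": "frame_001.jpg", "caption": "a person kayaking"})
store.add([0.1, 0.8, 0.2], {"frame": "frame_002.jpg", "caption": "a city skyline at night"})
store.add([0.0, 0.2, 0.9], {"frame": "frame_003.jpg", "caption": "a dog catching a ball"})

# A made-up query embedding, e.g. for the text "someone paddling on a river".
query = [0.85, 0.15, 0.05]
top = store.search(query, k=1)
print(top[0][1]["frame"])
```

In a real pipeline the retrieved frame and its caption would then be passed to an LVLM as context for answering the user's question.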
Cross-Modal Encoder: BridgeTower
LVLM
Resources: LMM prompting with Gemini
The Gemini family consists of:
- Ultra: largest model for highly complex tasks
- Pro: best model for general performance across a wide range of tasks
- Flash: lightweight model, optimized for speed and efficiency
- Nano: most efficient model for on-device tasks (Model Distillation)