Video-LLaMA’s IT
- Paper: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- GitHub Link
- Publisher:
EMNLP 2023
- Author Affiliation:
Alibaba Group
- Type
  - SFT: ✔
  - RLHF: ✖
- Multi-turn
✔
- Input Modalities $\rightarrow$ Output Modalities
(I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
I/V+T $\rightarrow$ T
- Source
MiniGPT-4, LLaVA, and VideoChat’s IT
- Method
Auto.
- I/V/A Scale
- I
81K
- V
8K
- A
Not reported
- Dialog Turn
2.22
- Instance Scale
171K
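As a side note, the "Dialog Turn: 2.22" statistic above is an average over the dataset's conversations. A minimal sketch of how such a statistic could be computed is below; the record schema (`"conversations"` as an alternating human/assistant message list) is an assumption for illustration, not Video-LLaMA's actual data format.

```python
# Hypothetical sketch: computing an average dialog-turn statistic
# (like the 2.22 reported above) from instruction-tuning records.
# The record schema here is an assumption, not Video-LLaMA's actual format.

def average_dialog_turns(records):
    """Mean number of (human, assistant) turn pairs per conversation."""
    if not records:
        return 0.0
    total_turns = sum(len(r["conversations"]) // 2 for r in records)
    return total_turns / len(records)

# Toy data: two single-turn dialogs and one three-turn dialog.
toy = [
    {"conversations": [{"from": "human"}, {"from": "gpt"}]},
    {"conversations": [{"from": "human"}, {"from": "gpt"}]},
    {"conversations": [{"from": "human"}, {"from": "gpt"}] * 3},
]

print(average_dialog_turns(toy))  # 5 turn pairs / 3 dialogs ≈ 1.67
```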
This post is licensed under CC BY 4.0 by the author.