Video-LLaMA’s IT
- Paper: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- GitHub Link
- Publisher:
EMNLP 2023
- Author Affiliation:
Alibaba Group
- Type
  - SFT: ✔
  - RLHF: ✖
- Multi-turn
✔
- Input Modalities $\rightarrow$ Output Modalities
(I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
I/V+T $\rightarrow$ T
- Source
MiniGPT-4, LLaVA, and VideoChat’s IT
- Method
Auto.
- I/V/A Scale
- I
81K
- V
8K
- A
Not reported
- Dialog Turn
2.22
- Instance Scale
171K
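As a side note, the "Dialog Turn: 2.22" statistic above is an average over the dataset's conversations. A minimal sketch of how such a statistic could be computed is below; the record schema (`"conversations"` as an alternating human/assistant message list) is an assumption for illustration, not Video-LLaMA's actual data format.

```python
# Hypothetical sketch: computing an average dialog-turn statistic
# (like the 2.22 reported above) from instruction-tuning records.
# The record schema here is an assumption, not Video-LLaMA's actual format.

def average_dialog_turns(records):
    """Mean number of (human, assistant) turn pairs per conversation."""
    if not records:
        return 0.0
    total_turns = sum(len(r["conversations"]) // 2 for r in records)
    return total_turns / len(records)

# Toy data: two single-turn dialogs and one three-turn dialog.
toy = [
    {"conversations": [{"from": "human"}, {"from": "gpt"}]},
    {"conversations": [{"from": "human"}, {"from": "gpt"}]},
    {"conversations": [{"from": "human"}, {"from": "gpt"}] * 3},
]

print(average_dialog_turns(toy))  # 5 turn pairs / 3 dialogs ≈ 1.67
```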
This post is licensed under CC BY 4.0 by the author.