MosIT
- Paper: NExT-GPT: Any-to-Any Multimodal LLM
- GitHub Link
- Publisher:
ICML 2024
- Author Affiliation:
National University of Singapore
- Type
- SFT
- RLHF
- Multi-turn
- ✔
- ✖
- Input Modalities $\rightarrow$ Output Modalities
(I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
- I+V+A+T $\rightarrow$ I+V+A+T
- Source
YouTube, Google, Flickr30k, Midjourney, etc.
- Method
Automatic + manual
- I/V/A Scale
- I: 4K
- V: 4K
- A: 4K
- Dialog Turn
4.8
- Instance Scale
5K
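
The modality notation used in this entry can be expanded mechanically. The snippet below is a minimal sketch of that expansion: the legend is copied from the entry above, while the `MODALITIES` table and the `parse_modalities` helper are hypothetical names introduced only for illustration and are not part of MosIT or NExT-GPT.

```python
# Legend copied from the entry above. The names MODALITIES and
# parse_modalities are hypothetical helpers, not part of any released tooling.
MODALITIES = {
    "I": "Image",
    "V": "Video",
    "A": "Audio",
    "3D": "Point Cloud",
    "T": "Text",
    "B": "Bounding box",
    "Tab": "Table",
    "Web": "Web page",
}


def parse_modalities(spec: str) -> tuple[list[str], list[str]]:
    """Split a spec such as 'I+V+A+T -> I+V+A+T' into input and output names."""
    # Accept the LaTeX arrow used in the post, a Unicode arrow, or plain '->'.
    normalized = spec.replace("$\\rightarrow$", "->").replace("→", "->")
    left, right = normalized.split("->")

    def expand(side: str) -> list[str]:
        # Each side is a '+'-separated list of legend codes.
        return [MODALITIES[code.strip()] for code in side.split("+")]

    return expand(left), expand(right)


inputs, outputs = parse_modalities("I+V+A+T $\\rightarrow$ I+V+A+T")
print("Inputs: ", inputs)
print("Outputs:", outputs)
```

For MosIT both sides expand to the same four modalities, which matches the any-to-any framing in the paper title. An unknown code raises a `KeyError`, which seems preferable to silently skipping it in a sketch like this.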
This post is licensed under CC BY 4.0 by the author.