
MosIT (Modality-switching Instruction Tuning)

  • Paper: NExT-GPT: Any-to-Any Multimodal LLM
  • GitHub Link
  • Publisher: ICML 2024
  • Author Affiliation: National University of Singapore
  • Type
    • SFT
    • RLHF
  • Multi-turn
  • Input Modalities $\rightarrow$ Output Modalities
    (I: Image, V: Video, A: Audio, 3D: Point Cloud, T: Text, B: Bounding box, Tab: Table, Web: Web page)
    • I+V+A+T $\rightarrow$ I+V+A+T (an illustrative record layout is sketched after this list)
  • Source: YouTube, Google, Flickr30k, Midjourney, etc.
  • Method: Automatic + Manual
  • I/V/A Scale: 4K / 4K / 4K
  • Dialog Turn: 4.8 (avg.)
  • Instance Scale: 5K
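
The entries above describe multi-turn dialogues in which both sides of a turn can mix text with images, video, and audio (the I+V+A+T $\rightarrow$ I+V+A+T mapping, roughly 4.8 turns per dialogue). To make that concrete, below is a minimal sketch of what a single record could look like. It is an assumption for illustration only: the field names (`dialog`, `role`, `text`, `media`) and file paths are hypothetical, not the format actually released with NExT-GPT.

```python
# A minimal, hypothetical sketch of one MosIT-style record, assuming a
# simple JSON-like layout. Field names ("dialog", "role", "text", "media")
# and paths are illustrative; they are not the schema released with NExT-GPT.
from typing import List, TypedDict


class Media(TypedDict):
    kind: str   # "image" | "video" | "audio"
    path: str   # reference to the media file


class Turn(TypedDict):
    role: str           # "human" or "assistant"
    text: str           # textual part of the turn
    media: List[Media]  # zero or more attached modalities


class Instance(TypedDict):
    dialog: List[Turn]  # multi-turn; ~4.8 turns on average per the stats above


example: Instance = {
    "dialog": [
        {
            "role": "human",
            "text": "Here is a clip of my dog at the beach. Can you make "
                    "a picture of him wearing sunglasses?",
            "media": [{"kind": "video", "path": "videos/0001.mp4"}],
        },
        {
            "role": "assistant",
            "text": "Sure, here is your dog wearing sunglasses.",
            "media": [{"kind": "image", "path": "images/0001.png"}],
        },
    ]
}

print(len(example["dialog"]), "turns")
```

The typed dictionaries only make the assumed structure explicit; the actual dataset may store media as URLs, IDs, or inline captions rather than local paths.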