GIT is a decoder-only Transformer that uses CLIP's vision encoder to condition the model on visual inputs in addition to text. The model obtains state-of-the-art ...
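One way to picture how a decoder-only model can be conditioned on image features is as an attention mask over a sequence of image tokens followed by text tokens. The sketch below is illustrative only: the function name `build_prefix_causal_mask` and the token counts are hypothetical, and the masking scheme (image tokens fully visible, text tokens causal) is an assumption about a common design for this kind of model, not something stated in the description above.

```python
def build_prefix_causal_mask(num_image_tokens, num_text_tokens):
    """Return a square 0/1 mask; mask[i][j] == 1 means position i may attend to j.

    Assumed layout (hypothetical): image tokens come first, text tokens after.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Image columns are visible to every position (image and text).
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # Text columns are visible only to text positions at or after j.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Two image tokens followed by three text tokens.
    for row in build_prefix_causal_mask(2, 3):
        print(row)
```

In this sketch, image positions see each other but never the text, while each text position sees all image positions plus the text up to and including itself, which is what lets generation proceed left to right while staying conditioned on the image.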