We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream ...
確定! 回上一頁