LiT models learn to match text to an already pre-trained image encoder. This simple yet effective setup provides the best of both worlds: strong ...
確定! 回上一頁