This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We ...