Figure 1. We learn a joint video-text embedding space from instructional videos and accompanying action-adverb pairs in the narration. Within this ...