ResNet-50 and EfficientNet-B5 in our image system. In model-level fusion, the extracted audio and image embeddings are concatenated.
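As a minimal sketch of the concatenation step in model-level fusion, the Python below joins one audio embedding and one image embedding into a single vector. The function names `fuse_embeddings` and `fuse_batch` and the toy dimensions are hypothetical and chosen only for illustration; the actual embedding sizes and any layers applied after the concatenation are not stated here.

```python
from typing import List, Sequence


def fuse_embeddings(audio_emb: Sequence[float],
                    image_emb: Sequence[float]) -> List[float]:
    """Concatenate one audio embedding and one image embedding.

    The fused vector has length len(audio_emb) + len(image_emb),
    with the audio features first and the image features second.
    """
    return list(audio_emb) + list(image_emb)


def fuse_batch(audio_batch: Sequence[Sequence[float]],
               image_batch: Sequence[Sequence[float]]) -> List[List[float]]:
    """Fuse paired audio and image embeddings, one pair per sample."""
    if len(audio_batch) != len(image_batch):
        raise ValueError("audio and image batches must have the same size")
    return [fuse_embeddings(a, v) for a, v in zip(audio_batch, image_batch)]


# Toy example with hypothetical sizes: a 3-dim audio embedding and a
# 2-dim image embedding (real embeddings would be much larger).
audio = [0.1, 0.2, 0.3]
image = [1.0, 2.0]
fused = fuse_embeddings(audio, image)
```

Concatenation keeps both modalities' features intact and leaves it to whatever layers follow to learn how to weight them.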