The generator takes each audio segment, represented as a self-supervised embedding, and predicts a phoneme corresponding to a sound in the language.
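To make the mapping concrete, here is a minimal sketch of that step: each segment embedding is projected to scores over a phoneme inventory, and the highest-scoring phoneme is taken as the prediction. Everything specific here is an assumption for illustration, not the actual model: the five-phoneme inventory `PHONEMES`, the embedding size of 8, the single linear projection, the random untrained weights, and the helper names `make_generator` and `generate`.

```python
import math
import random

# Hypothetical toy phoneme inventory; a real system uses the language's full set.
PHONEMES = ["AH", "B", "K", "S", "T"]


def make_generator(embed_dim, num_phonemes, seed=0):
    """Return untrained projection weights, shaped [num_phonemes][embed_dim]."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
            for _ in range(num_phonemes)]


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(weights, segments):
    """Map each segment embedding to (predicted phoneme, phoneme probabilities)."""
    outputs = []
    for segment in segments:
        # One score per phoneme: dot product of that phoneme's weights and the embedding.
        logits = [sum(w * x for w, x in zip(row, segment)) for row in weights]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        outputs.append((PHONEMES[best], probs))
    return outputs


# Stand-in embeddings: in practice these come from a self-supervised speech model.
rng = random.Random(1)
segments = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

weights = make_generator(embed_dim=8, num_phonemes=len(PHONEMES))
predicted = generate(weights, segments)
print([phoneme for phoneme, _ in predicted])
```

With untrained weights the predicted phonemes are arbitrary; the sketch only shows the shape of the computation, one phoneme distribution per audio segment.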