The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of ...
確定! 回上一頁