to the advantage actor-critic algorithm (A2C), allowing off-policy ... Accumulate gradients dθ ← dθ + ¯ρi∇θ log F(ai|si){R −. V (si; θv)}.
確定! 回上一頁