Q-learning - reward
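I am struggling to interpret the pseudocode for the Q-learning algorithm, in particular which state the reward should be computed from: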
For each s, a initialize table entry Q(a, s) = 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s′ ← δ(a, s)
    Update the table entry for Q(a, s) as follows:
        Q(a, s) ← R(s) + γ * max Q(a′, s′)
    s ← s′
Should the reward be collected from the successor state s′ or from the current state s?
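To make the question concrete, here is a minimal Python sketch of the loop above under one possible reading: the value used in the update is the immediate reward r received after executing a in s, so it belongs to the transition (s, a, s′) rather than to a lookup R(s) on the state the agent started from. Everything about the environment is an assumption made for illustration: the four-state chain, the `step` function, the reward of 1 for entering the last state, and the reset to state 0 are all hypothetical, and "Do forever" is replaced by a fixed number of steps.

```python
import random

N_STATES = 4          # hypothetical chain: states 0..3, state 3 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (gamma in the pseudocode)


def step(s, a):
    """Hypothetical deterministic environment: returns (r, s_next)."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    # Assumption: the reward is paid for the transition that enters the goal.
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next


def q_learning(n_steps=2000, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry to 0 (indexed here as Q[s][a]).
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    s = 0  # observe the current state
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)       # select an action (purely random here)
        r, s_next = step(s, a)        # execute it, receive r, observe s'
        # Deterministic update: the r from this very step is what gets used.
        Q[s][a] = r + GAMMA * max(Q[s_next])
        # Assumption: restart at state 0 once the goal is reached.
        s = 0 if s_next == N_STATES - 1 else s_next
    return Q


if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, row)
```

In this sketch the entry for moving right from state 2 ends up at 1.0, because the reward for entering the goal is credited to the state-action pair that caused the transition, and the goal state's own entries are never updated. If the reward were instead looked up from the current state s before the move, that credit would land on a different table entry, which is exactly the ambiguity the question is about.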