Q-learning - reward

I am struggling to interpret the pseudocode for the Q-learning algorithm:

For each s, a initialize table entry Q(a, s) = 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s′ ← δ(a, s)
    Update the table entry for Q(a, s) as follows:
        Q(a, s) ← R(s) + γ * max Q(a′, s′)
    s ← s′

Should the reward be collected from the successor state s′ or from the current state s?

Source

2014-04-02 OccamsMan

The reward should be collected from the successor state s′ that you enter after executing action a.
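To make that concrete, here is a minimal sketch of the deterministic update from the pseudocode, with the reward read off the state that was just entered. The toy world (four states on a line, goal at the right end), the names `delta`, `reward` and `q_learning`, the random action choice, and the treatment of the goal as terminal are all assumptions made for illustration; none of them come from the question.

```python
import random

N_STATES = 4          # hypothetical toy world: states 0..3 on a line, 3 is the goal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor

def delta(a, s):
    """Deterministic transition: move by a and clamp to the ends of the line."""
    return min(max(s + a, 0), N_STATES - 1)

def reward(s_next):
    """Reward is tied to the state that was just entered (assumed: 1 at the goal)."""
    return 1.0 if s_next == N_STATES - 1 else 0.0

def q_learning(episodes=200, max_steps=20, seed=0):
    rng = random.Random(seed)
    # Table indexed as Q(a, s), matching the pseudocode.
    Q = {(a, s): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)       # purely random exploration
            s_next = delta(a, s)          # execute a, observe s'
            r = reward(s_next)            # reward collected on entering s'
            terminal = s_next == N_STATES - 1
            # Assumption: nothing further is earned once the goal is reached.
            best_next = 0.0 if terminal else max(Q[(a2, s_next)] for a2 in ACTIONS)
            Q[(a, s)] = r + GAMMA * best_next
            if terminal:
                break
            s = s_next
    return Q

Q = q_learning()
for s in range(N_STATES - 1):
    print(s, {a: round(Q[(a, s)], 3) for a in ACTIONS})
```

In this sketch the only action that ever receives a non-zero immediate reward is "move right from state 2", because that is the step that enters the goal. The other entries get their values only through the discounted max term.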

Source

2014-04-02 08:20:57 jorgenkg
